Association Mining

For this association mining exercise, the 'Apriori' algorithm is used to derive the associations. In order to keep the number of association rules to delineate manageable, we restricted ourselves to a minimum Support of \(45~\%\) (the maximum Support in this dataset being \(50~\%\), this substantially reduces the number of itemsets) and a minimum Confidence of \(80~\%\) (the maximum Confidence among associations with a Support greater than or equal to \(45~\%\) being \(91~\%\)).
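For reference, both measures are used with their standard definitions: the Support of an itemset \(X\) is its relative frequency among the observations, and the Confidence of a rule \(X \Rightarrow Y\) is the relative frequency of \(Y\) among the observations containing \(X\):
\[
\mathrm{supp}(X) \;=\; \frac{\#\{\text{observations containing } X\}}{\#\{\text{observations}\}},
\qquad
\mathrm{conf}(X \Rightarrow Y) \;=\; \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}.
\]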

One should be aware that the cars dataset contains many non-binary variables. Such variables were therefore binarized (using the median of each variable as the pivot for the attribution of zeros and ones) prior to applying the 'Apriori' algorithm.
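To make the preprocessing and mining steps concrete, the following is a minimal sketch assuming Python with pandas and the mlxtend implementation of Apriori; the data frame names \texttt{cars} and \texttt{cars\_bin} are hypothetical, and the actual analysis may well have relied on different tooling.

\begin{verbatim}
from mlxtend.frequent_patterns import apriori, association_rules

# 'cars' is assumed to be a pandas DataFrame holding the original data
# (initial dummy variables plus non-binary numeric columns).
cars_bin = cars.copy()
for col in cars.select_dtypes(include="number").columns:
    if cars[col].nunique() > 2:  # leave the existing dummy variables untouched
        # median-pivot binarization: above the median -> 1, otherwise -> 0
        cars_bin[col] = (cars[col] > cars[col].median()).astype(int)

# frequent itemsets with a Support of at least 45 %
itemsets = apriori(cars_bin.astype(bool), min_support=0.45, use_colnames=True)

# association rules with a Confidence of at least 80 %
rules = association_rules(itemsets, metric="confidence", min_threshold=0.80)
\end{verbatim}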

Presentation of itemset and association rule results

Table \ref{ItemsetsCarsDataTable} shows all the items and combinations of items having a Support (i.e. a frequency in this case) greater than or equal to \(45~\%\). This means, for instance, that the item 'WheelBase' (i.e. cars whose wheelbase is larger than the median of the Wheelbase variable) is present in at least \(50~\%\) of our observations.
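In a 0/1-encoded data frame, this Support is simply the column mean; for instance, reusing the hypothetical \texttt{cars\_bin} from the sketch above (the exact column spelling may differ):

\begin{verbatim}
wheelbase_support = cars_bin["WheelBase"].mean()  # share of cars above the median wheelbase
\end{verbatim}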

It is interesting to note that all the items present in Table \ref{ItemsetsCarsDataTable} come from the binarized variables. None of the initial dummy variables have frequencies as high as \(45~\%\), which is a particularly high threshold in practice. On the other hand, it makes perfect sense that most of the binarized variables are above the \(45~\%\) threshold. Indeed, by binarizing with the median as pivot, the resulting frequencies of the variables should all be around \(50~\%\), provided no ties (and therefore no arbitrary assignment of ones and zeros) are observed.

Table \ref{AssocRulesCarsDataTable} presents the association rules resulting from the application of the 'Apriori' algorithm. Columns 'X1 name' and 'X2 name' contain the items presented in Table \ref{ItemsetsCarsDataTable} (i.e. the ones with a sufficient initial Support), and column 'Y name' contains the items which, when associated with the items in 'X1' and 'X2', still reach a Confidence of at least \(80~\%\) (i.e. Y is present at least \(80~\%\) of the time when 'X1' and 'X2' are present).
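In other words, each row of Table \ref{AssocRulesCarsDataTable} corresponds to a rule \(\{X_1, X_2\} \Rightarrow Y\) satisfying
\[
\mathrm{conf}\big(\{X_1, X_2\} \Rightarrow Y\big)
\;=\; \frac{\mathrm{supp}(\{X_1, X_2, Y\})}{\mathrm{supp}(\{X_1, X_2\})}
\;\geq\; 0.80.
\]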