Data Mining - Association Rules

Introduction

One of the reasons behind maintaining any database is to enable the user to find interesting patterns and trends in the data. For example, in a supermarket, the user can figure out which items are being sold most frequently. But this is not the only type of `trend' which one can possibly think of. The goal of database mining is to automate this process of finding interesting patterns and trends. Once this information is available, we can perhaps get rid of the original database. The output of the data-mining process should be a "summary" of the database. This goal is difficult to achieve due to the vagueness associated with the term `interesting'. The solution is to define various types of trends and to look for only those trends in the database. One such type constitutes the association rule.

In the rest of the discussion, we shall assume the supermarket example, where each record or tuple consists of the items of a single purchase. However the concepts are applicable in a large number of situations.

In the present context, an association rule tells us about the association between two or more items. For example: In 80% of the cases when people buy bread, they also buy milk. This tells us of the association between bread and milk. We represent it as -

bread => milk | 80%

This should be read as - "Bread means or implies milk, 80% of the time." Here 80% is the "confidence factor" of the rule.

Association rules can be between more than 2 items. For example -

bread, milk => jam | 60%

bread => milk, jam | 40%

Given any rule, we can easily find its confidence. For example, for the rule

bread, milk => jam

we count the number say n₁, of records that contain bread and milk. Of these, how many contain jam as well? Let this be n₂. Then required confidence is n₂/n₁.

This means that the user has to guess which rule is interesting and ask for its confidence. But our goal was to "automatically" find all interesting rules. This is going to be difficult because the database is bound to be very large. We might have to go through the entire database many times to find all interesting rules.

Brute Force

The common-sense approach to solving this problem is as follows -

Let I = { i₁, i₂, ..., i_n } be a set of items, also called as an itemset. The number of times, this itemset appears in the database is called its "support". Note that we can speak about support of an itemset and confidence of a rule. The other combinations - support of a rule and confidence of an itemset are not defined.

Now, if we know the support of `I' and all its subsets, we can calculate the confidence of all rules which involve these items. For example, the confidence of the rule

i₁, i₂, i₃ => i₄, i₅

	
	   support of { i₁, i₂, i₃, i₄, i₅ }
	is ___________________________________
	       support of { i₁, i₂, i₃ }

So, the easiest approach would be to let `I' contain all items in the supermarket. Then setup a counter for every subset of `I' to count all its occurances in the database. At the end of one pass of the database, we would have all those counts and we can find the confidence of all rules. Then select the most "interesting" rules based on their confidence factors. How easy.

The problem with this approach is that, normally `I' will contain atleast about 100 items. This means that it can have 2¹⁰⁰ subsets. We will need to maintain that many counters. If each counter is a single byte, then about 10²⁰ GB will be required. Clearly this can't be done.

Minimum Support

To make the problem tractable, we introduce the concept of minimum support. The user has to specify this parameter - let us call it minsupport. Then any rule

i₁, i₂, ... , i_n => j₁, j₂, ... , j_n

needs to be considered, only if the set of all items in this rule which is { i₁, i₂, ... , i_n, j₁, j₂, ... , j_n } has support greater than minsupport.

The idea is that in the rule

bread, milk => jam

if the number of people buying bread, milk and jam together is very small, then this rule is hardly worth consideration (even if it has high confidence).

Our problem now becomes - Find all rules that have a given minimum confidence and involves itemsets whose support is more than minsupport. Clearly, once we know the supports of all these itemsets, we can easily determine the rules and their confidences. Hence we need to concentrate on the problem of finding all itemsets which have minimum support. We call such itemsets as frequent itemsets.

Some Properties of Frequent Itemsets

The methods used to find frequent itemsets are based on the following properties -

Every subset of a frequent itemset is also frequent. Algorithms make use of this property in the following way - we need not find the count of an itemset, if all its subsets are not frequent. So, we can first find the counts of some short itemsets in one pass of the database. Then consider longer and longer itemsets in subsequent passes. When we consider a long itemset, we can make sure that all its subsets are frequent. This can be done because we already have the counts of all those subsets in previous passes.
Let us divide the tuples of the database into partitions, not necessarily of equal size. Then an itemset can be frequent only if it is frequent in atleast one partition. This property enables us to apply divide and conquer type algorithms. We can divide the database into partitions and find the frequent itemsets in each partition. An itemset can be frequent only if it is frequent in atleast one of these partitions. To see that this is true, consider k partitions of sizes n₁, n₂,..., n_k.
Let minimum support be s.
Consider an itemset which does not have minimum support in any partition. Then its count in each partition must be less than sn₁, sn₂,..., sn_k respectively.
Therefore its total count must be less than the sum of all these counts, which is s( n₁ + n₂ +...+ n_k ).
This is equal to s*(size of database).
Hence the itemset is not frequent in the entire database.

This Page is Under Construction

Comments/Suggestions? Contact: vikram AT iiit.net

Vikram Pudi