$29.99
CENG 514 Data Mining
Assignment I
In this assignment we will work on Association Rule Mining and elaborate on two related sub
tasks.
Q1. Analyzing the pruning through candidate generation.
The pseudocode for the candidate generation step of Apriori algorithm is as follows:
Given a set of k-item frequent itemsets I as input
(Note that k>1 and items within itemsets and also itemsets are
already sorted in increasing order).
C is the set of k-item Candidate Itemsets, initially empty
For each itemset i in I
For each itemset j in I
If the initial k-1 elements of itemset i and j are the
same,
create a k+1-item candidate itemset c
If c does not include any k-item non-frequent
itemset
C = C U c
Output C
Example:
Input {a,b}, {a,c}, {a,d}, {b,c}, {c,d}, {c,e}, {d,e}
Output should be {a,b,c}, {a,c,d},{c,d,e}
Note that {a,b} and {a,d} can be joined, as their initial k elements are the same, to form the
candidate {a,b,d}. However it includes the {b,d} as a subset, which is a non-frequent 2-item
itemset.
Implement the below algorithms in Python:
i. The candidate generation algorithm above
ii. Candidate generation algorithm by extending each k-item frequent itemset by an
item in the data set
We expect to obtain pruning in candidates in the first algorithm. In order to analyze this:
i. Run both of the algorithms on 50 randomly created k-item frequent itemsets I.
(Use the same input for both of the algorithms.)
ii. Report the reduction in the number of candidates generated per run and the
average reduction.
Note that the number of reduction in candidate generation will be calculated as given in
Equation 1.
𝑅𝑒𝑑𝑢𝑐𝑡𝑖𝑜𝑛 =
𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒𝑠 𝑔𝑒𝑛𝑒𝑟𝑎𝑡𝑒𝑑 𝑖𝑛 𝐴𝑙𝑔𝑜𝑟𝑖𝑡ℎ𝑚 1
𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐𝑎𝑛𝑑𝑖𝑑𝑎𝑡𝑒𝑠 𝑔𝑒𝑛𝑒𝑟𝑎𝑡𝑒𝑑 𝑖 𝐴𝑙𝑔𝑜𝑟𝑖𝑡ℎ𝑚 2
(1)
Submission: Codes and the report presenting the analysis result.
Q2. Analyzing the Partitioning (SON) Algorithm.
As discussed in the class, SON algorithm partitions the transaction data set into chunks fitting
in the memory and returns the complete result in two passes on the data. In this part of the
assignment, you will simulate the first pass of the algorithm and compare the result with that of
the original Apriori.
Follow the given steps:
i. For the analysis, use groceries.csv data set
ii. Run apriori function under mlxtend library in Python on groceries.csv data set under
0.02 support ratio.
iii. For i in 5 to 10
a. Partition groceries.csv data set into i partitions
b. Find the total number of frequent itemsets generated for all the partitions
c. Compare the result with that of Step (ii) and report as a ratio given in Equation
2.
𝑂𝑣𝑒𝑟 𝑔𝑒𝑛𝑒𝑟𝑎𝑡𝑖𝑜𝑛 𝑟𝑎𝑡𝑖𝑜𝑛 =
𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑡 𝑖𝑡𝑒𝑚𝑠𝑒𝑡𝑠 𝑔𝑒𝑛𝑒𝑟𝑎𝑡𝑒𝑑 𝑖𝑛 𝑡ℎ𝑒 𝑓𝑖𝑟𝑠𝑡 𝑝𝑎𝑠𝑠 𝑜𝑓 𝑆𝑂𝑁
𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑡 𝑖𝑡𝑒𝑚𝑠 𝑔𝑒𝑛𝑒𝑟𝑎𝑡𝑒𝑑 𝑖𝑛 𝑡ℎ𝑒 𝑜𝑟𝑖𝑔𝑖𝑛𝑎𝑙 𝐴𝑝𝑟𝑖𝑜𝑟𝑖
(2)
Note that apriori in mlxtend gets transaction data input in the form of one hot encoding. So you
will need to convert each row in the original data set to a list, and then to one hot encoding
format. You can use TransactionEncoder under mlxtend.preprocessing.
Submission: Codes and the report presenting the analysis result.