INF553 Foundations and Applications of Data Mining

Assignment 2: Find Frequent Itemsets in Two Datasets

1. Overview of the Assignment
In this assignment, you will implement the SON algorithm using the Apache Spark Framework.
You will develop a program to find frequent itemsets in two datasets: one simulated dataset
and one real-world dataset generated from the Yelp dataset. The goal of this assignment is to
apply the algorithms you have learned in class to large datasets more efficiently in a
distributed environment.
2. Requirements
2.1 Programming Requirements
a. You must use Python to implement all tasks. There will be a 10% bonus for each task if you
also submit a Scala implementation and both your Python and Scala implementations are
correct.
b. You are required to use only Spark RDDs, in order to understand Spark operations more
deeply. You will not receive any points if you use Spark DataFrames or Datasets.
2.2 Programming Environment
Python 3.6, Scala 2.11 and Spark 2.3.2
We will use Vocareum to automatically run and grade your submission. We highly
recommend that you first test your script on your local machine and then submit to
Vocareum.
2.3 Write your own code
Do not share code with other students!!
For this assignment to be an effective learning experience, you must write your own code! We
emphasize this point because you will be able to find Python implementations of some of the
required functions on the web. Please do not look for or at any such code!
TAs will combine all the code we can find on the web (e.g., GitHub) as well as other
students’ code from this and previous sections for plagiarism detection. We will report
all detected plagiarism.
2.4 What you need to turn in
a. Three Python scripts, named (all lowercase): task1.py, task2.py, preprocess.py
b1. [OPTIONAL] Two Scala scripts, named (all lowercase): task1.scala, task2.scala
(no need to write preprocessing code in Scala)
b2. [OPTIONAL] One jar package, named hw2.jar (all lowercase)
Note: You don’t need to include your output files. We will grade your code with our own test
data (the data will be in the same format).
3. Datasets
In this assignment, you will use one simulated dataset and one real-world dataset. In task 1,
you will build and test your program with a small simulated CSV file that has been provided to
you.
For task 2, you need to generate a subset using business.json and review.json from the Yelp
dataset (https://drive.google.com/drive/folders/1-Y4H0vw2rRIjByDdGcsEuor9VagDyzin?usp=sharing)
with the same structure as the simulated data. Figure 1 shows the file structure: the first
column is user_id and the second column is business_id. In task 2, you will test your
code with this real-world data.
Figure 1: Input Data Format
We will only provide a submission report for small1.csv on Vocareum for task 1. No
submission report will be provided for task 2. You are encouraged to use the command line to
run your code on small2.csv as well as on the task 2 data to get a sense of the running time.
4. Tasks
In this assignment, you will implement the SON algorithm to solve all tasks (Tasks 1 and 2) on
top of the Apache Spark Framework. You need to find all the possible combinations of the
frequent itemsets in any given input file within the required time. You can refer to Chapter 6
of the Mining of Massive Datasets book and concentrate on Section 6.4, Limited-Pass
Algorithms. (Hint: you can choose the A-Priori, MultiHash, or PCY algorithm to process each
chunk of the data.)
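For orientation only, below is a minimal sketch of the two-pass SON structure on a Spark RDD.
It assumes baskets_rdd is an RDD of baskets (each a collection of item ids) and that
apriori(baskets, threshold) is a hypothetical in-memory helper (your own A-Priori, MultiHash,
or PCY pass over one chunk) that returns candidate itemsets as tuples; it is a starting point,
not the required implementation.

def son(baskets_rdd, support):
    total = baskets_rdd.count()

    # Pass 1: run the in-memory algorithm on each chunk with a scaled-down threshold.
    def local_candidates(partition):
        baskets = list(partition)
        scaled = support * len(baskets) / float(total)
        return apriori(baskets, scaled)  # hypothetical helper you implement yourself

    candidates = (baskets_rdd.mapPartitions(local_candidates)
                  .distinct()
                  .collect())

    # Pass 2: count every candidate over all baskets and keep the truly frequent ones.
    def count_candidates(partition):
        baskets = [set(b) for b in partition]
        for cand in candidates:
            cnt = sum(1 for b in baskets if set(cand).issubset(b))
            if cnt > 0:
                yield (cand, cnt)

    frequent = (baskets_rdd.mapPartitions(count_candidates)
                .reduceByKey(lambda a, b: a + b)
                .filter(lambda kv: kv[1] >= support)
                .keys()
                .collect())
    return candidates, frequent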
4.1 Task 1: Simulated data (4 pts)
There are two CSV files (small1.csv and small2.csv) provided on Vocareum in your
workspace. small1.csv is just a sample file that you can use to debug your code. For
task 1, we will test your code on small2.csv for grading.
In this task, you need to build two kinds of market-basket models.
Case 1 (2 pts):
You will calculate the combinations of businesses (as singletons, pairs, triples, etc.)
that qualify as frequent given a support threshold. You need to create a basket for each
user containing the business ids reviewed by that user. If a business was reviewed more than
once by the same user, it is counted only once. More specifically, the
business ids within each basket are unique. The generated baskets are similar to:
user1: [business11, business12, business13, ...]
user2: [business21, business22, business23, ...]
user3: [business31, business32, business33, ...]
Case 2 (2 pts):
You will calculate the combinations of users (as singletons, pairs, triples, etc.) that
qualify as frequent given a support threshold. You need to create a basket for each
business containing the user ids that commented on that business. As in case 1, the user
ids within each basket are unique. The generated baskets are similar to:
business1: [user11, user12, user13, ...]
business2: [user21, user22, user23, ...]
business3: [user31, user32, user33, ...]
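As a rough sketch (assuming sc is an existing SparkContext, the input CSV has a header line
and comma-separated user_id,business_id rows as in Figure 1, and case_number and
input_file_path come from the command line), the two basket models could be built like this:

lines = sc.textFile(input_file_path)
header = lines.first()
pairs = (lines.filter(lambda row: row != header)
         .map(lambda row: row.split(','))
         .map(lambda cols: (cols[0].strip(), cols[1].strip())))  # (user_id, business_id)

if case_number == 1:
    # Case 1: one basket per user, holding the distinct businesses that user reviewed.
    baskets = pairs.groupByKey().mapValues(lambda ids: sorted(set(ids)))
else:
    # Case 2: one basket per business, holding the distinct users who reviewed it.
    baskets = (pairs.map(lambda kv: (kv[1], kv[0]))
               .groupByKey().mapValues(lambda ids: sorted(set(ids))))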
Input format:
1. Case number: Integer that specifies the case. 1 for Case 1 and 2 for Case 2.
2. Support: Integer that defines the minimum count to qualify as a frequent itemset.
3. Input file path: This is the path to the input file including path, file name and extension.
4. Output file path: This is the path to the output file including path, file name and
extension.
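A minimal sketch of reading these arguments with sys.argv; the variable names simply mirror
the parameter list above.

import sys

case_number = int(sys.argv[1])
support = int(sys.argv[2])
input_file_path = sys.argv[3]
output_file_path = sys.argv[4]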
Output format:
1. Runtime: the total execution time from loading the file until finishing writing the output
file.
You need to print the runtime to the console with the “Duration” tag, e.g., “Duration: 100”.
2. Output file:
(1) Output-1
You should use “Candidates:” as the tag. For each line you should output the candidate
frequent itemsets found after the first pass of the SON algorithm, followed by an empty line
after each frequent-X itemset combination list. The printed itemsets must be sorted in
lexicographical order. (Both user_id and business_id have the data type “string”.)
(2) Output-2
You should use “Frequent Itemsets:” as the tag. For each line you should output the final
frequent itemsets found after finishing the SON algorithm. The format is the same as
Output-1. The printed itemsets must be sorted in lexicographical order.
Here is an example of the output file: 
Both the Output-1 and Output-2 results should be saved in ONE output file
“firstname_lastname_task1.txt”.
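A minimal sketch of timing the run and writing the result file. The “Duration” tag, the two
section tags, the lexicographical ordering, and the empty line after each frequent-X group
follow the spec above; candidates and frequent_itemsets are assumed to be lists of string
tuples produced by your SON pass, and the exact rendering of each itemset is an assumption
you should adapt to the expected output format.

import time

start = time.time()
# ... build the baskets, run SON to get `candidates` and `frequent_itemsets` ...

def write_section(f, tag, itemsets):
    f.write(tag + "\n")
    max_size = max((len(i) for i in itemsets), default=0)
    for size in range(1, max_size + 1):
        group = sorted(i for i in itemsets if len(i) == size)  # lexicographical order
        line = ",".join("(" + ", ".join("'%s'" % x for x in i) + ")" for i in group)
        f.write(line + "\n\n")  # empty line after each frequent-X list

with open(output_file_path, "w") as f:
    write_section(f, "Candidates:", candidates)
    write_section(f, "Frequent Itemsets:", frequent_itemsets)

print("Duration: %d" % (time.time() - start))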
Execution example:
Python:
spark-submit firstname_lastname_task1.py <case number> <support> <input_file_path>
<output_file_path>
Scala:
spark-submit --class firstname_lastname_task1 firstname_lastname_hw2.jar <case number>
<support> <input_file_path> <output_file_path>
4.2 Task 2: Yelp data (4 pts)
In task 2, you will explore the Yelp dataset to find the frequent business sets (only case 1).
You will jointly use business.json and review.json to generate the input user-business CSV
file yourselves.
(1) Data preprocessing
You need to generate a sample dataset from business.json and review.json
(https://drive.google.com/drive/folders/1-Y4H0vw2rRIjByDdGcsEuor9VagDyzin?usp=sharing) with the
following steps:
1. The businesses you need are those in the state of Nevada, i.e., filter on ‘state’ == ‘NV’.
2. Select “user_id” and “business_id” from review.json for records whose “business_id” belongs
to a Nevada business. Each line in the CSV file would be “user_id1, business_id1”.
3. The header of the CSV file should be “user_id,business_id”.
You need to save the dataset in CSV format. Figure 2 shows an example of the output file.
Figure 2: user_business file
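A minimal PySpark RDD sketch of this preprocessing, assuming the standard Yelp layout of one
JSON object per line (with 'business_id' and 'state' fields in business.json, and 'user_id'
and 'business_id' fields in review.json); the file paths here are placeholders.

import json
from pyspark import SparkContext

sc = SparkContext(appName="preprocess")

# Businesses located in Nevada.
nv_business_ids = (sc.textFile("business.json")
                   .map(json.loads)
                   .filter(lambda b: b.get("state") == "NV")
                   .map(lambda b: b["business_id"]))
nv_set = set(nv_business_ids.collect())

# (user_id, business_id) pairs whose business is in Nevada.
pairs = (sc.textFile("review.json")
         .map(json.loads)
         .filter(lambda r: r["business_id"] in nv_set)
         .map(lambda r: (r["user_id"], r["business_id"])))

# Write the CSV with the required header.
with open("user_business.csv", "w") as f:
    f.write("user_id,business_id\n")
    for user_id, business_id in pairs.collect():
        f.write("%s,%s\n" % (user_id, business_id))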
You need to submit the code and the output file of this data preprocessing step. The
preprocessing code will NOT be graded. We will use different filters to generate another
dataset for grading.
(2) Apply SON algorithm
The requirements for task 2 are similar to task 1. However, you will test your implementation
with the large dataset you just generated, and you need to report the total execution time.
This execution time includes everything from reading the file to writing the results to the
output file. You are asked to find the frequent business sets (only case 1) from the file you
just generated. The following are the steps you need to take:
1. Read the user_business CSV file into an RDD and then build the case 1 market-basket model;
2. Find the qualified users, i.e., those who reviewed more than k businesses (k is the filter
threshold); see the sketch after this list;
3. Apply the SON algorithm to the filtered market-basket model.
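A one-line sketch of step 2, assuming baskets is the case 1 RDD of
(user_id, [business_id, ...]) pairs built in step 1 and filter_threshold is the filter
threshold k read from the command line:

qualified_baskets = baskets.filter(lambda kv: len(kv[1]) > filter_threshold)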
Input format:
1. Filter threshold: Integer used to select the qualified users (those who reviewed more than
this many businesses).
2. Support: Integer that defines the minimum count to qualify as a frequent itemset.
3. Input file path: This is the path to the input file including path, file name and extension.
4. Output file path: This is the path to the output file including path, file name and
extension.
Output format:
1. Runtime: the total execution time from loading the file until finishing writing the output
file.
You need to print the runtime to the console with the “Duration” tag, e.g., “Duration: 100”.
2. Output file
The output file format is the same as in task 1. Both the intermediate and final results
should be saved in ONE output file “firstname_lastname_task2.txt”.
Execution example:
Python:
spark-submit firstname_lastname_task2.py <filter threshold> <support> <input_file_path>
<output_file_path>
Scala:
spark-submit --class firstname_lastname_task2 firstname_lastname_hw2.jar <filter threshold>
<support> <input_file_path> <output_file_path>
5. Evaluation Metric
Task 1:
Input File    Case    Support    Runtime (sec)
small2.csv    1       4          <=200
small2.csv    2       9          <=200

Task 2:
Input File           Filter Threshold    Support    Runtime (sec)
user_business.csv    70                  50         <=2,000
6. Grading Criteria
(A % penalty means losing that percentage of the points you would otherwise receive.)
1. You can use your free 5-day extension all at once or split it up, but you need to email
your TA within 24 hours of your submission to indicate that you are using free days.
2. There will be a 10% bonus for each task (i.e., 0.3 pts, 0.2 pts, 0.3 pts) if your Scala
implementations are correct. The Scala bonus is only awarded when your Python results are
correct. There are no partial points for Scala.
3. You will receive no points if your programs cannot be executed on Vocareum.
Please start your assignment early! You can resubmit on Vocareum. We will grade your last
submission.
4. There is no regrading. Once the grade is posted on Blackboard, we will only regrade
your assignment if there is a grading error. No exceptions.
5. There will be a 20% penalty for late submissions within a week, and no points after a week.
6. You will receive no points if the total execution time exceeds the limits in the Section 5
evaluation metric.
7. If the outputs of your program are unsorted or partially sorted, there will be a 50% penalty.
