Foundations and Applications of Data Mining Assignment 1
Inf553 – Foundations and Applications of Data Mining Assignment 1
1 Overview of the Assignment In this assignment, students will complete two tasks. The goal of these two tasks is to let students get familiar with Spark and perform data analysis using Spark. In the assignment description, the first part is about how to configure the environment and data sets, the second part describes the two tasks in details, and the third part is about the files the students should submit and the grading criteria. 2 Environment Configuration Please use Spark 2.2.1 with Hadoop 2.7 for this assignment. 2.1 Spark Installation Spark can be downloaded from the official website (refer to: Spark Page ) The interface of Spark official website is shown in the following figure. Figure 1: The Interface of Spark Official Website 1 2.2 Python Configuration You need to add the paths of your Spark (path/to/your/Spark) and Python (path/to/your/ Spark/python) folders to the interpreter’s environment variables named as SPARK HOME and PYTHONPATH, respectively. 2.3 Scala Installation You can use Intellij if you prefer IDE for creating and debugging projects. And install Scala/SBT plugins for Intellij. You can refer to the tutorial ”Setting UP Spark 2.0 environment on intellij community edition”. 2.4 Environment Requirements Python: 2.7 Scala: 2.11 Spark: 2.2.1 Student must use python to complete both Task1 and Task2. There will be 10% bonus if you also use Scala for both Task1 and Task2 (i.e. 10 - 11; 9 - 9.9). IMPORTANT: We will use these versions to compile and test your code. If you use other versions, there will be a 20% penalty since we will not be able to grade it automatically. 2.5 Write your own code! For this assignment to be an effective learning experience, you must write your own code! I emphasize this point because you will be able to find Python implementations of most or perhaps even all of the required functions on the web. Please do not look for or at any such code! Do not share code with other students in the class!! TA will combine some python code on Github which can be searched by keyword ”INF553” and every students’ code, using some software tool for detecting Plagiarism. Write your own code. 2.6 Data Please download the data from MovieLen over the following link: MovieLen You are required to download two data sets. The first is ml-20m.zip, which size is 190MB, the second is ml-latest-small.zip, which size is 1MB. Each zip file contains five CSV files. The files tags.csv and ratings.csv are needed for the tasks. 2 3 Task 1: (40%) Students are required to calculate each movie’s average rating. The ratings.CSV file is needed for this task. 3.1 Result format 1. Save the result as one csv file with header (movieId, rating avg). 2. The result is ordering by movieId in ascending order. The following snapshot is an example of result for task 1. It just shows the format of the result. Figure 2: Example of Result for Task 1 3.2 Execution Example The program that you will implement should take two parameters as input and generate one file as an output. The first parameter must be the location of the ratings.csv file and the second one must be the path to the output file followed by the name of the output file. Java/Scala Execution Example Please use Task1 as class name Figure 3: Example of Command Line for Task 1 3 4 Task 2: (60%) Students are required to calculate the average rating of each tag. Both the rating.csv and tags.csv files are required for this task. 4.1 Result format 1. Save the result as one csv file with header (tag, rating avg). 2. The result is ordering by tag in descending order. The following two snapshots is an example of result for task 2. The unreadable codes in the first snapshots are because encoding problem. It just shows the format of the result. In the second picture, the data is sorted by first column in descending order. Figure 4: Example of Result for Task 2 Figure 5: Example of Result for Task 2 4 4.2 Execution Example The program that you will implement should take three parameters as input and generate one file as an output. The first two parameter must be the location of the ratings.csv file and tags.csv file and the third one must be the path to the output file followed by the name of the output file. Java/Scala Execution Example Please use Task2 as class name Figure 6: Example of Command Line for Task 2 4.3 Hints for Task2 1. You can create Dataframe objects and save the Dataframe objects as csv file. 2. You can learn more about Dataframe by this link: DataFrames Submission Details Your submission must be a .zip file with name: Firstname Lastname hw1.zip. The structure of your submission should be identical as shown below. The Firstname Lastname Description.pdf file contains helpful instructions on how to run your code and other need informations. The OutputFiles directory contains the deliverable output files for each problem and the Solution directory contains your source code. Figure 7: Submission Structure 5 What you need to turn in 1. Source codes for two tasks (Python or Scala) and name it respectively as Firstname Lastname task1 Firstname Lastname task2 (For example, Yuanbin Cheng task1.py) 2. Result files of two tasks for large and small data sets and name it as Firstname Lastname result task1 big.csv Firstname Lastname result task2 big.csv Firstname Lastname result task1 small.csv Firstname Lastname result task2 small.csv 3. Documents: please describe how to run your program in this document. 4. If you use Scala, please submit the jar package as well and name them as Firstname Lastname hw1.jar 5. Zip the above files and name it as Firstname Lastname hw1.zip Grading Criteria 1. Your codes will be run according to your Readme file. If your programs cannot be run with the commands you provide, your submission will be graded based on the result files you submit and 80% penalty for it. 2. If the file generated by your program is unsorted, there will be 20% penalty. 3. If your program does not use the required Scala/Python/Spark versions, there will be 20% penalty. 4. If your program generates more than one file, there will be 20% penalty. 5. If the csv file generated has more than two columns, there will be 20% penalty. 6. If the header of the csv file is missing, there will be 10% penalty. 7. The deadline for assignment 1 is 05/28 midnight. There will be 20% penalty for late submission within a week and 0 grade after a week. 8. You can use your free 5-day extension. 9. There will be 10% bonus if you use both Scala and python for the entire assignment. 6