CSCI 5408 Assignment 2
Assignment 2
** Late submissions are not accepted and will result in a 0 on the assignment
Objective:
This assignment covers concepts related to BigData and NoSQL, and the research phase of a data
management project. Consider this assignment the second phase of an industry project. The
BigData framework designed and the data gathered in this assignment will be used in the next
assignment.
Grading Scheme:
• Spark Setup: 20%
• Twitter Data Extraction & Transformation: 25%
• News Article Data Extraction & Transformation: 20%
• Data Processing: 30%
• Citations in IEEE/ACM format only, using reliable information sources: 5%
Hypothetical Scenario:
A company, "Analytics-5408", is trying to establish its position in the business domain in
Canada. Recently it hired you as an information specialist to work for its client (Dalhousie
University). You designed the data models and gathered some data for the client. Now the
project has been extended, and you have started working on a content management system (CMS)
that will hold information related to education in Canada. The project has two components,
(1) Data management, and
(2) Business Intelligence
In this phase, the project focuses on implementing a BigData infrastructure and processing
data extracted from Twitter and the News API. The company believes the 280 characters in a tweet on
“Canada”, “University”, “Education” etc. may contain essential information related to education in
Canada and incoming.
Summer 2019 CSCI 5408 Assignment 2
*** Your Tasks for this Assignment ***
A. Cluster Setup:
1. Create a cloud account (if you do not have one) with any cloud service provider.
2. Initialize Apache Spark on your cloud account. Follow the tutorials provided in Labs.
3. If you do not wish to work on a cloud account, you must create a local standalone Hadoop
cluster to perform the operations.
4. Install MongoDB to store the data.
B. (i) Twitter Data Extraction & Transformation:
5. Create a Twitter developer account. (Approval might take 7 days, so create the account
ASAP.)
6. Explore the Twitter search and streaming APIs and data format
7. Write a well-formed script/program (in Java, Python, PHP, Perl, etc.) to extract
data from Twitter. (Do not use any program code or scripts found online. You can only use
API specification code given by Twitter - “tweepy”)
a. The search keywords are “Canada”, “University”, “Dalhousie University”, “Halifax”,
“Canada Education”.
b. You need to extract the tweets related to the given keywords.
c. Running the method/program to query the search API and streaming API for 3000+
records will be enough.
Note: Working on a small dataset will not use a large amount of cloud resources or
local cluster memory.
d. You should extract tweets and retweets along with the provided metadata, such as
location, time etc.
e. The captured data should be kept in MongoDB.
8. The data you captured from tweets using search/streaming APIs could be cleaned and
transformed before uploading to the cloud infrastructure or local cluster.
a. Remove special characters, URLs, emoticons etc. Retain “RT”
b. You can upload the JSON/XML/TXT etc. files containing the tweets to the cloud or local
cluster.
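The cleaning in step 8 could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function names `clean_text` and `to_document` are hypothetical, the input is assumed to be a tweet already parsed into a dict with the standard tweet JSON fields (`id_str`, `text`, `created_at`, `user.location`, `retweeted_status`), and the regular expressions are one possible reading of "special characters, URLs, emoticons". Because "RT" consists of plain letters, it survives the cleaning without special handling.

```python
import re

# URLs, common text emoticons such as :) or ;-D, and any remaining symbols.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOTICON_RE = re.compile(r"(?<!\w)[:;=][-']?[)(\]\[DPpOo/\\|](?!\w)")
# Anything that is not a Unicode letter/digit or whitespace (covers emoji).
SYMBOL_RE = re.compile(r"[^\w\s]|_")
WS_RE = re.compile(r"\s+")


def clean_text(text):
    """Remove URLs, emoticons and special characters; "RT" is left untouched."""
    text = URL_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()


def to_document(status):
    """Flatten one parsed tweet (a dict) into a document ready for MongoDB.

    Assumption: `status` follows the standard tweet JSON layout. Using the
    tweet id as `_id` is a design choice so that the same tweet is not
    stored twice; it is not required by the assignment.
    """
    user = status.get("user") or {}
    return {
        "_id": status.get("id_str"),
        "text": clean_text(status.get("text", "")),
        "created_at": status.get("created_at"),
        "location": user.get("location"),
        "is_retweet": "retweeted_status" in status,
    }
```

Documents produced this way can be written to a JSON file or inserted into a MongoDB collection by whichever driver you use.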
(ii) News Article Data Extraction & Transformation:
9. Visit the news API https://newsapi.org/
10. Create a developer account
11. Search the same keywords as mentioned before - “Canada”, “University”, “Dalhousie
University”, “Halifax”, “Canada Education”.
12. Clean and format the data. You need to remove special tags (if any).
13. Upload your newly created files to MongoDB on the cloud server or local cluster.
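Steps 11 and 12 might look roughly like the Python sketch below. It is not a reference implementation: the endpoint `https://newsapi.org/v2/everything` and the `q` / `apiKey` parameters are assumptions to be checked against the News API documentation, the helper names are hypothetical, and the article fields (`source`, `title`, `description`, `content`, `url`, `publishedAt`) are assumed to match the JSON the API returns. The sketch only builds the request URL and cleans an already-downloaded article, so it makes no network call.

```python
import html
import re
from urllib.parse import urlencode

# Assumed endpoint; verify against the News API documentation.
NEWS_ENDPOINT = "https://newsapi.org/v2/everything"
KEYWORDS = ["Canada", "University", "Dalhousie University",
            "Halifax", "Canada Education"]

TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")


def build_url(keyword, api_key):
    """Build a search URL for one keyword (quoted so phrases stay together)."""
    query = urlencode({"q": f'"{keyword}"', "apiKey": api_key})
    return f"{NEWS_ENDPOINT}?{query}"


def strip_markup(value):
    """Remove HTML-style tags, decode entities and collapse whitespace."""
    if not value:
        return ""
    text = TAG_RE.sub(" ", value)
    text = html.unescape(text)
    return WS_RE.sub(" ", text).strip()


def clean_article(article):
    """Keep the fields of interest from one article dict, with tags removed."""
    return {
        "source": (article.get("source") or {}).get("name"),
        "title": strip_markup(article.get("title")),
        "description": strip_markup(article.get("description")),
        "content": strip_markup(article.get("content")),
        "url": article.get("url"),
        "publishedAt": article.get("publishedAt"),
    }
```

Looping over `KEYWORDS`, fetching each URL, and passing every returned article through `clean_article` yields documents that can be stored alongside the tweets.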
C. Data Processing (Spark):
14. Using the Spark framework, perform a frequency count of the following substrings or words.
You need to consider both the stored tweets and the stored news articles for the frequency count.
a. “education”
b. “Canada”
c. “university”
d. “dalhousie”
e. “expensive”
f. “good school” or “good schools”
g. “bad school” or “bad schools” or “poor school” or “poor schools”
h. “faculty”
i. “computer science”
j. “graduate”
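The counting logic for the list above can be prototyped in plain Python before moving it to Spark. The sketch below is an illustration only, not the required Spark solution: it assumes case-insensitive substring matching (so "undergraduate" also counts toward "graduate"), groups the singular/plural and bad/poor variants under one label each, and uses hypothetical names (`PATTERNS`, `count_in_text`, `frequency_count`). The per-document function plays the role of the map step and the summation plays the role of the reduce step; in PySpark the same function could be passed to `flatMap`, followed by `reduceByKey`.

```python
import re
from collections import Counter

# One regular expression per requested item; matching is case-insensitive.
PATTERNS = {
    "education": r"education",
    "Canada": r"canada",
    "university": r"university",
    "dalhousie": r"dalhousie",
    "expensive": r"expensive",
    "good school(s)": r"good schools?",
    "bad/poor school(s)": r"(?:bad|poor) schools?",
    "faculty": r"faculty",
    "computer science": r"computer science",
    "graduate": r"graduate",
}
COMPILED = {label: re.compile(p, re.IGNORECASE) for label, p in PATTERNS.items()}


def count_in_text(text):
    """Map step: return (label, occurrences) pairs for one tweet or article."""
    return [(label, len(rx.findall(text))) for label, rx in COMPILED.items()]


def frequency_count(texts):
    """Reduce step: sum the per-document counts for every label."""
    totals = Counter({label: 0 for label in PATTERNS})
    for text in texts:
        for label, n in count_in_text(text):
            totals[label] += n
    return dict(totals)
```

If whole-word matching is preferred over substring matching, wrapping each pattern in `\b...\b` is one way to change the behaviour.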
Submission Instruction:
• Create a folder with your name and B00 number, and store all your files in it:
  o A PDF report of at most 2 pages that includes the following:
    ▪ Your cloud setup steps,
    ▪ Data extraction process,
    ▪ Cleaning process,
    ▪ Sample data file in JSON, XML, or any other format
  o Screenshots of your cloud dashboard or local cluster, and of the data processing
  o Program or script file (source code)
  o Any dictionary or supporting file(s) required for the program to run
  o An output file (.txt format). You may also include the output file as part of the PDF file
• Compress the folder and create a .ZIP file (do not use other compression formats)
• Upload the .ZIP file on Brightspace.
• Submission Due: Nov 6, 2019 at 11:59 pm (midnight)