FIT1043 Assignment 3

Hand in Requirements:
1) Please hand in a single PDF file containing your answers to all the questions,
numbered correspondingly.
2) Your report should include the following items:
● The screenshots/images of the outputs/graphs you generate in order to
justify your answers to all the questions.
● Copies of all the bash command lines and R scripts you use. If your
answer is wrong, you may still get half marks if your command line or
script is close to correct.
3) Please note that you need to explain what each part of a command does for
all your answers. For instance, if the code you use is 'unzip
tutorial_data.zip', you need to explain that the command is used to uncompress
the zip file.
4) Please do not include the question text in your submission (doing so incurs a 5% penalty).
NOTE: Two data sets for this assignment are in the Google shared drive:
https://drive.google.com/drive/folders/1Ala0KnoHlgeXFpxVwr7OJaOzSR7xxHSN?usp=sharing
Both are large, so your best bet is to download them while in the lab/studio and do
the assignment there. You will need to use either a Linux machine for this or a Mac
terminal or Cygwin on a Windows machine.
Assignment Tasks:
This assignment is worth 50 marks, which makes up 15% of this Unit's
assessment. There are two tasks that you need to complete for this assignment.
Students who complete only Tasks A1-A7 and B1-B4 can get a maximum of a
Distinction. Students who attempt Tasks A8 and B5 will be showing critical
analysis skills and a deeper understanding of the task at hand, and can achieve
the highest grade. You need to use the Unix shell and R to complete the tasks.
[30 marks] Task A: Investigating global-scale user check-in data collected from
Foursquare in the Shell
Download the file dataset_TIST2015.tar, which contains user check-in data from
Foursquare (https://foursquare.com/).
1) [4 marks] Decompress the tar file and have a look at the files it contains. How
many files are there in the tar file? How big is each file?
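The kind of commands Q1 calls for can be sketched as below. The archive built here is a tiny stand-in (the real dataset_TIST2015.tar is on the shared drive), so the file names are made up for illustration:

```shell
# Build a tiny mock archive so the commands below are runnable end to end.
mkdir -p demo && printf 'a\n' > demo/file1.txt && printf 'bb\n' > demo/file2.txt
tar -cf dataset_demo.tar -C demo file1.txt file2.txt

tar -xf dataset_demo.tar            # -x extract, -f archive name (add -v to list files as they extract)
tar -tf dataset_demo.tar | wc -l    # -t lists the archive contents; wc -l counts the files
ls -lh file1.txt file2.txt          # -l long listing, -h human-readable file sizes
```

With the real tar file, substitute dataset_TIST2015.tar and let `ls -lh` report each extracted file's size.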
2) [2 marks] What delimiter is used to separate the columns in the file
(dataset_TIST2015_Checkins_v2.txt) and how many columns are there?
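A minimal sketch for inspecting the delimiter and counting columns. The two mock rows below assume, for illustration only, a tab-separated layout; verify the real delimiter on dataset_TIST2015_Checkins_v2.txt yourself:

```shell
# Mock a couple of check-in rows (assumed tab-separated, for demonstration).
printf 'u1\tv1\tTue Apr 03 18:00:09 +0000 2012\t-240\n' >  checkins_demo.txt
printf 'u2\tv2\tTue Apr 03 18:00:25 +0000 2012\t-240\n' >> checkins_demo.txt

head -1 checkins_demo.txt | cat -A                # cat -A shows tabs as ^I, making the delimiter visible
awk -F'\t' '{print NF; exit}' checkins_demo.txt   # NF = number of fields in the first line
```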
3) [4 marks] The first column in dataset_TIST2015_Checkins_v2.txt file is
user_id. What are the other columns? Print out the names of the columns.
4) [4 marks] How many check-ins are there in the file, and how many users are
there in the file?
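One way to approach counts like these, sketched on mock data (assuming user_id is column 1, as the question states, and one check-in per line):

```shell
# Mock data: three check-ins from two users.
printf 'u1\tv1\nu1\tv2\nu2\tv3\n' > checkins_demo.txt

wc -l < checkins_demo.txt                     # one check-in per line, so this is the total check-ins
cut -f1 checkins_demo.txt | sort -u | wc -l   # cut -f1 takes column 1; sort -u keeps one line per user
```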
5) [2 marks] What are the first and last dates in the
dataset_TIST2015_Checkins_v2.txt file? (Assume that the data is ordered
chronologically by date.)
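Since the data is assumed chronological, the first and last records bound the date range. A sketch on mock rows (timestamps shown in the Foursquare-style format, column position assumed for illustration):

```shell
# Mock data: two rows in chronological order.
printf 'u1\tv1\tTue Apr 03 18:00:09 +0000 2012\n' >  checkins_demo.txt
printf 'u2\tv2\tSat Sep 16 21:59:01 +0000 2013\n' >> checkins_demo.txt

head -1 checkins_demo.txt   # first (earliest) record
tail -1 checkins_demo.txt   # last (latest) record
```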
6) [4 marks] How many unique venue IDs are there in the
dataset_TIST2015_POIs.txt file?
7) [4 marks] How many unique venue categories are included in the
dataset_TIST2015_POIs.txt file for France? (Hint: FR is the country code for
France.)
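Questions 6 and 7 both reduce to unique counts over columns of the POI file. A sketch on mock rows; the column layout here (venue_id, latitude, longitude, category, country code) is an assumption for illustration, so check the readme for the real order:

```shell
# Mock POI rows (assumed layout: venue_id, lat, lon, category, country code).
printf 'v1\t48.8\t2.3\tSeafood Restaurant\tFR\n'  >  pois_demo.txt
printf 'v2\t48.9\t2.4\tSeafood Restaurant\tFR\n' >> pois_demo.txt
printf 'v3\t40.7\t-74.0\tCoffee Shop\tUS\n'      >> pois_demo.txt

cut -f1 pois_demo.txt | sort -u | wc -l                           # unique venue IDs (Q6)
awk -F'\t' '$5=="FR"' pois_demo.txt | cut -f4 | sort -u | wc -l   # unique categories for France (Q7)
```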
8) Background: How would you select venues from Europe? Consider the
structure of the data presented in the readme file. Check-ins are indexed by a
Venue ID, and venues are described in a separate file, the POI file.
You can select European venues from the POI file in (at least) two ways:
select items in a latitude longitude bounding box, or select items by country
code. Don’t be too fussed by the exact locations (include or exclude Turkey,
Ukraine, etc., that is OK either way).
[6 marks] Create an awk script to create a European subset of the POI file,
and name the subset file “POIeu.txt”. Investigate your European subset.
A. Submit the created POIeu.txt along with your PDF file.
B. Which country has the most venues and which the least, and with how many each?
C. Which country has the most Seafood restaurants?
D. What is the most common (as in, how many venues) class of
restaurant in Europe?
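A sketch of the bounding-box variant on mock data. The column layout and the box limits here are illustrative assumptions, not the official answer; the alternative is to match the country-code column against a list of European codes:

```shell
# Mock POI rows (assumed layout: venue_id, lat, lon, category, country code).
printf 'v1\t48.8\t2.3\tSeafood Restaurant\tFR\n'  >  pois_demo.txt
printf 'v2\t51.5\t-0.1\tSeafood Restaurant\tGB\n' >> pois_demo.txt
printf 'v3\t48.9\t2.4\tFrench Restaurant\tFR\n'   >> pois_demo.txt
printf 'v4\t40.7\t-74.0\tCoffee Shop\tUS\n'       >> pois_demo.txt

# Keep rows whose lat/lon (columns 2-3) fall in a rough European box.
awk -F'\t' '$2 > 35 && $2 < 72 && $3 > -11 && $3 < 41' pois_demo.txt > POIeu.txt

cut -f5 POIeu.txt | sort | uniq -c | sort -rn   # venues per country, most first
```

The same `sort | uniq -c | sort -rn` pattern, applied to the category column (optionally filtered with grep for "Restaurant"), answers parts B-D.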
[20 marks] Task B: Investigating the Twitter Data in the Shell and Graphing in R
In this task you are working with Twitter_Data_1.gz data file. Please decompress the
file and answer the following questions.
1) [2 marks] How many times does the term ‘Donald Trump’ appear in tweets?
(Note: if the term appears twice in a tweet, it counts as two.)
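The "count every occurrence, not every line" requirement is the key detail here. A sketch on mock text:

```shell
# Mock tweet text; the real input is the decompressed Twitter_Data_1 file.
printf 'RT Donald Trump quoted Donald Trump today\nan unrelated tweet\n' > tweets_demo.txt

# grep -o prints every non-overlapping match on its own line, so repeats
# within one tweet each count; wc -l totals the matches.
grep -o 'Donald Trump' tweets_demo.txt | wc -l
```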
2) [5 marks] Background: We want to consider how the amount of discussion
regarding Donald Trump varies over the time period covered by the data file.
To answer this question, you will need to extract the timestamps for all tweets
referring to Donald Trump. You will then need to read them into R and
generate a histogram. [Hint: To read the data into R, first generate a file
containing only the timestamp column as text. Then read the file into R as a
CSV.] R will not recognise the strings as timestamps automatically, so you’ll
need to convert them from text values using the strptime() function.
Instructions on how to use the function are available here:
(https://stat.ethz.ch/R-manual/R-devel/library/base/html/strptime.html).
[Note: the histogram should be plotted in the next question (Q3).]
Question: You will need to write a format string, starting with “%a %b” to tell
the function how to parse the particular date/time format in your file. What
format string do you need to use?
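The shell half of this step (producing the one-column timestamp file for R) can be sketched as below. The mock records and the assumption that the timestamp is the first tab-separated column are both for illustration; inspect the real file's layout to pick the right column:

```shell
# Mock two tweet records (hypothetical layout: timestamp, then tweet text).
printf 'Mon Feb 13 10:01:02 +0000 2017\tRT about Donald Trump\n' >  tweets_demo.txt
printf 'Wed Feb 15 06:12:32 +0000 2017\tDonald Trump again\n'    >> tweets_demo.txt

# Keep only tweets mentioning Donald Trump, then isolate the timestamp
# column into a one-column text file that R can load with read.csv().
grep 'Donald Trump' tweets_demo.txt | cut -f1 > timestamps.txt
wc -l < timestamps.txt
```

Working out the strptime() format string from timestamps like these is the question itself, so it is left to you.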
3) [6 marks] (R code) Once you’ve converted the timestamps, use the hist()
function to plot the data. [Hint: you may need to set the number of bins
sufficiently high to see the variation over time well.]
4) [3 marks] [R code] The plot has a bit of an unusual shape. Can you see a
pattern before Feb 15, and what happens after that?
5) [4 marks] Plot a second histogram, but this time showing the distribution over
number of tweets per author in the file.
[Hint: You’ll need to count the number of tweets by each unique author in
the Twitter file, producing a file with two columns, “user” and “twitter count”,
in the bash shell. Then load them into R. This is a large file, so you can also
just isolate the counts, then sort and count them to get a summary statistics
file with columns “twitter count” and “number of users”.]
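The two-pass counting described in the hint can be sketched on a mock author column (with real data you would first cut the author column out of the Twitter file):

```shell
# Mock author column: alice tweeted 3 times, bob 2, carol 1.
printf 'alice\nbob\nalice\ncarol\nbob\nalice\n' > authors_demo.txt

# Pass 1: tweets per author -> two columns "count user".
sort authors_demo.txt | uniq -c | awk '{print $1, $2}' > per_author.txt

# Pass 2: summary "twitter count" vs "number of users" -- count how many
# authors share each tweet count.
awk '{print $1}' per_author.txt | sort -n | uniq -c | awk '{print $2, $1}' > summary.txt
cat summary.txt
```

Either per_author.txt or the much smaller summary.txt can then be loaded into R for the histogram.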
Good Luck!
