COSC 426 / 526 - Assignment 02
Discussed: February 3, 2023
In this notebook, we provide you with basic functions for completing the assignment. You will need to modify existing code and write new code to find a solution. Upload your solution to the GitHub repository we assigned to you.
From Lecture 01 ...
This set of practical tasks is to be completed during the first lecture (Lecture 01).
Definitions:
GitHub: web-based hosting service for version control used to distribute and collect assignments as well as other class materials (e.g., slides, code, and datasets)
Git: the distributed version control software on which GitHub is built
Practical Tasks:
Create your own GitHub account
Submit your GitHub username to the Google form: https://forms.gle/c2w4vUeY4nC4uRvNA
Install Git on your laptop
This assignment is due at the start of the second class; push it to your personal class GitHub repository by then.
Problem 1
In this problem we explore reading in and parsing delimiter-separated values stored in files. We start with comma-separated values and then move on to tab-separated values.
Problem 1a: Comma-Separated Values (CSV)
From Wikipedia: In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
If you were to consider the CSV file as a matrix, each line would represent a row and each comma would mark the boundary between two columns. In the provided CSV file, the first row is a header that names each column. In this problem, ...
Count (and print) the number of rows of data (header is excluded) in the csv file
Count (and print) the number of columns of data in the csv file
Calculate (and print) the average of the values that are in the "age" column
You can assume each age in the file is an integer, but the average should be calculated as a float
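As a minimal sketch of the matrix view described above, the following parses a small in-memory sample instead of the provided file. The sample text and its `score` column are hypothetical and chosen only for illustration; they are not the contents of data.csv.

```python
# Hypothetical sample data; it is NOT the assignment's data.csv.
sample_text = "id,score\n1,10\n2,15\n3,20\n"

# One list of fields per line: lines are rows, comma-separated fields are columns.
table = [line.split(",") for line in sample_text.splitlines()]

# The first row is the header; everything after it is data.
header, data = table[0], table[1:]
score_col = header.index("score")

# int() converts each field; true division (/) yields a float average.
total = sum(int(row[score_col]) for row in data)
average = total / len(data)

print(len(data), len(header), average)
```

The same steps should carry over to a file once its lines have been read and stripped of their trailing newlines.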
def parse_delimited_file(filename, delimiter=","):
    # Open and read in all lines of the file
    # (I do not recommend readlines for LARGE files)
    # `open`: ref [1]
    # `readlines`: ref [2]
    with open(filename, 'r', encoding='utf8') as dsvfile:
        lines = dsvfile.readlines()
    # Strip off the newline from the end of each line
    # Using list comprehension is the recommended pythonic way to iterate through lists
    # HINT: refs [3,4]
    # Split each line based on the delimiter (which, in this case, is the comma)
    # HINT: ref [5]
    # Separate the header from the data
    # HINT: ref [6]
    # Find "age" within the header
    # (i.e., calculating the column index for "age")
    # HINT: ref [7]
    # Calculate the number of data rows and columns
    # HINT: [8]
    num_data_rows = 0
    num_data_cols = 0
    # Sum the "age" values
    # HINT: ref [9]
    # Calculate the average age
    ave_age = 0
    # Print the results
    # `format`: ref [10]
    print("Number of rows of data: {}".format(num_data_rows))
    print("Number of cols: {}".format(num_data_cols))
    print("Average Age: {}".format(ave_age))
# Parse the provided csv file
parse_delimited_file('data.csv')
Number of rows of data: 0
Number of cols: 0
Average Age: 0
Expected Output:
Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
References:
1: open
2: readlines
3: list comprehension
4: rstrip
5: split
6: slice
7: "more on lists"
8: len
9: int
10: format
Problem 1b: Tab-Separated Values (TSV)
From Wikipedia: A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure, e.g., database table or spreadsheet data, and a way of exchanging information between databases. Each record in the table is one line of the text file. Each field value of a record is separated from the next by a tab character. The TSV format is thus a type of the more general delimiter-separated values format.
In this problem, repeat the analyses performed in the previous problem, but for the provided tab-delimited file.
NOTE: the order of the columns has changed in this file. If you hardcoded the position of the "age" column, think about how you can generalize the parse_delimited_file function to work for any delimited file with an "age" column.
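One way to avoid a hardcoded position is to look the column up in the header itself. The sketch below assumes the header has already been split into a list of field names; the helper `column_index` and the `name` and `city` columns are hypothetical, used only to show two different column orders.

```python
def column_index(header, column_name):
    """Return the position of column_name in a list of header fields."""
    # list.index raises ValueError if the column is missing.
    return header.index(column_name)

# Hypothetical headers containing the same columns in different orders.
csv_header = "name,age,city".split(",")
tsv_header = "city\tname\tage".split("\t")

print(column_index(csv_header, "age"))
print(column_index(tsv_header, "age"))
```

Because the index is computed from the header, the same parsing code can serve both files as long as the right delimiter is passed in.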
# Further reading on optional arguments, like "delimiter": http://www.diveintopython.net/power_of_introspection/optional_arguments.html
parse_delimited_file('data.tsv', delimiter="\t")
Expected Output:
Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
Problem 2
If you opened the data.csv file, you may have noticed some non-English letters in the names column. These characters are represented using Unicode, a standard for representing text in many different scripts and languages. Python 3 natively supports Unicode, but many tools do not. Some tools require text to be formatted with ASCII.
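To see the difference between Unicode and ASCII text in practice, the following sketch checks whether a string can be represented in ASCII. The two example strings are illustrative and are not read from data.csv; `str.isascii` requires Python 3.7 or later.

```python
def can_encode_ascii(text):
    """Return True if text contains only ASCII characters."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # Raised when a character falls outside the 128 ASCII code points.
        return False

# Illustrative strings, not taken from the assignment data.
plain = "Dirac"
accented = "Schrödinger"

print(plain.isascii(), can_encode_ascii(plain))
print(accented.isascii(), can_encode_ascii(accented))
```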
Convert the Unicode-formatted names into ASCII-formatted names, and save the names to a file named data-ascii.txt (one name per line). We have provided you with a transliteration dictionary that maps several common Unicode characters to their ASCII transliterations. Use this dictionary to convert the Unicode strings to ASCII.
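A dictionary-driven replacement can be sketched as follows. The mapping `demo_dict` and the helper `transliterate` are hypothetical; the mapping has the same shape as the provided transliteration dictionary but deliberately uses different characters.

```python
# Hypothetical two-entry mapping, shaped like the provided dictionary.
demo_dict = {"é": "e", "ß": "ss"}

def transliterate(text, mapping):
    """Replace every key of mapping that occurs in text with its value."""
    for unicode_char, ascii_text in mapping.items():
        # str.replace returns a new string; strings are immutable.
        text = text.replace(unicode_char, ascii_text)
    return text

print(transliterate("Café Straße", demo_dict))
```

Note that a replacement value may be longer than the character it replaces, so the output string can grow.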
translit_dict = {
    "ä" : "ae",
    "ö" : "oe",
    "ü" : "ue",
    "Ä" : "Ae",
    "Ö" : "Oe",
    "Ü" : "Ue",
    "ł" : "l",
    "ō" : "o",
}
with open("data.csv", 'r', encoding='utf8') as csvfile:
    lines = csvfile.readlines()
# Strip off the newline from the end of each line
# Split each line based on the delimiter (which, in this case, is the comma)
# Separate the header from the data
# Find "name" within the header
# Extract the names from the rows
unicode_names = []
# Iterate over the names
translit_names = []
for unicode_name in unicode_names:
    # Perform the replacements in the translit_dict
    # HINT: ref [1]
    False
# Write out the names to a file named "data-ascii.txt"
# HINT: ref [2]
# Verify that the names were converted and written out correctly
with open("data-ascii.txt", 'r') as infile:
    for line in infile:
        print(line.rstrip())
Expected Output:
Richard Phillips Feynman
Shin'ichiro Tomonaga
Julian Schwinger
Rudolf Ludwig Moessbauer
Erwin Schroedinger
Paul Dirac
Maria Sklodowska-Curie
Pierre Curie
References:
1: replace
2: file object methods
Free-Form Questions:
Q1. Your solutions for Problems 1 & 2 probably share a lot of code. You might even have copied and pasted from Problem 1 into Problem 2. Refactor parse_delimited_file to be useful in both problems.
# Add here your code
Q2. Are there any pre-built Python packages that could help you solve these problems? If yes, refactor your solutions to use those packages.
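As one possible starting point, Python's standard library includes the `csv` module, which handles delimiters and headers. The sketch below reads a hypothetical in-memory tab-separated sample (not the assignment's data.tsv) with `csv.DictReader`; it is an illustration under those assumptions, not the expected answer.

```python
import csv
import io

# Hypothetical tab-separated sample; it is NOT the assignment's data.tsv.
sample = io.StringIO("city\tscore\nOslo\t10\nLima\t20\n")

# DictReader treats the first row as the header and yields one dict per data row.
reader = csv.DictReader(sample, delimiter="\t")
rows = list(reader)

# Columns are addressed by name, so their order in the file does not matter.
average = sum(int(row["score"]) for row in rows) / len(rows)

print(len(rows), len(reader.fieldnames), average)
```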
# Add here your code
Q3. Tell us about your experience (for each point below provide a couple of sentences).
Describe the challenges you faced in addressing these tasks and how you overcame these challenges.
Did you work with other students on this assignment? If yes, how did you help them? How did they help you?
Write here your answers
Live Chat: The History of Big Data
In her keynote at Supercomputing 2013, Intel's Genevieve Bell shows that we have been dealing with big data for millennia, and that approaching big data problems with the right frame of reference is the key to addressing many of the problems we face today: https://youtu.be/CNoi-XqwJnA
List three key concepts you learned by watching the video.
Write here your answers
Live Chat: What we learned from 5 million books
Watch the talk by Jean-Baptiste Michel and Erez Lieberman Aiden, “What we learned from 5 million books” (https://www.ted.com/talks/jean_baptiste_michel_erez_lieberman_aiden_what_we_learned_from_5_million_books), then answer these questions related to the talk:
What is the take-away of this talk? Summarize it in up to 3 sentences.
What are metadata?
What is an n-gram?
What is the suppression index?
What is culturomics?
Write here your answers