CS 594 / CS 690 - Assignment 01
For this assignment, you must work in groups of one or two students. Each person is responsible for writing their own code, but the group will discuss the solution together. In this notebook, we provide you with basic functions for completing the assignment. You will need to modify existing code and write new code to find a solution. Each member of the group must upload their own work to GitHub (which we will cover in the next lecture).
Problem 1
In this problem we will explore reading in and parsing delimiter-separated values stored in files. We will start with comma-separated values and then move on to tab-separated values.
Problem 1a: Comma-Separated Values (CSV)
From Wikipedia: In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
If you were to consider the CSV file as a matrix, each line would represent a row and each comma would mark the boundary between two adjacent columns. In the provided CSV file, the first row is a header that names each column. In this problem, ...
Count (and print) the number of rows of data (header is excluded) in the csv file
Count (and print) the number of columns of data in the csv file
Calculate (and print) the average of the values that are in the "age" column
You can assume each age in the file is an integer, but the average should be calculated as a float
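To make the matrix view concrete, here is a minimal sketch on a small made-up CSV string. The names and ages are hypothetical and are not the contents of the provided data.csv; the sketch only illustrates how the row count, column count, and a float average fall out of splitting on the delimiter.

```python
# Hypothetical sample data, NOT the contents of the provided data.csv
sample = "name,age\nAda,36\nGrace,85\nAlan,41"

# Each line is a row; splitting a line on the comma yields that row's columns
rows = [line.split(",") for line in sample.split("\n")]
header, data = rows[0], rows[1:]

num_rows = len(data)    # data rows only, the header is excluded
num_cols = len(header)  # one column per header field

# Ages are parsed as integers; "/" is true division in Python 3,
# so the average is a float even though every age is an int
ages = [int(row[header.index("age")]) for row in data]
average_age = sum(ages) / num_rows

print(num_rows, num_cols, average_age)  # prints: 3 2 54.0
```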
def parse_delimited_file(filename, delimiter=","):
    # Open and read in all lines of the file
    # (I do not recommend readlines for LARGE files)
    # `open`: ref [1]
    # `readlines`: ref [2]
    with open(filename, 'r', encoding='utf8') as dsvfile:
        lines = dsvfile.readlines()
    # Strip off the newline from the end of each line
    # HINT: ref [3]
    # Using list comprehension is the recommended pythonic way to iterate through lists
    # HINT: ref [4]
    lines = [line.strip("\n") for line in lines]
    # Split each line based on the delimiter (which, in this case, is the comma)
    # HINT: ref [5]
    lines = [line.split(delimiter) for line in lines]
    # Separate the header from the data
    # HINT: ref [6]
    headers = lines[0]
    lines = lines[1:]
    # Find "age" within the header
    # (i.e., calculating the column index for "age")
    # HINT: ref [7]
    age_index = headers.index("age")
    # Calculate the number of data rows and columns
    # HINT: ref [8]
    num_data_rows = len(lines)
    num_data_cols = len(headers)
    # Sum the "age" values
    # HINT: ref [9]
    age_sum = sum([int(line[age_index]) for line in lines])
    # Calculate the average age
    ave_age = age_sum / num_data_rows
    # Print the results
    # `format`: ref [10]
    print("Number of rows of data: {}".format(num_data_rows))
    print("Number of cols: {}".format(num_data_cols))
    print("Average Age: {}".format(ave_age))
# Parse the provided csv file
parse_delimited_file('data.csv')
Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
Expected Output:
Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
References:
1: open
2: readlines
3: rstrip
4: list comprehension
5: split
6: slice
7: "more on lists"
8: len
9: int
10: format
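The solution's comment advises against readlines for large files. One possible alternative, sketched below, iterates over the file object so that only one line is held in memory at a time. The function name average_age_streaming is hypothetical, and the sketch assumes the file has an "age" column and at least one data row.

```python
def average_age_streaming(filename, delimiter=","):
    """Return (number of data rows, average age) without loading the whole file."""
    with open(filename, "r", encoding="utf8") as dsvfile:
        # next() consumes only the first line, which is the header
        headers = next(dsvfile).rstrip("\n").split(delimiter)
        age_index = headers.index("age")

        num_data_rows = 0
        age_sum = 0
        # Iterating over the file object yields one line at a time
        for line in dsvfile:
            if not line.strip():
                continue  # skip blank lines, e.g. a trailing empty line
            fields = line.rstrip("\n").split(delimiter)
            age_sum += int(fields[age_index])
            num_data_rows += 1

    # Assumes at least one data row; otherwise this raises ZeroDivisionError
    return num_data_rows, age_sum / num_data_rows
```

Returning the values instead of printing them keeps the function reusable; the caller can still format them with print as in the solution above.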
Problem 1b: Tab-Separated Values (TSV)
From Wikipedia: A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure, e.g., database table or spreadsheet data, and a way of exchanging information between databases. Each record in the table is one line of the text file. Each field value of a record is separated from the next by a tab character. The TSV format is thus a type of the more general delimiter-separated values format.
In this problem, repeat the analyses performed in the previous problem, but for the provided tab-delimited file.
NOTE: the order of the columns has changed in this file. If you hardcoded the position of the "age" column, think about how you can generalize the parse_delimited_file function to work for any delimited file with an "age" column.
parse_delimited_file('data.tsv', delimiter='\t')
Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
Expected Output:
Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
Problem 2
If you opened the data.csv file, you may have noticed some non-English letters in the names column. These characters are represented using Unicode, a standard for representing many different types and forms of text. Python 3 natively supports Unicode, but many tools do not. Some tools require text to be encoded as ASCII.
Convert the Unicode-formatted names into ASCII-formatted names, and save the names to a file named data-ascii.txt (one name per line). We have provided you with a transliteration dictionary that maps several common Unicode characters to their ASCII transliterations. Use this dictionary to convert the Unicode strings to ASCII.
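As a small illustration of the difference, the sketch below checks a single name containing "ö" (written with the escape \u00f6 so the code point is unambiguous). It only uses built-in string methods; str.isascii requires Python 3.7 or later.

```python
name = "Schr\u00f6dinger"  # "ö" (U+00F6) is outside the ASCII range

# str.isascii() reports whether every character is an ASCII character
print(name.isascii())  # prints: False

# UTF-8 can encode the name; "ö" takes two bytes, so there are more bytes than characters
utf8_bytes = name.encode("utf8")
print(len(name), len(utf8_bytes))  # prints: 11 12

# ASCII cannot encode "ö", so the strict encoder raises UnicodeEncodeError
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # prints: False
```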
translit_dict = {
    "ä" : "ae",
    "ö" : "oe",
    "ü" : "ue",
    "Ä" : "Ae",
    "Ö" : "Oe",
    "Ü" : "Ue",
    "ł" : "l",
    "ō" : "o",
}
def my_translit(char):
    # Return the ASCII transliteration if one is defined, else the character itself
    if char in translit_dict:
        return translit_dict[char]
    return char
with open("data.csv", 'r', encoding='utf8') as csvfile:
    lines = csvfile.readlines()
# Strip off the newline from the end of each line
lines = [line.strip("\n") for line in lines]
# Split each line based on the delimiter (which, in this case, is the comma)
lines = [line.split(",") for line in lines]
# Separate the header from the data
header = lines[0]
lines = lines[1:]
# Find "name" within the header
name_index = header.index("name")
# Extract the names from the rows
unicode_names = [line[name_index] for line in lines]
# Iterate over the names
translit_names = []
for unicode_name in unicode_names:
    # Perform the replacements in the translit_dict
    # HINT: ref [1]
    translit_names.append("".join([my_translit(char) for char in unicode_name]))
# Write out the names to a file named "data-ascii.txt"
# HINT: ref [2]
with open("data-ascii.txt", "w") as file:
    for name in translit_names:
        file.write(name + "\n")
# Verify that the names were converted and written out correctly
with open("data-ascii.txt", 'r') as infile:
    for line in infile:
        print(line.rstrip())
Richard Phillips Feynman
Shin'ichiro Tomonaga
Julian Schwinger
Rudolf Ludwig Moessbauer
Erwin Schroedinger
Paul Dirac
Maria Sklodowska-Curie
Pierre Curie
Expected Output:
Richard Phillips Feynman
Shin'ichiro Tomonaga
Julian Schwinger
Rudolf Ludwig Moessbauer
Erwin Schroedinger
Paul Dirac
Maria Sklodowska-Curie
Pierre Curie
References:
1: replace
2: file object methods
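The solution above transliterates character by character, while reference 1 points at replace. Both approaches can be sketched side by side, along with a third option based on str.maketrans and str.translate. The helper names translit_with_replace and translit_with_table are hypothetical, and the dictionary is repeated only so the block runs on its own.

```python
# Same mapping as the provided translit_dict, repeated so this block is self-contained
translit_dict = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ł": "l", "ō": "o",
}

def translit_with_replace(text):
    """Hypothetical helper: apply str.replace once per dictionary entry."""
    for source, target in translit_dict.items():
        text = text.replace(source, target)
    return text

# str.maketrans accepts a dict whose keys are single characters;
# the values may be strings of any length
translit_table = str.maketrans(translit_dict)

def translit_with_table(text):
    """Hypothetical helper: transliterate in a single pass with str.translate."""
    return text.translate(translit_table)

print(translit_with_replace("Schr\u00f6dinger"))  # prints: Schroedinger
print(translit_with_table("Sk\u0142odowska"))      # prints: Sklodowska
```

Characters that are not in the dictionary pass through unchanged in both helpers, so a name can still contain non-ASCII characters afterwards if the dictionary does not cover them.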
Free-Form Questions:
Answer the following questions, in a couple of sentences each, in the cells provided below.
Your solutions for Problems 1 & 2 probably share a lot of code in common. You might even have copied-and-pasted from Problem 1 into Problem 2. How would you refactor parse_delimited_file to be useful in both problems?
Are there any pre-built Python packages that could help you solve these problems? How could you use them?
List the key tasks you accomplished during this assignment.
Describe the challenges you faced in addressing these tasks and how you overcame these challenges.
Did you work with other students on this assignment? If yes, how did you help them? How did they help you? Be as specific as possible.
Your solutions for Problems 1 & 2 probably share a lot of code in common. You might even have copied-and-pasted from Problem 1 into Problem 2. How would you refactor parse_delimited_file to be useful in both problems?
I would modify parse_delimited_file to accept a column name as an additional argument, and I would have the function return the specified column as a list. In Problem 1, I could use parse_delimited_file to extract column "age", and in Problem 2, I could use the function to extract column "name".
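A minimal sketch of that refactoring might look as follows. The parameter name column and the decision to return raw strings (leaving any int conversion to the caller) are assumptions, not part of the provided code.

```python
def parse_delimited_file(filename, column, delimiter=","):
    """Return the values of the named column as a list of strings."""
    with open(filename, "r", encoding="utf8") as dsvfile:
        # Strip the trailing newline and split each line into its fields
        lines = [line.rstrip("\n").split(delimiter) for line in dsvfile]

    # The first line is the header; the remaining lines are data rows
    headers, rows = lines[0], lines[1:]
    column_index = headers.index(column)
    return [row[column_index] for row in rows]

# Possible usage with the assignment's files:
#   ages = [int(age) for age in parse_delimited_file("data.csv", "age")]
#   names = parse_delimited_file("data.tsv", "name", delimiter="\t")
```

With this shape, Problem 1 reduces to averaging the converted ages and Problem 2 to transliterating the returned names.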
Are there any pre-built Python packages that could help you solve these problems? How could you use them?
The pandas module is well suited to these tasks. It offers a function, read_csv, that loads a data file into a pandas DataFrame. read_csv lets us optionally specify the delimiter as well as the columns to be loaded. The DataFrame object also offers a method to compute column means.
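pandas is a third-party dependency. As a point of comparison that needs no installation, the same load-and-average step can be sketched with the standard library's csv module. The in-memory sample below is hypothetical and stands in for the provided data file.

```python
import csv
import io

# Hypothetical tab-separated sample, NOT the contents of the provided data.tsv
sample = "age\tname\n36\tAda\n85\tGrace\n"

# csv.DictReader uses the first row as field names and yields one dict per data row
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
ages = [int(record["age"]) for record in reader]

mean_age = sum(ages) / len(ages)
print(mean_age)  # prints: 60.5
```

Because each record is keyed by column name, the order of the columns in the file does not matter.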
List the key tasks you accomplished during this assignment.
a. I wrote a function that loads a delimited file, parses rows and columns, and extracts a specified column.
b. I loaded data and computed the mean of a specific column.
c. I loaded data and converted Unicode strings to ASCII in a specific column.
Describe the challenges you faced in addressing these tasks and how you overcame these challenges.
I could not always remember the exact function name or syntax for certain operations, e.g., concatenating a list of characters into a string. I used Stack Overflow to look up the appropriate function and correct syntax.
Did you work with other students on this assignment? If yes, how did you help them? How did they help you? Be as specific as possible.
I did not work with anyone on this assignment.