COSC 526 - Assignment 01

 
In this notebook, we provide you with basic functions for completing the assignment. You will need to modify existing code and write new code to find a solution. Each member of the group must upload their own work to their personal GitHub repository, which we set up during the first class.
 
Practical Tasks:
This set of practical tasks is to be completed during the first class.

Definitions:

GitHub: web-based hosting service for version control used to distribute and collect assignments as well as other class materials (e.g., slides, code, and datasets)
Git: the version control software that GitHub hosts repositories for
Practical Tasks:

Create your own GitHub account
Submit your GitHub username to the Google form: https://forms.gle/CKugke8Dzqjm9tQ89
Install Git on your laptop
 
This assignment is due (pushed to your personal class GitHub repository) at the start of the second class.

Problem 1
In this problem we explore reading in and parsing delimiter-separated values stored in files. We start with comma-separated values and then move on to tab-separated values.

Problem 1a: Comma-Separated Values (CSV)
From Wikipedia: In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.

If you were to consider the CSV file as a matrix, each line would represent a row and each comma would mark the boundary between two columns. In the provided CSV file, the first row consists of a header that "names" each column. In this problem, ...

Count (and print) the number of rows of data (header is excluded) in the csv file
Count (and print) the number of columns of data in the csv file
Calculate (and print) the average of the values that are in the "age" column. You can assume each age in the file is an integer, but the average should be calculated as a float.
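As a minimal sketch of the row and column view described above, the snippet below parses a small in-memory table. The sample rows and the column name "score" are hypothetical and are not taken from the provided file.

```python
# Hypothetical sample data; NOT the contents of data.csv
lines = ["name,score\n", "Ada,90\n", "Linus,85\n"]

# Strip the trailing newline, then split each line on the comma
rows = [line.rstrip("\n").split(",") for line in lines]

# The first row is the header; the remaining rows are data
header, data = rows[0], rows[1:]

# Look up the column position by name instead of hardcoding it
score_idx = header.index("score")

# Sum the values as integers; true division yields a float average
total = sum(int(row[score_idx]) for row in data)
average = total / len(data)

print(len(data), len(header), average)
```

Here the number of data rows is the length of the list after the header is removed, and the number of columns is the length of the header.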
In [1]:
def parse_delimited_file(filename, delimiter=","):
    # Open and read in all lines of the file
    # (I do not recommend readlines for LARGE files)
    # `open`: ref [1]
    # `readlines`: ref [2]
    with open(filename, 'r', encoding='utf8') as dsvfile:
        lines = dsvfile.readlines()

    # Strip off the newline from the end of each line
    # Using list comprehension is the recommended pythonic way to iterate through lists
    # HINT: refs [3,4]

    
    # Split each line based on the delimiter (which, in this case, is the comma)
    # HINT: ref [5]

    
    # Separate the header from the data
    # HINT: ref [6]

    
    # Find "age" within the header
    # (i.e., calculating the column index for "age")
    # HINT: ref [7]


    # Calculate the number of data rows and columns
    # HINT: ref [8]
    num_data_rows = 0
    num_data_cols = 0
    
    # Sum the "age" values
    # HINT: ref [9]
        
        
    # Calculate the average age
    ave_age = 0
    
    # Print the results
    # `format`: ref [10]
    print("Number of rows of data: {}".format(num_data_rows))
    print("Number of cols: {}".format(num_data_cols))
    print("Average Age: {}".format(ave_age))
    
# Parse the provided csv file
parse_delimited_file('data.csv')

 
Number of rows of data: 0
Number of cols: 0
Average Age: 0

 
Expected Output:

Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
References:

1: open
2: readlines
3: list comprehension
4: rstrip
5: split
6: slice
7: "more on lists"
8: len
9: int
10: format
 
Problem 1b: Tab-Separated Values (TSV)
From Wikipedia: A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure, e.g., database table or spreadsheet data, and a way of exchanging information between databases. Each record in the table is one line of the text file. Each field value of a record is separated from the next by a tab character. The TSV format is thus a type of the more general delimiter-separated values format.

In this problem, repeat the analyses performed in the previous problem, but for the provided tab-delimited file.

NOTE: the order of the columns has changed in this file. If you hardcoded the position of the "age" column, think about how you can generalize the parse_delimited_file function to work for any delimited file with an "age" column.
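One way to make the lookup independent of column order, sketched here under assumptions: the helper name `column_values` and the sample rows are hypothetical, and the header is searched by name for whichever delimiter is passed in.

```python
def column_values(lines, column, delimiter=","):
    """Return the values of the named column (hypothetical helper)."""
    rows = [line.rstrip("\n").split(delimiter) for line in lines]
    header, data = rows[0], rows[1:]
    idx = header.index(column)  # raises ValueError if the column is missing
    return [row[idx] for row in data]

# Hypothetical samples: same data, different delimiter and column order
csv_lines = ["name,age\n", "A,30\n", "B,40\n"]
tsv_lines = ["age\tname\n", "30\tA\n", "40\tB\n"]

csv_ages = column_values(csv_lines, "age")
tsv_ages = column_values(tsv_lines, "age", delimiter="\t")
print(csv_ages, tsv_ages)
```

Both calls return the same values even though "age" is the second column in one sample and the first in the other.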
In [ ]:
# Further reading on optional arguments, like "delimiter": http://www.diveintopython.net/power_of_introspection/optional_arguments.html
parse_delimited_file('data.tsv', delimiter="\t")

 
Expected Output:

Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
 
Problem 2
If you opened the data.csv file, you may have noticed some non-English letters in the names column. These characters are represented using Unicode, a standard for representing many different types and forms of text. Python 3 natively supports Unicode, but many tools do not. Some tools require text to be formatted with ASCII.

Convert the Unicode-formatted names into ASCII-formatted names, and save the names out to a file named data-ascii.txt (one name per line). We have provided you with a transliteration dictionary that maps several common Unicode characters to their ASCII transliteration. Use this dictionary to convert the Unicode strings to ASCII.
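Dictionary-driven replacement can be sketched as follows. The two-entry mapping, the sample string, and the helper name `transliterate` are hypothetical and chosen only for illustration; `str.isascii` assumes Python 3.7 or later.

```python
# Hypothetical two-entry mapping, for illustration only
mapping = {"é": "e", "ß": "ss"}

def transliterate(text, table):
    """Replace every key of `table` found in `text` with its value."""
    for source, target in table.items():
        text = text.replace(source, target)
    return text

converted = transliterate("Café Straße", mapping)
print(converted)            # Cafe Strasse
print(converted.isascii())  # True
```

Characters that have no entry in the mapping pass through unchanged, so checking the result with `isascii` is a quick way to spot anything the dictionary missed.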
In [ ]:
translit_dict = {
    "ä" : "ae",
    "ö" : "oe",
    "ü" : "ue",
    "Ä" : "Ae",
    "Ö" : "Oe",
    "Ü" : "Ue", 
    "ł" : "l",
    "ō" : "o",
}

with open("data.csv", 'r', encoding='utf8') as csvfile:
    lines = csvfile.readlines()

# Strip off the newline from the end of each line

    
# Split each line based on the delimiter (which, in this case, is the comma)


# Separate the header from the data

    
# Find "name" within the header


# Extract the names from the rows
unicode_names = []

# Iterate over the names
translit_names = []
for unicode_name in unicode_names:
    # Perform the replacements in the translit_dict
    # HINT: ref [1]
    False

# Write out the names to a file named "data-ascii.txt"
# HINT: ref [2]


# Verify that the names were converted and written out correctly
with open("data-ascii.txt", 'r') as infile:
    for line in infile:
        print(line.rstrip())

 
Expected Output:

Richard Phillips Feynman
Shin'ichiro Tomonaga
Julian Schwinger
Rudolf Ludwig Moessbauer
Erwin Schroedinger
Paul Dirac
Maria Sklodowska-Curie
Pierre Curie
References:

1: replace
2: file object methods
 
Free-Form Questions:
 
Q1. Your solutions for Problems 1 & 2 probably share a lot of code. You might even have copied and pasted from Problem 1 into Problem 2. Refactor parse_delimited_file to be useful in both problems.
In [5]:
# Add here your code

 
Q2. Are there any pre-built Python packages that could help you solve these problems? If yes, refactor your solutions to use those packages.
In [6]:
# Add here your code
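One candidate package for Q2 is the standard-library `csv` module. The sketch below is only a starting point, not a full solution: the sample data is hypothetical, and `io.StringIO` stands in for an opened file.

```python
import csv
import io

# Hypothetical tab-separated sample; io.StringIO stands in for an open file
sample = io.StringIO("age\tname\n30\tA\n40\tB\n")

# DictReader uses the first row as the field names
reader = csv.DictReader(sample, delimiter="\t")
records = list(reader)

# Each record maps column name to string value, so column order no longer matters
ages = [int(record["age"]) for record in records]
print(len(records), reader.fieldnames, sum(ages) / len(ages))
```

With a real file, the `csv` documentation recommends opening it with `newline=''` before passing it to the reader.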

 
Q3. Tell us about your experience (for each point below provide a couple of sentences).

Describe the challenges you faced in addressing these tasks and how you overcame these challenges.
Did you work with other students on this assignment? If yes, how did you help them? How did they help you?
 
Write here your answers
 
Live Chat: The History of Big Data
 
In her keynote at Supercomputing 2013, Intel's Genevieve Bell shows that we have been dealing with big data for millennia, and that approaching big data problems with the right frame of reference is the key to addressing many of the problems we face today: https://youtu.be/CNoi-XqwJnA

List three key concepts you learned by watching the video.
 
Write here your answers
 
Live Chat: What we learned from 5 million books!
 
Jean-Baptiste Michel and Erez Lieberman Aiden tell us about “What we learned from 5 million books” https://www.ted.com/talks/jean_baptiste_michel_erez_lieberman_aiden_what_we_learned_from_5_million_books

Answer these questions related to the talk:

What is the take-away of this talk? Summarize it in up to 3 sentences.
What are metadata?
What is an n-gram?
What is the suppression index?
What is culturomics?
 
Write here your answers
 
Reading Assignment: MapReduce: Simplified Data Processing on Large Clusters
 
Use the three-pass approach to read the paper: Jeffrey Dean and Sanjay Ghemawat (2004) MapReduce: Simplified Data Processing on Large Clusters.
