$35
CS 594 / CS 690 - Assignment 03
For this assignment, you must work in groups of one or two students. Each person is responsible to write their own code, but the group will (together) discuss their solution. In this notebook, we provide you with basic functions for completing the assignment. Complete the assignment in this notebook. You will need to modify existing code and write new code to find a solution. Each member of the group must upload their own work (i.e., a notebook file) to GitHub.
Note: Running a cell will not rerun previous cells. If you edit code in previous cells, you must rerun those cells. We recommend using Run All to avoid any errors results from not rerunning previous cells. You can find this in the menu above: Cell -> Run All
Data Pre-Processing:
Below is code to process a unicode text file into a string of only upper-case A-Z characters. We use this code to read the text file (i.e., "The Count of Monte Cristo") and prepare the text for the following three problems. The output string, which you should use for solving the assignment problems, is named text_upper.
# Import regular expressions library
import re
# Read the text file
with open('book_CountOfMonteCristo.txt', 'r') as f:
text_lines = f.readlines()
# Concatenate the list of strings into a single string
text_all = ''.join(text_lines)
# Remove all non-alphabet characters with a regular expression
text_alpha = re.sub(r'[^a-zA-Z]', ' ', text_all)
# Convert characters to upper-case
text_upper = text_alpha.upper()
# Uncomment the following line if you would like to see the first 100 characters
#print(text_upper[:100])
Problem 1:
Analyze the text for word length frequency. We might expect short words to be more common than long words. But, are words of length 2 more common than words or length 3? Are words of length 3 more common that words of length 4? Use the text you parsed in the previous cell to count the frequency of each word length in the text. Below, we provide you with the first step to solve this problem and hints for what to do next.
# Convert the string of text into a list of words and remove empty words
# HINT: ref [1]
words = [w for w in text_upper.split(' ') if w is not '']
# Uncomment the following line to see the first 100 words
#print(words[:100])
# Define dictionary to store count of word lengths
# HINT: ref [2]
word_len_dict = {}
# Convert list of words to list of word lengths
word_lens = [len(w) for w in words]
# Count the words of each length and store in the dictionary
# HINT: ref [3]
for wl in word_lens:
if wl in word_len_dict.keys():
word_len_dict[wl] += 1
else:
word_len_dict.update({wl: 1})
# Sort word length by most common into a list of (word length, count) tuples
# HINT: ref [4,5,6]
word_len_tuples = sorted(word_len_dict.items(), key=lambda x: (-x[1], x[0]))
# Print the 6 most common word lengths
# HINT: ref [7]
print("{:^11} : {:^5}".format("Word Length", "Count"))
for wl, cnt in word_len_tuples[:6]:
print("{:^11} : {:^5}".format(wl, cnt))
Word Length : Count
3 : 109798
2 : 84021
4 : 81777
5 : 49101
6 : 39015
7 : 30701
Expected Output:
Word Length : Count
3 : 109798
2 : 84021
4 : 81777
5 : 49101
6 : 39015
7 : 30701
References:
1: list comprehensions
2: dictionaries
3: for loops
4: sorted
5: dict.items
6: lambda expressions
7: format string syntax
Problem 2:
Analyze the text for letter frequency. If you have taken a crypto course and/or have seen substitution ciphers then you are probably aware that ’e’ is the most common letter used in the English language. Use the text you parsed above to count the frequency of each letter in the text. Below, we provide you with the first step to solve this problem and hints for what to do next.
# Import string library to help us make a dictionary for this solution
import string
# Concatenate the list of words into a string of characters
# HINT: ref [1]
chars = "".join(words)
# Define a dictionary for counting letter frequency
# HINT: ref [2,3]
char_count_dict = {c:0 for c in string.ascii_uppercase}
# Count the letters and store in the dictionary
for c in chars:
char_count_dict[c] += 1
# Sort letters by most common into a list of (letter, count) tuples
char_count_tuples = sorted(char_count_dict.items(), key=lambda x: (-x[1], x[0]))
# Print the 6 most common characters
# HINT: ref [4]
print("{:^9} : {:^5}".format("Character", "Count"))
for char, cnt in char_count_tuples[:6]:
print("{:^9} : {:^5}".format(char, cnt))
Character : Count
E : 258693
T : 180211
A : 165306
O : 156817
I : 142095
N : 137343
Expected Output:
Character : Count
E : 258693
T : 180211
A : 165306
O : 156817
I : 142095
N : 137343
References:
1: str.join
2: list comprehensions
3: dictionaries
4: format string syntax
Problem 3:
If we really wanted to crack a substitution cipher (or win on ”Wheel of Fortune”) then we should be aware that, although "e" is the most common letter used in English, it may not be the most common first letter in a word. Count the positional frequencies of each letter. Specifically, count the number of times each letter appears as the first letter in a word, as the last letter in a word, and as an interior letter in a word (i.e. a letter that is neither first nor last). Below, we provide you with the first step to solve this problem and hints for what to do next.
Below, we define a method which takes in a word (string) and returns a list describing the position of each letter (character). Each item in the list has the format (letter, position). The position can have the following values: 0: first letter, 1: interior letter, 2: last letter
# Define a method to return position of each character in a word
def lettersPosition(word):
if len(word) == 1:
# Base case for words of length 1
return [(word, 0)]
else:
# Get first and last letters
first, last = word[0], word[-1]
pos_list = [(first, 0), (last, 2)]
# Get interior letters
interior = word[1:-1]
for char in interior:
pos_list.append((char, 1))
return pos_list
We can call this method for each word in our list of words and sum the values to obtain letter position frequencies.
# Define a dictionary for counting letter position frequency
char_pos_dict = {c:[0,0,0] for c in string.ascii_uppercase}
# Apply our letter location method to each word and sum values in our dictionary
for w in words:
for char, pos in lettersPosition(w):
char_pos_dict[char][pos] += 1
# Print the position frequency of the first 6 letters in the alphabet
print("{:^9} : {:^8} | {:^8} | {:^8}".format("Character", "First", "Interior", "Last"))
for char in string.ascii_uppercase[:6]:
pos = char_pos_dict[char]
print("{:^9} : {:^8} | {:^8} | {:^8}".format(char, pos[0], pos[1], pos[2]))
Character : First | Interior | Last
A : 51644 | 111686 | 1976
B : 18866 | 8516 | 541
C : 19577 | 32130 | 725
D : 17289 | 18613 | 58075
E : 10178 | 153205 | 95310
F : 17724 | 10618 | 16988
Expected Output:
Character : First | Interior | Last
A : 51644 | 111686 | 1976
B : 18866 | 8516 | 541
C : 19577 | 32130 | 725
D : 17289 | 18613 | 58075
E : 10178 | 153205 | 95310
F : 17724 | 10618 | 16988
Problem 4:
For problems 1, 2, and 3 you may want to present your results in a (graphically) nice way. This could be done with a histogram. It is probably easiest to build the histogram using whatever software package you feel comfortable with, but we recommend the python module matplotlib. Make sure to give your plots meaningful labels (including axis labels and a title).
# Import matplotlib's pyplot
# HINT: ref[1]
from matplotlib import pyplot as plt
We provide the code to plot the histogram of word lengths from problem 1 below. Use this code as a template to produce histograms for problems 2 and 3. Note: this code assumes that you have a dictionary, word_len_dict, with keys containing the word length and values containing the count of words with a given length (i.e., an entry of 3:109798 in the dictionary indicates that there are 109,798 words of length 3).
# Sort word length dictionary by length of word
wl_sorted = sorted(word_len_dict.items(), key=lambda x: x[0])
# Get X and Y values
# HINT: ref[2]
X_vals, Y_vals = zip(*wl_sorted)
# Plot the histogram for problem 1'
# HINT: ref[3]
plt.bar(X_vals, Y_vals, 0.75)
plt.xlim((.125, len(X_vals)))
plt.xticks(range(1,len(X_vals)+1,2))
plt.xlabel('Word Length')
plt.ylabel('Word Length Count')
plt.title('Word Length Count in The Count of Monte Cristo')
plt.show()
References:
1: pyplot tutorial
2: zip
3: pyplot.bar
In the cell below, write the code for the visualization of results from problem 2.
# Sort character count dictionary alphabetically
cc_sorted = sorted(char_count_dict.items(), key=lambda x: x[0])
# Get X and Y values
# HINT: ref[2]
X_vals, Y_vals = zip(*cc_sorted)
# Plot the histogram for problem 2'
# HINT: ref[3]
plt.bar(range(1, len(X_vals)+1), Y_vals, 0.75)
plt.xlim((.125, len(X_vals)))
plt.xticks(range(1,len(X_vals)+1), X_vals)
plt.xlabel('Character')
plt.ylabel('Character Count')
plt.title('Character Count in The Count of Monte Cristo')
plt.show()
In the cell below, write the code for the visualization of results from problem 3.
# Sort character positional count dictionary alphabetically
cp_sorted = sorted(char_pos_dict.items(), key=lambda x: x[0])
# Get X and Y values
# HINT: ref[2]
X_vals, Y_vals = zip(*cp_sorted)
# Plot the histogram for problem 3'
# HINT: ref[3]
for pos, label in [(0, "First"), (1, "Interior"), (2, "Last")]:
plt.bar(range(1, len(X_vals)+1), [y[pos] for y in Y_vals], 0.75)
plt.xlim((.125, len(X_vals)))
plt.xticks(range(1,len(X_vals)+1), X_vals)
plt.xlabel('Character')
plt.ylabel('Character Count')
plt.title('Character Count ('+label+' Position) in The Count of Monte Cristo')
plt.show()
Things to Consider:
You are reading and analyzing the text in a single, sequential pass. Can you consider a better (faster) way to perform the same tasks?
Assignment Questions:
Answer the following questions, in a couple sentences each, in the cell provided below
List the key tasks you accomplished during this assignment?
Describe the challenges you faced in addressing these tasks and how you overcame these challenges?
Did you work with other students on this assignment? If yes, how did you help them? How did they help you? Be as specific as possible.
List the key tasks you accomplished during this assignment?
I gathered word frequencies, character frequencies, and character positional frequencies from "The Count of Monte Cristo". I printed the results as formatted tables and also plotted the frequencies as histograms.
Describe the challenges you faced in addressing these tasks and how you overcame these challenges?
I had to look up how to use the "sorted" function to sort a list of tuples. Each character has three positional frequencies (first letter in word, interior to word, and last letter in word). I had to decide how best to present these frequencies as histograms. I decided to plot them as three separate histograms.
Did you work with other students on this assignment? If yes, how did you help them? How did they help you? Be as specific as possibl
I did not work with anyone else on this assignment.