CS 594 / CS 690 - Assignment 04
For this assignment, you must work in groups of one or two students. Each person is responsible for writing their own code, but the group will discuss their solution together. In this notebook, we provide you with basic functions for completing the assignment. Complete the assignment in this notebook. You will need to modify existing code and write new code to find a solution. Each member of the group must upload their own work (i.e., a notebook file) to GitHub.
Note: Running a cell will not rerun previous cells. If you edit code in previous cells, you must rerun those cells. If you are having trouble with undefined-name errors or code changes not taking effect, we recommend using Run All to avoid any errors resulting from not rerunning previous cells. You can find this in the menu above: Cell -> Run All
During lecture 3, we learned about the MapReduce programming model. In this assignment, we will use the MapReduce programming model to solve the text parsing problems from assignment 3. Python provides map and reduce functions for iterables that do not take advantage of parallel processing (i.e., they are sequential), but they work in a similar way to the parallel implementations you find in Apache Spark. We define three methods (i.e., mapSequential, reduceSequential, and reduceByKeySequential) that extend Python's map and reduce functions to act like those in Apache Spark. We will use sequential MapReduce to develop methods that can later be used with the parallel MapReduce from Apache Spark.
In the following table, we have listed examples of inputs and outputs to different MapReduce methods. See if you can determine how the output was computed with the input:
Input                          | Function    | MapReduce Call | Output
[1,2,3]                        | f(x)=x+1    | Map            | [2,3,4]
[1,2,3]                        | f(x,y)=x+y  | Reduce         | 6
[('a', 1), ('b', 2), ('a', 3)] | f(x,y)=x+y  | ReduceByKey    | [('a', 4), ('b', 2)]
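For comparison, here is a hedged sketch of how the same three calls look in Apache Spark's PySpark RDD API. This is illustrative only and assumes a SparkContext named sc, which we do not create in this notebook:

# Illustrative PySpark equivalents (assumes an existing SparkContext `sc`)
out_map = sc.parallelize([1, 2, 3]).map(lambda x: x + 1).collect()      # [2, 3, 4]
out_reduce = sc.parallelize([1, 2, 3]).reduce(lambda x, y: x + y)       # 6
out_bykey = sc.parallelize([('a', 1), ('b', 2), ('a', 3)]) \
              .reduceByKey(lambda x, y: x + y).collect()                # [('a', 4), ('b', 2)] (order may vary)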
Now let's check that these functions work with our sequential implementation of map, reduce, and reduceByKey:
In [1]:
import itertools
import functools
# We wrap the original Python map and reduce functions to be more powerful and resilient
def mapSequential(data, func):
    return list(map(func, data))

def reduceSequential(data, func):
    return functools.reduce(func, data)

def reduceByKeySequential(data, func):
    reduced_data = []
    # groupby only groups consecutive items, so sort the data by key first
    for key, vals in itertools.groupby(sorted(data, key=lambda x: x[0]), key=lambda x: x[0]):
        reduced_data.append((key, reduceSequential([x[1] for x in vals], func)))
    return reduced_data
In [2]:
# Define our three inputs
input_1 = [1,2,3]
input_2 = [1,2,3]
input_3 = [('a', 1), ('b', 2), ('a', 3)]

# Define the two functions used
def plusOne(x):
    return x + 1

def add(x, y):
    return x + y

# Apply our functions to our inputs
output_1 = mapSequential(input_1, plusOne)
output_2 = reduceSequential(input_2, add)
output_3 = reduceByKeySequential(input_3, add)

# Print out outputs
print('Output 1:', output_1)
print('Output 2:', output_2)
print('Output 3:', output_3)
Output 1: [2, 3, 4]
Output 2: 6
Output 3: [('a', 4), ('b', 2)]
Data Pre-Processing:
Below is code to open a text file and return a list of words containing only upper-case unicode characters. We use this to read the text file (i.e., "The Count of Monte Cristo") and prepare the text for the following three problems. The output, which you should use for solving the assignment problems, is named words.
In [3]:
# Import regular expressions library
import re
# Define a method for reading and processing text files
def loadText(f_name):
    # Read the text file
    with open(f_name, 'r') as f:
        text_lines = f.readlines()
    # Concatenate the list of strings into a single string
    text_all = ''.join(text_lines)
    # Replace all non-alphabetic characters with spaces using a regular expression
    text_alpha = re.sub(r'[^a-zA-Z]', ' ', text_all)
    # Convert characters to upper-case
    text_upper = text_alpha.upper()
    # Split the string of text into a list of words and remove empty words
    # (use != rather than 'is not' for string comparison)
    words = [w for w in text_upper.split(' ') if w != '']
    return words
# Load list of words
words = loadText('book_CountOfMonteCristo.txt')
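As a quick illustration of the pre-processing (a hypothetical snippet, not part of the assignment), every non-alphabetic character becomes a space, so splitting on ' ' produces empty strings that the list comprehension filters out:

# Hypothetical example of the cleaning steps in loadText
text = re.sub(r'[^a-zA-Z]', ' ', 'Chapter 1. Marseilles')
print(text)  # 'Chapter    Marseilles'
print([w for w in text.upper().split(' ') if w != ''])  # ['CHAPTER', 'MARSEILLES']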
Problem 1:
Analyze the text for word length frequency. We might expect short words to be more common than long words. But are words of length 2 more common than words of length 3? Are words of length 3 more common than words of length 4? Use the pre-processed text, words, from the previous cell to count the frequency of each word length in the text using the sequential MapReduce methods we defined above.
Complete the definitions of the functions in the following cell. These functions are used in the next cell with the map and reduce calls that we have defined for you above.
In [4]:
# Return the length of a given word
def wordLength(word):
    return len(word)

# Given a key and value, return a (key, value) pair
def makeKeyValue(key, value=1):
    return (key, value)

# Count (reduce) the values for a given key (word length)
def addValues(val1, val2):
    return val1 + val2
In [5]:
# Map the length of each word
word_lengths = mapSequential(words, wordLength)

# Map key-value pairs to help count each word length
word_keyvalues = mapSequential(word_lengths, makeKeyValue)

# ReduceByKey to count number of words with each length
word_length_counts = reduceByKeySequential(word_keyvalues, addValues)

# Sort word length by most common
wl_counts_sorted = sorted(word_length_counts, key=lambda x: x[1], reverse=True)

# Print the 6 most common word lengths
print('Word Length : Count')
for word_len, count in wl_counts_sorted[:6]:
    print('{:<11d} : {:>6d}'.format(word_len, count))
Word Length : Count
3 : 109798
2 : 84021
4 : 81777
5 : 49101
6 : 39015
7 : 30701
Expected Output:
Word Length : Count
3 : 109798
2 : 84021
4 : 81777
5 : 49101
6 : 39015
7 : 30701
Problem 2:
For this problem, it may be beneficial to use another MapReduce function from Apache Spark: flatMap. We define flatMapSequential below. Unlike map, flatMap can expand the number of elements in a mapped iterable by flattening a list of lists into a single list.
In [6]:
def flatMapSequential(data, func):
    mapped_data = mapSequential(data, func)
    flattened_data = [item for lst in mapped_data for item in lst]
    return flattened_data
To help you become familiar with flatMap, we have an example below which should make the difference between map and flatMap apparent.
In [7]:
# Define a list of lists of integers for testing
test = [[1,2,3], [4,5,6], [7,8,9]]

# Define a function that returns the input
def dummyFunc(x):
    return x

# Let's apply a map with our dummy function
test_map = mapSequential(test, dummyFunc)
print('map:', test_map)

# Let's apply a flatMap with our dummy function
test_flatmap = flatMapSequential(test, dummyFunc)
print('flatmap:', test_flatmap)
map: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
flatmap: [1, 2, 3, 4, 5, 6, 7, 8, 9]
Do you see the difference between map and flatMap? If not, modify the code and try it with a different input or a different function. In general, the function with which you call flatMap should return an iterable (e.g., a list or tuple).
Getting back to the problem... Analyze the text for letter frequency. If you've taken a crypto course and/or have seen substitution ciphers, then you are probably aware that 'e' is the most common letter used in the English language. Use the pre-processed text words to count the frequency of each letter in the text using the sequential MapReduce methods we defined above.
Complete the splitWord function in the following cell, then fill in the code in the cell after. Much of this code will be similar to the final cell of Problem 1, including map and reduce calls using functions defined in Problem 1. Did you write those functions generally enough to be reused?
In [8]:
# Given a word, return an iterable of characters
def splitWord(word):
    return list(word)
In [9]:
# The next two calls require you to use a map function.
# Think about which map (i.e., flatMap or map) is most suitable.

# Map the list of words to a list of characters
chars = flatMapSequential(words, splitWord)

# Map the list of characters to a list of key-value pairs
char_keyvalues = mapSequential(chars, makeKeyValue)

# ReduceByKey to count number of occurrences of each letter
char_counts = reduceByKeySequential(char_keyvalues, addValues)

# Sort letters by most common
char_counts_sorted = sorted(char_counts, key=lambda x: x[1], reverse=True)

# Print the 6 most common characters
print("{:^9} : {:^5}".format("Character", "Count"))
for char, cnt in char_counts_sorted[:6]:
    print("{:^9} : {:^5}".format(char, cnt))
Character : Count
E : 258693
T : 180211
A : 165306
O : 156817
I : 142095
N : 137343
Expected Output:
Character : Count
E : 258693
T : 180211
A : 165306
O : 156817
I : 142095
N : 137343
Problem 3:
For this problem, it may be beneficial to use the numpy package. Specifically, the element-wise addition of numpy arrays may come in handy. Below is an example of what happens when you add two numpy arrays.
In [10]:
# Let's see a useful property of numpy arrays
# HINT: ref [1]
import numpy as np
# Define numpy arrays
a = np.array([1,1,0])
b = np.array([0,1,0])
# Print each array and the sum of each array
print(' a:', a)
print(' b:', b)
print('a+b:', a+b)
a: [1 1 0]
b: [0 1 0]
a+b: [1 2 0]
References:
1: numpy quickstart
If we really wanted to crack a substitution cipher (or win on "Wheel of Fortune"), then we should be aware that, although 'e' is the most common letter used in English, it may not be the most common first letter in a word. Count the positional frequencies of each letter using the sequential MapReduce methods we defined above. Specifically, count the number of times each letter appears as the first letter in a word, as the last letter in a word, and as an interior letter in a word (i.e., a letter that is neither first nor last).
Complete the lettersPosition function below, then fill in the code in the cell after. Your functions are used with map and reduce calls that we have defined. Note that we use a function that has also been used in Problems 1 and 2. Using numpy arrays in the following function definition could make this assignment easier; however, you are not required to use numpy. Feel free to change the code by adding more maps or reduces in order to get a correct answer.
In [11]:
# Define a method to return the position of each character
# You may want to reference your solution from assignment 3
# Remember how the reduceByKey will use the returned values when writing a solution
def lettersPosition(word):
    if len(word) == 1:
        # A single-letter word counts only as a first letter
        return [(word, np.array([1, 0, 0]))]
    return [(word[0], np.array([1, 0, 0]))] \
        + [(c, np.array([0, 1, 0])) for c in word[1:-1]] \
        + [(word[-1], np.array([0, 0, 1]))]
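As a quick sanity check (hypothetical inputs, not taken from the text): the first letter of a word contributes a [1,0,0] count vector, each interior letter contributes [0,1,0], the last letter contributes [0,0,1], and a single-letter word counts only as a first letter:

print(lettersPosition('THE'))
# [('T', array([1, 0, 0])), ('H', array([0, 1, 0])), ('E', array([0, 0, 1]))]
print(lettersPosition('A'))
# [('A', array([1, 0, 0]))]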
In [12]:
# Map the location of each character
char_positions = flatMapSequential(words, lettersPosition)

# ReduceByKey the letter positions for each character
char_position_counts = reduceByKeySequential(char_positions, addValues)

# Sort character position data alphabetically
cp_sorted = sorted(char_position_counts, key=lambda x: x[0])

# Print the position frequency of the first letters in the alphabet
print('Character : First | Interior | Last')
for char, char_position in cp_sorted[:6]:
    first, interior, last = char_position
    print('{:<9} : {:5d} | {:>8d} | {:>5d}'.format(char, first, interior, last))
Character : First | Interior | Last
A : 51644 | 111686 | 1976
B : 18866 | 8516 | 541
C : 19577 | 32130 | 725
D : 17289 | 18613 | 58075
E : 10178 | 153205 | 95310
F : 17724 | 10618 | 16988
Expected Output:
Character : First | Interior | Last
A : 51644 | 111686 | 1976
B : 18866 | 8516 | 541
C : 19577 | 32130 | 725
D : 17289 | 18613 | 58075
E : 10178 | 153205 | 95310
F : 17724 | 10618 | 16988
Problem 4:
Repeat Problem 2 with a new text, one you expect to have a different letter distribution than The Count of Monte Cristo. You could use something written centuries earlier (e.g., Shakespeare or an early English translation of the Bible), something in a distinctive style or genre (e.g., poetry or a contract), or even something in a different language.
In [14]:
# Place the .txt file in the folder with this Jupyter Notebook, then load it to a list of words.
# If your text is in a language with different characters than English,
# you will have to create a modified version of the function loadText().
# Compute char_counts_sorted as in Problem 2.
# Print the most frequent characters to compare with the results from Problem 2.
words = loadText('bohr_quantum_theory.tex')

# The next two calls require you to use a map function.
# Think about which map (i.e., flatMap or map) is most suitable.

# Map the list of words to a list of characters
chars = flatMapSequential(words, splitWord)

# Map the list of characters to a list of key-value pairs
char_keyvalues = mapSequential(chars, makeKeyValue)

# ReduceByKey to count number of occurrences of each letter
char_counts = reduceByKeySequential(char_keyvalues, addValues)

# Sort letters by most common
char_counts_sorted = sorted(char_counts, key=lambda x: x[1], reverse=True)

# Print the 6 most common characters
print("{:^9} : {:^5}".format("Character", "Count"))
for char, cnt in char_counts_sorted[:6]:
    print("{:^9} : {:^5}".format(char, cnt))
Character : Count
E : 39497
T : 30992
O : 23630
I : 23067
A : 22736
N : 22041
My reflection: The file bohr_quantum_theory.tex is the LaTeX source for Niels Bohr's seminal paper "The Quantum Theory of Line-Spectra". The LaTeX source contains many non-alphabetic characters such as "\" and "%"; indeed, viewing the file, it appears that there are almost more non-alphabetic characters than alphabetic ones. Nevertheless, the top 6 most common characters are the same (although in a slightly different order) as the ones in "The Count of Monte Cristo".
Things to Consider:
In this assignment you wrote functions that can be used with the MapReduce programming model. The map and reduce functions were sequential, but they work in the same way as the parallel versions. This means that the functions you wrote in this assignment can be used in the next assignment where we use MapReduce in parallel with Apache Spark!
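For instance, here is a hedged sketch of how the Problem 2 solution could be reused with PySpark's parallel RDD API. This is illustrative only; it assumes a SparkContext named sc, whose setup is covered in the next assignment:

# Illustrative reuse of our functions with PySpark (assumes an existing SparkContext `sc`)
char_counts = sc.parallelize(words) \
    .flatMap(splitWord) \
    .map(makeKeyValue) \
    .reduceByKey(addValues) \
    .collect()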
Assignment Questions:
Answer the following questions, in a couple of sentences each, in the cells provided below.
List the key tasks you accomplished during this assignment.
Describe the challenges you faced in addressing these tasks and how you overcame them.
Did you work with other students on this assignment? If yes, how did you help them? How did they help you? Be as specific as possible.
In this assignment, we repeated the problems from Assignment 3 but used the MapReduce programming model this time (well... sort of). We used the map and reduce functions provided by Python and did not actually make use of parallelization. Nevertheless, we gained an understanding of what MapReduce implementations look like at the surface level.
I did not face any serious challenges during this assignment. I did look up some functions such as map and reduce to see how they worked. That is about it.
I did not work with anyone on this assignment as I was not in class this week.
In [ ]: