$30
Assignment 3a: Huffman Encoding
Description
For this project, you will implement a program to encode text using Huffman coding.
Huffman coding is a method of lossless (the original can be reconstructed perfectly)
data compression. Each character is assigned a variable length code consisting of ’0’s
and ’1’s. The length of the code is based on how frequently the character occurs; more
frequent characters are assigned shorter codes.
The basic idea is to use a binary tree, where each leaf node represents a character and
frequency count. The Huffman code of a character is then determined by the unique
path from the root of the binary tree to that leaf node, where each ’left’ accounts for a ’0’
and a ’right’ accounts for a ’1’ as a bit. Since the number of bits needed to encode a
character is the path length from the root to the character, a character occurring
frequently should have a shorter path from the root to their node than an “infrequent”
character, i.e. nodes representing frequent characters are positioned less deep in the
tree than nodes for infrequent characters.
The key problem is therefore to construct such a binary tree based on the frequency of
occurrence of characters. A detailed example of how to construct such a Huffman tree
is provided in the lecture on Huffman Coding.
Note:
● You must provide test cases for all functions (Only test the provided functions,
with the exact name provided)
● Use descriptive names for data structures and helper functions.
Prerequisites
This assignment depends on Min Priority Queue in Lab 7. Complete Lab 7 before starting this
assignment.
1
CPE202 Spring 2020
Data Definitions
Let’s write a recursive data definition for a Huffman tree, which is a possibly empty
binary tree of HuffmanNodes:
Huffman Tree is one of None or HuffmanNode
The definition is very similar to the definitions of binary trees we have seen before
except that the class for each node is called HuffmanNode. This means that each
HuffmanNode is a Huffman Tree (either the root node of the entire tree or a subtree).
In huffman.py, write a definition for HuffmanNode class. The class is required to have
four fields: char for storing a character, freq for the character’s frequency, left for
storing a pointer for the left subtree, and right for the right subtree. Do not change the
names of the fields specified. Make sure you implement, __eq__ and __repr__
methods as well.
As usual, follow the six steps of the design recipe.
A HuffmanNode class represents either a leaf or an internal node (including the root
node) of a Huffman tree. A HuffmanNode contains a character value (either as the
character itself or the ord() of the character, your choice) and an occurrence count, as
well as references to left and right Huffman subtrees, each of which is a binary tree of
HuffmanNodes. The character value and occurrence count of an internal node are
assigned as described below. You may want additional fields in your HuffmanNodes if
you feel it is necessary for your implementation.
● define the __lt__ (self, other) method for HuffmanNode objects that returns true if
self should come before other when added to a Min Priority Queue. A
HuffmanNode ‘a’ should come before HuffmanNode ‘b’ if the occurrence count of
‘a’ is smaller than that of ‘b’. In case of equal occurrence counts, break the tie by
using the ASCII value of the character to determine the order. If, for example, the
characters ’d’ and ’k’ appear exactly the same number of times in your file, then
’d’ comes before ’k’, since the ASCII value of character ’d’ is less than that of ’k’.
● __lt__ method overrides ‘<’ operator in Python. This means that when two
objects of HuffmanNode are compared by the ‘<’ operator in your MinPQ object,
2
CPE202 Spring 2020
the __lt__ method in the object of HuffmanNode is used. This is why it is
important that you only use the ‘<’ operator to compare two items in your MinPQ,
particularly shif_up(), shift_down(), and other helper functions.
Functions
The following bullet points provide a guide to implement some of the data structures and
individual functions of your program. Start by creating a file huffman_coding.py and
add functions to this file, if not otherwise stated. You should develop incrementally,
building test cases for each function as it is implemented.
Count Occurrences: cnt_freq(filename)
● Implement a function called cnt_freq(filename) that opens a text file with a given
file name (passed as a string) and counts the frequency of occurrences of all the
characters within that file. Use the built-in Python List data structure of size 256
for counting the occurrences of characters. This will provide efficient access to a
given position in the list. (In non-Python terminology you want an array.) You can
assume that in the input text file there are only 8-bit characters resulting in a total
of 256 possible character values including new line character and space. This
function should return the 256 item list with the counts of occurrences. There can
be issues with extra characters in some systems, especially new line character
vs carriage return on Mac vs Windows.
○ Suppose the file to be encoded contained:
ddddddddddddddddccccccccbbbbaaff
○ Numbers in positions of freq counts [96:104] = [0, 2, 4, 8, 16, 0, 2, 0]
○ For an empty file, this function should return a list of size 256 filled with all
0s.
○ For a null character, whose ascii code is 0, put the count as 1: counts[0] =
1. We are going to use the null character to mark the end of the encoding
when we write an encoded string as binary bits. This means that we stop
decoding when we encounter the encoding for a null character.
Build a Huffman Tree
3
CPE202 Spring 2020
Since the code depends on the order of the left and right branches taken in the path
from the root to the leaf (character node), it is crucial to follow a specific convention
about how the tree is constructed. To do this we need an ordering on the Huffman
nodes.
● Write a function that builds a Huffman tree from a given list of the number of
occurrences of characters returned by cnt_freq() and returns the root node of
that tree. Call this function create_huff_tree(list_of_freqs).
○ Start by creating a Min Priority Queue (your implementation from Lab 7) of
individual Huffman trees each consisting of a single HuffmanNode
containing the character and its occurrence counts. You can either create
an array of Huffman Trees (HuffmanNode) and convert the array into a
min heap using the heapify operation (O(n)), or repeat the insert operation
for n characters in a given file (O(n log n)). Make sure to skip characters
whose frequencies are 0, and to include every character whose counts are
greater than 0 including a new line character and a space.
○ Building the actual tree involves removing the two nodes with the lowest
frequency count from the Min Priority Queue and connecting them to the
left and right field of a new created Huffman Node as in the example
provided: Do del_min() twice, create a new tree (HuffmanNode) by
merging the two removed trees (HuffmanNode) by putting the tree
(HuffmanNode) removed first on the left and the tree (HuffmanNode)
removed second on the right.
○ Note that when connecting two HuffmanNodes to the left and right field of
a new parent node, that this new node is also a HuffmanNode, but does
not contain an actual character to encode. Instead this new parent node
should contain an occurrence count that is the sum of the left and right
child occurrence counts as well as the minimum of the left and right
character representation in order to resolve ties in the __lt__ method.
○ Once a new parent node has been created from the two nodes with the
lowest occurrence count as described above, that parent node is inserted
into the Min Priority Queue.
○ This process of connecting nodes from the front of the sorted list is
continued until there is a single Tree (HuffmanNode) left in the list, which
is the root node of the Huffman tree. create_huff_tree(list_of_freqs) then
returns this node.
○ If there are no entries in the list of occurrences passed to this function, it
should return None.
4
CPE202 Spring 2020
○ If there is only one entry in the list of occurrences passed to this function,
the tree returned by this function will consist of a single node.
○ Consider the benefit of using Min Priority Queue as our data structure for
building a Huffman Tree, instead of using a sorted list.
Build an Array for the Character Codes
● We have completed our Huffman tree, but we are still lacking a way to get our
Huffman codes. Implement a function named create_code(root_node) that
traverses the Huffman tree that was passed as an argument and returns an array
(using a Python list) of 256 strings. Use the character’s respective integer ASCII
representation as the index into the array, with the resulting Huffman code for
that character stored at that location. Traverse the tree from the root to each leaf
node and adding a ’0’ when we go ’left’ and a ’1’ when we go ’right’ constructing
a string of 0’s and 1’s. You may want to:
○ use the built-in ’+’ operator to concatenate strings of ’0’s and ’1’s here.
○ You may want to initialize a Python list of strings that is initially consists of
256 empty strings in create_code. When create_code completes, this list
will store for each character (using the character’s respective integer
ASCII representation as the index into the list) the resulting Huffman code
for the character. The code will be represented by a sequence of ’0’s and
’1’s in a string. Note that many entries in this list may still be the empty
string. Return this list.
2.5 Huffman Encoding
● Write a function called huffman_encode(in_file, out_file) (use that exact name)
that reads an input text file and writes to an output file the following:
○ A header (see below for format) on the first line in the file (should end with
a newline)
○ Using the Huffman code, the encoded text into an output file.
● Write a function called create_header(list_of_freqs) that takes as parameter the
list of freqs previously determined from cnt_freq(filename). The create_header
function returns a string of the ASCII values and their associated frequencies
from the input file text, separated by one space. For example,
create_header(list_of_freqs) would return “0 1 97 3 98 4 99 2” for the text
“aaabbbbcc”. The “0 1” is for the null character.
● The huffman_encode function accepts two file names in that order: input file
name and output file name, represented as strings. If the specified output file
5
CPE202 Spring 2020
already exists, its old contents will be erased. See example files in the test cases
provided to see the format.
● Note: Writing the generated code as a string for each character into a file will
actually enlarge the file size instead compress it. The explanation is simple:
although we encoded our input text characters in sequences of ’0’s and ’1’s,
representing actual single bits, we write them as individual ’0’ and ’1’ characters,
i.e. 8 bits.
● To actually obtain a compressed file you will also write the ’0’ and ’1’ characters
as individual bits. Do this by taking the out_file string and:
○ Add _compressed to the name of the file before the .txt file. (Ex: if out_file
is named “output.txt”, you will create the string “output_compressed.txt”)
○ Use the filename created above to create a HuffmanBitWriter object using
the huffman_bit_writer module.
○ Use the methods in the HuffmanBitWriter object to write the header and
individual bits of the character codes to the compressed file. Add the
code for the null character at the end in order to mark the end of the
encoding.
Tests
● Download a starter code, huffman.py, huffman_bit_writer.py, and the test input
files provided on Canvas.
● Also some tests are provided for your convenience on Canvas.
● Write sufficient tests using unittest to ensure full functionality and correctness of
your program. Start by using the supplied huffman_tests.py, and the various text
files (e.g. file1.txt and file1_soln.txt) to begin testing your programs. You may
want to initially comment out some of the tests, or ignore failures when testing
functionality that hasn’t been fully implemented.
● When testing, always consider edge conditions like the following cases:
○ If the input file consists only of some number of a single character, say
"aaaaa", it should write just that character followed by a space followed by
the number of occurrences with a newline: “0 1 97 5\n”
○ In case of an empty input text file, your program should also produce an
empty file.
○ If an input file does not exist, your program should raise a
FileNotFoundError.
6
CPE202 Spring 2020
Some Notes
When writing your own test files or using copy and paste from an editor, take into
account that most text editors will add a newline character to the end of the file. If you
end up having an additional newline character ’\n’ in your text file that wasn’t there
before, then this ’\n’ character will also be added to the Huffman tree and will therefore
appear in your generated string from the Huffman tree. In addition, an actual newline
character is treated differently, depending on your computer platform. Different
operating systems have different codes for "newline". Windows uses ’\r\n’ = 0x0d 0x0a,
while Unix and Mac use ’\n’ = 0x0a. It is always useful to use a hex editor to verify why
certain files that appear identical in a regular text editor are actually not identical.
Submission
You must submit the following files to Gradzilla:
● huffman.py, containing the data definition for HuffmanNode.
● huffman_coding.py, containing the functions specified and any helper functions
necessary
○ cnt_freq(filename): returns a Python list of 256 integers representing the
frequencies of characters in file (indexed by ASCII value of characters)
○ create_huff_tree(list_of_freqs): returns the root node of a Huffman Tree, a
Huffman node object
○ create_code(root_node): returns a Python list of 256 strings representing
the code for each character (indexed by ASCII value of the character).
Note: Use an empty string to represent the code for ASCII characters that
do not appear in the file being compressed.
○ create_header(list_of_freqs): returns a header for the output file with ascii
values and their associated counts, space separated.
○ huffman_encode(in_file, out_file): encodes in_file and writes the it to
out_file
● huffman_tests.py, containing test cases for the functions specified by the
assignment
○ This file should contain all tests for your solution, including tests for any
helper functions that you may have used. These tests will be used to verify
that you have 100% coverage of your solution. Your solution must pass
these tests.
● min_pq.py, your correct implementation of the Min Priority Queue.
7
CPE202 Spring 2020
● huffman_bit_writer.py, unmodified, as provided to you.
● other files necessary to run your program and tests.
There will be two tests available on Gradzilla: one for the part a and the other for the
part b. The grader for part a is for your convenience: the score does not count toward
your grade. But make sure that your program passes all your tests before moving on to
part b. The grader for part b tests every aspect of this project assignment including both
part a and b, and Min Priority Queue. The score given by the grader will be basically
your grade, but some points might be deducted manually for some reasons including
being late.
8