Starting from:

$30

Project 4 Concordance

Project 4
CSC 202

Concordance (an application of hash tables)
This assignment has several parts: implementing different versions of the hash table from the text (with some
additional functionality) and writing an application that builds a concordance. A Webster’s dictionary
definition of concordance is: “an alphabetical list of the main words in a work.” In addition to the main
words, your program will keep track of all the line numbers where these main words occur.
Word and Line Concordance Application
The goal of this assignment is to process a textual data file to generate a word concordance with line numbers
for each main word. A dictionary ADT is perfect to store the word concordance with the word being the
dictionary key and a list of its line numbers being the associated value for the key. Since the concordance
should only keep track of the “main” words, there will be a second file containing words to ignore, namely a
stop-words file (stop_words.txt). The stop-words file will contain a list of stop words (e.g., “a”, “the”, etc.) --
these words will not be included in the concordance even if they do appear in the data file. You should not
include strings that represent numbers. E.g. “24” or “2.4” should not appear. There is no distinction between
upper and lower case letters. (For simplicity convert all characters to lowercase letters.)
The following is an example and the output file.
The general algorithm for the word-concordance program is:
1) Read the stop_words.txt file into your implementation of a hashtable containing only stop words.
Make sure to check the number of stop words so that you can size the dictionary appropriately! (WARNING:
Make sure you do not include the newline character (‘\n’) when adding the stop words.)
2) Process the input file one line at a time to build the word-concordance dictionary. This dictionary should
contain the non-stop words as the keys. Associated with each key is its value where the value consists of a list
containing the line numbers where the key appears in the file. DO NOT INCLUDE DUPLICATE LINE
NUMBERS. (Note: If a word appears more than once on a line only print out the line number once
for that word.)
3) Generate a text file containing the concordance words printed out in alphabetical order along with their
line numbers. One word per line. See the sample output files.
(Note: It is strongly suggested that the logic for reading words and assigning line numbers to them be
developed and tested separately from other aspects of the program. This could be accomplished by reading a
sample file and printing out the words recognized with their corresponding line numbers without any other word
processing.)
DICTIONARY ADT COMPARISON
Implement 2 dictionary ADT implementations:
 Open Addressing using linear probing
 Open Addressing using quadratic probing
Your basic hash function should take a string containing one or more characters and return an integer. Use
Horner’s rule for string hashing to compute the hash efficiently. (This function explained below.)
Also your hashtable size should have the capability to grow if the input file is large. You should start with a
default hash table size of 251, then 503, then if further increases are necessary, use
new table size = 2* old table size + 1 , and use this “new table size” even if it is no longer a prime. (If load
factor .75 increase the size of hash table.)
DATA FILES –
 the stop words in the file stop_words.txt
 four sample data files that can be used for preliminary testing of your programs:
o input1.txt, concord1.txt - contains no punctuation to be removed
o input2.txt, concord2.txt - contains punctuation to be removed
The hash table classes you implement should each contain the following methods with exact signatures below.
These methods may be called in grading your program
─ def __init__ (self, size): creates and initializes the hash table size to size
─ def read_stop (self, filename): read words from a stop words file and insert them into hash table
─ def read_file (self, filename, stop_table): read words from input file and insert them into hash table,
after processing for punctuation, numbers and filtering out stop words in the stop_table
─ def get_tablesize(self): returns the size of the hash table
─ def save_concordance(self, output_filename): see sample output file for format
─ def get_load_fact(self): returns the load factor of the table
─ def myhash(self, key, table_size) and return an integer from 0 to the (size of the hash table)  1.
Compute the hash value by h_value(str) = ∑ 𝑜𝑟𝑑(𝑠𝑡𝑟[𝑖]) ∗ 31 𝑛−1 𝑛−1−𝑖
0

where n = the minimum of len(key) and 8 (e.g., if len (key) 8 assume n=8) , i = the index of each
character of the key.
Example:
key = ‘Cat’, which is same as ‘cat’
n= len(key) = 3 and i = 0 for ‘c’, i = 1 for ‘a’, and i = 2 for ‘t’,
h_vslue = (ord(‘c’) * 313-1-0 + ord(‘a’) * 313-1--1 + ord(‘t’) * 313-1-2
) % table_size
SUBMISSION
Two files:
─ hash_lin_table.py containing -- class HashTableLinPr for Linear Probing
─ hash_quad_table .py containing -- class HashTableQuadPr for Quadratic Probing
Helpful resources:
 A “How to on sorting in Python”: https://docs.python.org/3/howto/sorting.html . You may use the builtin sorting routines in Python so you may find this reference helpful.
 Python has some build in capability to eliminate punctuation that you may find helpful. See string.
constants https://docs.python.org/3.1/library/string.html
 The following is a simple approach to test if a string contains a number:
def is_number(s):
 try:
 float(s)
 return True
 except ValueError:
 return False

More products