Assignment 2: Dictionary Stat Generator

Your shopping cart is empty.

Programming Assignment 2: Dictionary Stat Generator

This assignment is designed to provide you some experience with C programming. Your task is to write a program that reads in a collection of dictionary and data ﬁle pairs and generates statistics about each pair. There are two parts to this assignment. The ﬁrst part of the assignment is for regular credit and the second part of the assignment is extra credit.
First Part (100 points)
Your task in the ﬁrst part is to write a C program called first that reads an input ﬁle, which contains a list of dictionary and data ﬁle pairs, and generates statistics for each pair. Each line in your input ﬁle contains the names of the dictionary and data ﬁles. You have to read the ﬁles and generate the following statistics:
1. For every word w in the dictionary ﬁle, count the number of words w0 that occur in the data ﬁle such that w0 = w. 2. For every word w in the dictionary ﬁle, count the number of words w0 that occur in the data ﬁle such that w is a proper preﬁx of w0 (we shall say w0 is a superword of w).
Write all the unique words in the dictionary along with these counts to the output ﬁle in lexicographical (i.e. alphabetical) order. Deﬁnition of a word: Any string of characters can be a word. For example in the sentence:
”a&ab&abc234 dfg”
the words are:
a, ab, abc, dfg
They do not need to be meaningful. Words, in both dictionary and data ﬁles, correspond to the longest continuous sequence of letters read from the ﬁle. Another way to say this is that words are any sequence of letters separated by non-letter characters (punctuation, numbers, whitespace, etc.). Each unique word is case-insensitive. That is, “boOK”, “Book” and “bOOk” are all occurrences of the same (unique) word “book”. Case-insensitivity also applies when matching preﬁxes: for example, both “Boo” and ”bOo” are proper preﬁxes of “bOOk”, which itself is a proper preﬁx of “booKING” (See the example below).
As an example, suppose the content of the dictionary ﬁle is:
boo22$Book5555bOoKiNg#bOo#TeX123tEXT(JOHN)
and that of the data ﬁle is:
John1TEXAN4isa1BOoRiSH%whohasa2bo3KING BOOKING bOoKIngs$12for a TEX-Text(BOOKS(textBOOKS)
Then, the various counts for the unique words in the dictionary ﬁle are:
1
Unique words No. of occurrences No. of superwords boo 0 4 book 0 3 booking 1 1 tex 1 3 text 1 1 john 1 0
Input/Output speciﬁcation
Usage interface
The program first should have the following usage interface:
first <mappingfile
where <mappingFile is the name of the mapping ﬁle. You can assume that the mapping ﬁle will exist and that it is well structured. So, unlike assignment1, you don’t need to check the structure of the mapping ﬁle
If no argument or more than one argument is provided, or the ﬁle names provided are invalid, the program should print “invalid input” and abort.
Input speciﬁcation
Here, and below, let m be the maximum number of words in either the dictionary or data ﬁles. Every word is of length at most k. Let n be the number of unique words in the dictionary ﬁle. Your input will be a mapping ﬁle, which contains lines of the form: hdictFilei hdataFilei, where dictFile and dataFile are the dictionary and data ﬁles for your program, respectively. An example of a mapping ﬁle is given below:
dict_1 data_1 dictm foom
The ﬁles: dict1.txt and dictm.txt are dictionary ﬁles and ﬁles: data1.txt and foom.txt are data ﬁles. They are plain text ﬁles with no special structure.
Output speciﬁcation
Your program should generate several output ﬁles outi.txt, where i is the line number in mapping ﬁle. It means that you need to get mapping ﬁle as an argument to your program. Then each line in the mapping ﬁle has information about the dictionary ﬁle and the data ﬁle. For example suppose line j in the mapping ﬁle is dict j data j. In this case you should produce outj.txt, which contains the described informations. Remember that you shouldn’t have any spaces at the end of the lines in your output ﬁles. Also, you shouldn’t have any empty lines in your outputs ﬁles . The program should write all the unique words (see deﬁnition above) that occur in the dictionary ﬁle along with their various counts (See above), in lexicographical order, one word per line to the output ﬁles i.e. the output should have exactly n lines:
<word1 <count11 <count12 <word2 <count21 <count22 . . . <wordn <countn1 <countn2
such that
1. <word1, <word2, <word3 ... <wordn is the lexicographical ordering of the unique words occurring in the dictionary,
2
2. <counti1 denotes the number of occurrences of <wordi in the data ﬁle,
3. <counti2 denotes the number of proper superwords of <wordi in the data ﬁle,
4. Line i is <wordi<space<counti1<space<counti2
Note that you can assume the dictionaries are non-empty.
Also note that if the mapping ﬁle has more than one line, you need to do same procedure for each line and create the result ﬁle (outi.txt) of every line.
For example, running first on the input described above should produce the following output: boo 0 4 book 0 3 booking 1 1 john 1 0 tex 1 3 text 1 1
Second Part (Extra Credit)
Your task in this second part is to write a C program called second that reads a input ﬁle, which contains a list of dictionary and data ﬁle pairs, and generates statistics for each pair. Each line in your input ﬁle contains the names of the dictionary and data ﬁles. You have to read the dictionary and data ﬁles and generate the following statistics:
1. For every word w in the dictionary ﬁle, count the number of words w0 that occur in the data ﬁle such that w0 = w. 2. For every word w in the dictionary ﬁle, count the number of words w0 that occur in the data ﬁle such that w0 is a proper preﬁx of w.
Write all the unique words in the dictionary along with these counts to the output ﬁle in lexicographical (i.e. alphabetical) order. A word is deﬁne the same way as in the ﬁrst part. Case-insensitivity also applies when matching preﬁxes: for example, both “Boo” and ”bOo” are proper preﬁxes of “bOOk”, which itself is a proper preﬁx of “booKING” (See the example below).
As an example, suppose the content of the dictionary ﬁle is:
boo22$Book5555bOoKiNg#bOo#TeX123tEXT(JOHN)
and that of the data ﬁle is:
John1TEXAN4isa1BOoRiSH%whohasa2bo3KING BOOKING bOoKIngs$12for a TEX-Text(BOOKS(textBOOKS)
Then, the various counts for the unique words in the dictionary ﬁle are:
Unique words No. of occurrences No. of preﬁxes boo 0 1 book 0 1 booking 1 1 tex 1 0 text 1 1 john 1 0
3
Input/Output speciﬁcation
Usage interface
The program second should have the following usage interface:
second <mappingFile
where <mappingFile is the name of the mapping ﬁle. You can assume that the mapping ﬁle will exist and that it is well structured. So, unlike assignment1, you don’t need to check the structure of the mapping ﬁle.
In case no argument or more than one argument is provided, or the ﬁle names provided are invalid, the program should print “invalid input” and abort.
Input speciﬁcation
Here, and below, let m be the maximum number of words in either the dictionary or data ﬁles. Every word is of length at most k. Let n be the number of unique words in the dictionary ﬁle. Your input will be a mapping ﬁle, which contains lines of the form: <dictFile <dataFile dictFile and dataFile are the dictionary and data ﬁles for your program, respectively. They are plain text ﬁles with no special structure.
Output speciﬁcation
Your program should generate several output ﬁles outi.txt, where i is the line number in mapping ﬁle. It means that you need to get mapping ﬁle as an argument to your program. Then each line in the mapping ﬁle has information about the dictionary ﬁle and the data ﬁle. For example suppose line j in the mapping ﬁle is dict j data j. In this case you should produce outj.txt, which contains the described informations. Remember that you shouldn’t have any spaces at the end of the lines in your output ﬁles. Also, you shouldn’t have any empty lines in your outputs ﬁles . The program should write all the unique words (see deﬁnition above) that occur in the dictionary ﬁle along with their various counts (See above), in lexicographical order, one word per line to the output ﬁles i.e. the output should have exactly n lines:
<word1 <count11 <count12 <word2 <count21 <count22 . . . <wordn <countn1 <countn2
such that
1. <word1, <word2, <word3 ... <wordn is the lexicographical ordering of the unique words occurring in the dictionary,
2. <counti1 denotes the number of occurrences of <wordi in the data ﬁle,
3. <counti2 denotes the number of proper preﬁxes of <wordi in the data ﬁle,
4. Line i is <wordi<space<counti1<space<counti2
Note that you can assume the dictionaries are non-empty.
Also note that if the mapping ﬁle has more than one line, you need to do same procedure for each line and create the result ﬁle (outi.txt) of every line.
For example, running second on the input describe above should produce the following output: boo 0 1 book 0 1
4
booking 1 1 john 1 0 tex 1 0 text 1 1
Design & Implementation
Design
We recommend using the data structure “trie” (http://en.wikipedia.org/wiki/Trie), though you are free to use whatever data structures and algorithms you want as long as your implementation is reasonably eﬃcient (See the Grading Section for detailed eﬃciency requirements).
Implementation
As an additional requirement, your code should implement and use the following functions: • void readDict(FILE *dict file): The function takes a ﬁle pointer to the dictionary ﬁle and reads the unique words from the dictionary ﬁle and stores them in an appropriate data structure. • void matchStr(char* str): This function will take a word and search the data structure that holds the unique dictionary words in order to ﬁnd matches and to update the various counts. This function should be used while scanning the data ﬁle for occurrences of dictionary words and their preﬁxes and superwords. • void printResult(): This function will produce the output of the program. You are free to add any other functions you need.
You are allowed to use functions from standard libraries (e.g., strcmp()) but you cannot use thirdparty libraries downloaded from the Internet (or from anywhere else). If you are unsure whether you can use something, ask us.
We will compile and test your program on the iLab machines so you should make sure that your program compiles and runs correctly on these machines. You must compile all C code using the gcc compiler with the -Wall -Werror ﬂags.

Shopping cart

US$0

Assignment 2: Dictionary Stat Generator

More products