Starting from:

$30

HW8: Word Clouds and BSTs

HW8: Word Clouds and BSTs [PAIR]
This assignment is due by 1 PM on Wednesday, February 28 and is worth 24 points.
Goals
The goal of this assignment is to give you practice with implementing and navigating binary
search trees. You will also see the idea of a "dictionary" or "map" put to use in a text
analysis application.
Setup and Instructions
This is a partner project, and so should be worked on with the partner I assigned to you (if
you have a partner). Remember that pair programming requires you to be working at the
same computer at the same time!
Download HW8.zip and unzip it into the directory that you've created for your work on this
assignment. There is a Java class called WordCloudMaker.java (originally written by Sherri
Goings, with a growing list of modifications by others). There are also a few sample
documents and a stopwords file called StopWords.txt.
Building a Word Cloud
You may have seen word clouds like the one below, which was generated from the text
from "Anne of Green Gables" by Lucy Maud Montgomery. A word cloud takes the most
frequent words in a document and displays them in a jumbled arrangement. The key
feature of a word cloud is that the size of each word is proportional to the number of
times that word appears in the text. (Note that very common words, generally called
"stopwords," are excluded from the word cloud. Otherwise, our clouds would be
dominated by "the", "and", "a", etc.)
In this assignment, you will implement a binary search tree to count the number of times
each word appears in a given text, allowing you to make your own word clouds!
Overall Program Structure
Your program will come in three main pieces:
• A class called WordCountMap consisting of a binary search tree that maintains a
record of words and their counts. (WordCountMap will make use of two very small
classes called Node and WordCount described below.)
• A class called WordCounter containing your main program and any support
methods it needs. The main program will be in charge of opening a text file,
counting all of the words it contains, and producing one of three types of output.
• A class WordCloudMaker.java (provided to you) that will transform a list of word
counts into HTML that shows a simple word cloud.
WordCountMap
First you'll build a class called WordCountMap. Each object of this class will represent
a binary search tree where each node of the tree contains a word and the count of the
number of times that word appears. The tree should be organized according to
the alphabetical ordering of the words. (What instance variables might you want for a
binary search tree?)
Specifically, this class will need to make use of two other classes. The first is a Node class
which should look something like the following:
private class Node {
private String word;
private int count;
private Node left;
private Node right;
}
The exact details of the Node class are up to you, since it's a class that will be invisible to
users of WordCountMap. You may add other (private) methods to this class, but it should be
a nested class within your WordCountMap class.
The other class you will need is a WordCount class. This small class will allow you to create
an object that contains a word and its associated count. Please make this class in a
separate file WordCount.java. This is a relatively simple class that should look something
like the following (plus whatever methods/constructors you choose to add):
public class WordCount implements Comparable<WordCount{
public String word;
public int count;
}
You must use these exact instance variables and make them public, as
the WordCloudMaker class will depend on accessing them directly. Since this class
implements the Comparable interface, you'll have to implement a compareTo method. In
particular, you should use the instance variable count to do this comparison. If you aren't
sure what this method should do, take a closer look at the Javadoc description of it in
the Comparable interface.
Your WordCountMap class will need to implement the following three methods:
/**
* If the specified word is already in this WordCountMap, then its
* count is increased by one. Otherwise, the word is added to this map
* with a count of 1.
*/
public void incrementCount(String word) {
...
}

/**
* Returns an array list of WordCount objects, one per word stored in this
* WordCountMap, sorted in decreasing order by count.
*/
public ArrayList<WordCount getWordCountsByCount() {
...
}

/**
* Returns a list of WordCount objects, one per word stored in this
* WordCountMap, sorted alphabetically by word.
*/
public ArrayList<WordCount getWordCountsByWord() {
...
}
You may (and probably should!) create other methods to help you implement these
methods in your class. Make sure that you implement these methods exactly as described,
as the grader and I will be building some testing scripts that use these methods. Lastly,
note that ArrayList should be the built-in Java implementation of a List that uses an array
to store the data.
WordCounter
This class will contain your main program and any support methods it needs. The main
program will be in charge of opening a text file, counting all of the words it contains, and
producing one of three types of output. Specifically, your main method here will parse the
command-line arguments to support the following three ways of running the program:
1. You should be able to run the program using the following commands:
2. java WordCounter alphabetical [textFileName]
This will print out a list of words and their occurrence counts, one word per line,
each line consisting of a word, a colon, and the word's count. This list will be sorted
alphabetically by word. For example, the output might look like the following:
ealexander$ java WordCounter alphabetical FederalistPapers.txt
abandon: 5
abandoned: 3
abandoning: 1
abate: 1
abatements: 1
abbe: 4
abetted: 2
abhorrence: 1
abide: 1
...
3. You should be able to run the program using the following commands:
4. java WordCounter frequency [textFileName]
This is the same as the the alphabetical case, except the words will be sorted in
decreasing order by frequency (or count). For example, the output might look like
the following:
ealexander$ java WordCounter frequency FederalistPapers.txt
states: 865
government: 830
state: 792
its: 658
power: 620
people: 615
no: 601
those: 481
constitution: 468
...
5. You should be able to run the program using the following commands:
6. java WordCounter cloud [textFileName] [numberOfWordsToInclude]
This command will write HTML to a file containing a word cloud based on the code
I've given you (read through the code and comments in WordCloudMaker.java to
figure out what methods you should call to do this). A typical invocation of the cloud
generator would be:
ealexander$ java WordCounter cloud FederalistPapers.txt 40
This would generate the word cloud based on the 40 most common non-stopwords
in FederalistPapers.txt. (If FederalistPapers.txt contains fewer than 40 nonstopwords, then the cloud will just use all the words.) In your code, you will need to
create a reasonable filename to contain the HTML. Make sure that it has the
extension .html at the end of it. Ideally, you would take the name of the text file
being passed in and use that to name the HTML file, though I will not take off points
for how you create the filename. So, for instance, the command:
ealexander$ java WordCounter cloud FederalistPapers.txt 40
would create a file called FederalistPapers.html.
Other Notes, Requirements and Suggestions
• Make sure that WordCountMap implements a binary search tree based on
the alphabetical ordering of the words. You should not use other data structures
included with Java to do this. You must implement the tree structure directly
yourself.
• I've provided you a file called StopWords.txt that contains a list of common English
stop words. You should write your code so that any word which appears in this list
is not included in the results that you produce. How exactly you choose to do this is
up to you, but you should think about the efficiency of your code for this portion.
• At least one of your getWordCountsBy methods in your WordCountMap should use a
tree traversal to get the list of words and counts in the correct order. For the other
method, you can use built in Java methods if you'd like (such
as Collections.sort()).
• We talked briefly about command-line arguments at the beginning of the term,
though we did not dive into them in great depth. If you don't remember this well,
look back at the code I gave you for HW1 (LineReader.java) or in section A.10 of
your book for more information---or ask questions about it!
• I do not care if you treat "alacrity" or "Alacrity" as the same word or different words
for the purpose of this assignment. However, note that StopWords.txt only
contains words in lower case. As such, when determining whether a word is a
stopword, case should be ignored.
• When reading from a text file, you should strip punctuations from words. One way
to do this is to use the String method replaceAll(). In particular, this method can
use something called a regular expression to identify and then remove certain
patterns of characters. If you want to find and remove any character in a
String s that is NOT a letter, you could use the following line of code:
• s = s.replaceAll("[^a-zA-Z]","");
• You do NOT need to make any changes to WordCloudMaker.java. But if you want
to make changes, go ahead! The same applies to StopWords.txt.
• Feel free to try your word cloud on other documents! Many can be found at Project
Gutenberg.
• Keep modularity and encapsulation in mind as you design your code. Make sure
your comments indicate what the instance variables you choose represent.
Submission and Grading
Create a zip file of the following, to be handed in via Moodle:
• WordCountMap.java
• WordCounter.java
• WordCloudMaker.java
• WordCount.java
• Any other files used by your code.
Only one partner from each pair needs to submit the assignment. Specifically, put these
files into a directory named [your_last_name_your_partner's_last_name]HW8, zip this
directory, upload it to Moodle.
This assignment is worth 24 points.
• 18 points for the assignment requirements.
• 6 points for comments, style and design.
Extra Challenge?
There are a number of things you can do with this assignment if you are looking for an
extra challenge. Here are a few suggestions.
• Implement a self balancing tree (although if you do this, please do submit your code
for your regular binary search tree as well). Here's your chance to practice AVL
Trees, or to try out one of the other self-balancing trees discussed in the book!
• Add additional functionality to your WordCountMap class. Maybe
a decremenCount() method. Maybe add a different kind of tree traversal.
• Play around with the WordCloudMaker.java code to change the look of the cloud
you create. How might we adjust position or font based off of the data being
shown? Are there other uses for color that we might want?
Start early, ask lots of questions, and have fun! Eric, the lab assistants, and the prefect are
all here to help you succeed!

More products