$29.99
Machine Learning (B555)
Programming Project 4
Overview: Experiments with LDA
In this programming project we implement Latent Dirichlet Allocation (LDA) and inspect its
performance both in an unsupervised manner and when used as a preprocessing step for supervised
learning. Your goals in this assignment are to (i) implement the collapsed Gibbs sampler for LDA
inference, and (ii) compare the LDA topic representation to a “bag-of-words” representation with
respect to how well they support document classification.
Data
Data for this assignment is provided in a zip file pp4data.zip on Canvas. Each dataset is given in
a separate sub-directory.
We will use a subset of the well known 20 newsgroups dataset. The subset consists of 200 documents
that have been pre-processed and “cleaned” so they do not require any further manipulation, i.e.,
you only have to read the space-separated strings from the ASCII text files. Each document belongs
to one of two classes. The file index.csv holds the true labels of each document. Labels are ignored
in task 1 but used in task 2.
We have prepared an additional small dataset, artificial, for developing your implementation. Running your sampler on this dataset with K = 2 and parameters as below, you should find that the
three most frequent words in the two topics are {bank, river, water} and {dollars, bank, loan} (not
necessarily in those orders).
1
Task 1: Gibbs Sampling
In this portion, your task is to implement the collapsed Gibbs sampler for LDA. In the case of
LDA, the output represents a sample of the (hidden) topic variables for each word. Recall that in
LDA we sample the hidden topic variables associated with words in the text. This sample of topic
variables can be used to calculate topic representations per document. Algorithm 1 describes one
possible implementation of the collapsed Gibbs sampler.
For this project, fix the number of iterations to run the sampler at Niters = 500. The Dirichlet
parameter for the topic distribution is α1 where 1 is a vector of ones with K entries (K is the
number of topics), and α =
5
K
. The Dirichlet parameter for the word distribution is β1 where 1 is
a vector of ones with V entries (V is the size of the vocabulary), and β = 0.01.
We suggest testing your implementation first on the artificial dataset with K = 2 because you
know what to expect and the run time is shorter. Once you have verified that your implementation
works correctly, run your sampler with K = 20 on the 20 newsgroups dataset. After the sampler
has finished running, output the 5 most frequent words of each topic into a CSV file, topicwords.csv,
where each row represents a topic. Include these results in both your report and submission. In
your report discuss the results obtained (i.e., the topics). Do the topics obtained make sense for
the dataset?
Finally, you will need the topic representations for the next part. For a document doc, this will be
a vector of K values, one for each topic, where the kth value is given by Cd(doc,k)+α
Kα+
P
l
Cd(doc,l)
and Cd is
output from the sampler.
Task 2: Classification
In this portion we will evaluate the dimensionality reduction accomplished by LDA in its ability to
support document classification and compare it to the bag of words representation.
The first step is to prepare the data files for the two representations. The first is given by the topic
representation of the previous section, where each document is represented by a feature vector of
length K. The second representation is the “bag-of-words” representation. This representation has
a feature for each word in the vocabulary and the value of this feature is the number of occurrences
of the corresponding word in the document.
For the evaluation we will reuse the logistic regression implementation from project 3, in particular,
your implementation of Newton’s method for this problem. You should use the value α = 0.01 for
the regularization parameter of logistic regression in this part.
Your task is to generate learning curves in the same way you did there: Step 1) Set aside 1/3 of the
total data (randomly selected) to use as a test set. Step 2) Record test set classification error as
a function of increasing training set size (with each training set randomly selected from the other
2/3 of the total data). Repeat Steps 1 & 2 a total of 30 times to generate learning curves with
error bars (i.e., ±1σ). Performance is defined as classification accuracy on the test set.
Plot the learning curve performance of the logistic regression algorithm (with error bars) on the
two representations. Then discuss your observations on the results obtained, as well as the run
time of the algorithms.
2
Algorithm 1 Collapsed Gibbs sampler for LDA
Require: Number of topics K, Dirichlet parameter for topic distribution α, Dirichlet parameter
for word distribution β, number of iterations to run sampler Niters, array of word indices w(n),
array of document indices d(n), and array of initial topic indices z(n), where n = 1 . . . Nwords
and Nwords is the total amount of words in the corpus.
1: Generate a random permutation π(n) of the set {1, 2, . . . , Nwords}
2: Initialize a D×K matrix of topic counts per document Cd, where D is the number of documents
3: Initialize a K × V matrix of word counts per topic Ct
, where V is the number of words in the
vocabulary
4: Initialize a 1 × K array of probabilities P (to zero)
5: for i = 1 to Niters do
6: for n = 1 to Nwords do
7: word ← w(π(n))
8: topic ← z(π(n))
9: doc ← d(π(n))
10: Cd(doc, topic) ← Cd(doc, topic) − 1
11: Ct(topic, word) ← Ct(topic, word) − 1
12: for k = 1 to K do
13: P(k) = Ct(k,word)+β
V β+
P
j
Ct(k,j)
Cd(doc,k)+α
Kα+
P
l
Cd(doc,l)
14: end for
15: P ← normalize P
16: topic ← sample from P
17: z(π(n)) ← topic
18: Cd(doc, topic) ← Cd(doc, topic) + 1
19: Ct(topic, word) ← Ct(topic, word) + 1
20: end for
21: end for
22: return {z(n)}, Cd, Ct
Additional Notes
• Please submit the logistic regression implementation with this project so that your code can
be used and tested without further manipulation.
• The run time for this project is non-negligible. A Matlab / Python implementation of the
collapsed Gibbs sampler for the 20 newsgroups dataset might take minutes 10 or more to
run and might be longer if your implementation is not optimized. The many runs of logistic
regression for the learning curves also require some time.
Submission
Please submit two separate items via Canvas:
(1) A zip file pp4.zip with all your work and the report. The zip file should include: (1a) Please
write a report on the experiments, include all plots and results, and your conclusions as requested
above. Prepare a PDF file with this report. (1b) Your code for the assignment, including a
3
README file that explains how to run it. When run your code should produce all the results and
plots as requested above. Your code should assume that the data files will have names as specified
above and will reside in sub-directory pp4data/ of the directory where the code is executed. We
will read your code as part of the grading – please make sure the code is well structured and easy
to follow (i.e., document it as needed). This portion can be a single file or multiple files.
(2) One PDF “printout” of all contents in 1a,1b: call this YourName-pp4-everything.pdf. One
PDF file which includes the report, a printout of the code and the README file. We will use
this file as a primary point for reading your submission and providing feedback so please include
anything pertinent here.
Grading
Your assignment will be graded based on (1) the clarity of the code, (2) its correctness, (3) the
presentation and discussion of the results, (4) The README file and our ability to follow the
instructions and test the code.
4