CS322 Assignment 3
Written Questions: For the written portion, you will be exploring vector semantics of
words, using a pretrained word2vec model. The word2vec model you’ll be using can be
downloaded at https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit.
These embeddings result from training the word2vec model described in §6.8 of Jurafsky
and Martin on approximately 100 billion tokens scraped from news articles. The vector
dimension is 300. For the “written” questions, you’ll be writing some python scripts to
explore the relationships between these pretrained embeddings. You don’t need to turn in
these python scripts, but you will need to describe some of the outputs. For this purpose,
you will be using the gensim library (https://github.com/RaRe-Technologies/gensim) and its
functions for word2vec model loading/exploring. The documentation for the useful functions
is available here: https://radimrehurek.com/gensim/models/word2vec.html
Because the word2vec embeddings take a significant amount of time to load (on my
machine, they take about 30 seconds to load into memory), I’d highly recommend either a)
using an ipython notebook to explore them, so you don’t need to re-load every time you want
to try something new, or b) using the limit parameter of gensim.models.keyedvectors.
KeyedVectors.load_word2vec_format to limit the number of vectors you load during testing.
A minimal loading sketch appears below.
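
For reference, a rough loading sketch along these lines (the file name is an assumption;
point it at wherever you saved the download):

    # Minimal loading sketch. The file name is an assumption: adjust the
    # path to wherever you saved the downloaded vectors.
    from gensim.models.keyedvectors import KeyedVectors

    model = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin.gz",
        binary=True,    # the pretrained vectors are in binary word2vec format
        limit=100000,   # small limit for quick tests; raise or remove it for the questions
    )
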
1. Load the word2vec model using gensim.models.keyedvectors.KeyedVectors.load_-
word2vec_format. Use the python help function and the documentation to become
familiar with the functions/class members of this object. Note that the word2vec KeyedVectors object has a dictionary member vocab. What is the length of this vocabulary
— i.e., how many words are contained in the model? Given the vocabulary size — how
many parameters does this model have? (A sketch of the relevant gensim calls for
questions 1 and 2 appears after question 6.)
2. Re-load the model, except load with the parameter limit=500000 to limit the model to
load the 500K most common words (otherwise, you’ll run into out-of-memory issues!).
Use the most_similar function to find the 10 most similar words to “happy”, “sad”,
“Twitter”, (“Carleton” + “College”), and one word of your choosing. Does anything
interesting pop out? What is the cosine similarity between “happy” and “overjoyed”?
What is the cosine similarity between “Twitter” and “Facebook”?
3. Using the same function, what are the closest neighbors to “king” - “man” + “woman”?
What do you observe? What about: “Ottawa” - “Canada” + “Brazil”? What do
you observe? What about: “Carleton” + “College” - “Northfield” + “Twin Cities”?
Try an experiment of this form using your own words. Before running it — what word
do you expect to be among the most similar? Did it appear? (A sketch of these
vector-arithmetic calls also appears after question 6.)
4. Can the average word2vec embedding help us figure out what movie we’ve seen, even
if we can’t remember the title? Say that we saw a movie with the characters named:
“Neo”, “Trinity”, “Morpheus”, “Cypher”, and “Mouse”, but can’t remember the name.
What would you guess the name of this movie is, based on the most similar word2vec
embedding to the sum of the vectors? What is the associated cosine similarity for that
movie? (If you don’t know the movie that we are searching for — you are welcome to
google this list of character names for a hint).
5. Speaking of which, can the word2vec model recommend movies to us? Say we saw a
movie called “Bourne Identity” (this is a spy-thriller movie starring Matt Damon: it’s...
okay.). According to word2vec, what are the 3 most similar movies to this one?
6. Try putting in the name of your favorite movie (at least — the name of your favorite movie popular enough to be in the 500K most common vocabulary words in the
model...). If you can’t think of one — try your favorite book or song instead (keeping
in mind that this model was trained using news articles from several years ago). Do
you agree with the most similar vectors? Were there any surprises?
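
As referenced above, a rough sketch of the gensim calls behind questions 1 and 2. It assumes
model is the KeyedVectors object loaded earlier; note that in newer gensim releases the
vocab member is replaced by key_to_index.

    # Question 1: vocabulary size and a parameter count
    # (one 300-dimensional vector per vocabulary word).
    vocab_size = len(model.vocab)            # model.key_to_index in gensim 4.x
    n_parameters = vocab_size * model.vector_size
    print(vocab_size, n_parameters)

    # Question 2: nearest neighbors and pairwise cosine similarities.
    print(model.most_similar("happy", topn=10))
    print(model.most_similar(positive=["Carleton", "College"], topn=10))
    print(model.similarity("happy", "overjoyed"))
    print(model.similarity("Twitter", "Facebook"))
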
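And a sketch of the vector arithmetic for questions 3 and 4. The positive/negative
arguments of most_similar handle the addition and subtraction; the “Twin_Cities” token is
an assumption (multi-word entries in this model typically use underscores, so check the
vocabulary if a phrase you expect seems to be missing).

    import numpy as np

    # Question 3: "king" - "man" + "woman" and similar analogies.
    print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=10))
    print(model.most_similar(positive=["Ottawa", "Brazil"], negative=["Canada"], topn=10))
    print(model.most_similar(positive=["Carleton", "College", "Twin_Cities"],
                             negative=["Northfield"], topn=10))

    # Question 4: nearest word to the sum of the character-name vectors.
    names = ["Neo", "Trinity", "Morpheus", "Cypher", "Mouse"]
    summed = np.sum([model[name] for name in names], axis=0)
    print(model.similar_by_vector(summed, topn=5))
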
Programming Assignment: For the programming assignment, you will be exploring the
relationship between text documents. Specifically, we’ll be taking a break from movie reviews
to check out 45K Kickstarter projects! If you don’t know Kickstarter, you can check out the
crowdfunding site here: www.kickstarter.com. To quote from their site: “Kickstarter helps
artists, musicians, filmmakers, designers, and other creators find the resources and support
they need to make their ideas a reality. To date, tens of thousands of creative projects big
and small have come to life with the support of the Kickstarter community.” In short —
folks write project descriptions in the hopes that others will offer them funding. We will be
analyzing the descriptions of those projects.
Please program your solutions to the following questions inside the main function in the
indicated positions. You’re welcome to use helper functions, but please retain the question
0/1/2/3/4 annotations in the comments for grading purposes.
Finally — I do want you to run this code over all 45K projects, but I have included a
project count limiter, set by default to 5K; this is to help your program run faster as you
develop it. Once you are happy with your code, remove this limit by setting it to
None. With the limit removed, my code takes around 5 minutes to run (annoyingly slow for
debugging... but not unreasonable), with the slowest pieces being the truncated SVDs
and the TSNE.
0. Print out the title/description of the first project in the list. What is it? What is it
about?
1. Extract vector representations of each document using sklearn.feature_extraction.
text.TfidfVectorizer. Set the parameters of the TfidfVectorizer so that there are
no more than 20K words in the vocabulary, each word appears at least 20 times, and
no word appears in more than 30% of documents. These constraints are satisfiable
by setting parameters when constructing your TfidfVectorizer object — you should
not code them up yourself; rather, you should read the documentation and set the
values appropriately. Then, use sklearn.preprocessing.normalize to L2 normalize
the resulting document-term matrix by row, so that each document is represented by
a magnitude 1 tfidf vector. (A sketch of this setup, for questions 1 and 2, appears
after question 4.)
2. The 134th document is called “The art of sloth survival”. Find the 10 most similar
documents to “The art of sloth survival” according to tfidf cosine similarity. What are
the titles of these most similar projects?
3. Use a truncated singular value decomposition (SVD) (sklearn.decomposition.TruncatedSVD)
to reduce the dimensionality of the document-term matrix to 500 dimensions (use the
n_components argument). Make (and include in your submission) a plot of the magnitude
of the singular values as the dimension ranges from 1 to 500 (as a hint — these
are stored in the TruncatedSVD object under the name singular_values_). The n-th
singular value of a singular value decomposition corresponds (roughly) to how much of
the original data matrix is being explained by including that dimension. Based on the
“knee point” of this plot — decide on a new dimension less than 500 that represents
a good trade-off between compactness and proportion of the original data explained.
What value did you decide on and why? (A sketch of the SVD and TSNE steps, for
questions 3 and 4, also appears after question 4.)
4. Compute a new truncated SVD with your newly chosen dimension (use the n_components
argument); then, use sklearn.manifold.TSNE to project the first 3K datapoints
(i.e., first 3K Kickstarter projects) of your new data matrix (i.e., the one that is projected
to your newly chosen dimension less than 500) down to 2 dimensions (use the
n_components argument). Finally — make a scatterplot of the first 3K kickstarter
projects in your data matrix using the 2D output of the TSNE as the x and y coordinates.
Include this plot, along with some observations about it, with your submission.
I’d highly recommend passing the verbose=1 argument to TSNE so you can track its
progress.
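
As referenced in question 1, a sketch of the tf-idf setup and the similarity lookup for
questions 1 and 2. The descriptions and titles variables stand in for however the starter
code exposes the project text and project titles (those names are assumptions), and
double-check whether the “134th document” corresponds to index 133 or 134 in your list.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import normalize

    # Question 1: vocabulary capped at 20K terms, terms must appear in at least
    # 20 documents (min_df), and in no more than 30% of documents (max_df).
    vectorizer = TfidfVectorizer(max_features=20000, min_df=20, max_df=0.3)
    doc_term = vectorizer.fit_transform(descriptions)
    doc_term = normalize(doc_term, norm="l2", axis=1)   # every row now has magnitude 1

    # Question 2: with unit-length rows, cosine similarity is just a dot product.
    sims = (doc_term @ doc_term[134].T).toarray().ravel()
    for idx in sims.argsort()[::-1][1:11]:              # skip the document itself
        print(titles[idx], sims[idx])
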
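And a sketch of the SVD and TSNE steps for questions 3 and 4. The value of k below is a
placeholder for whatever knee-point dimension you settle on.

    import matplotlib.pyplot as plt
    from sklearn.decomposition import TruncatedSVD
    from sklearn.manifold import TSNE

    # Question 3: 500-dimensional truncated SVD and the singular-value plot.
    svd = TruncatedSVD(n_components=500)
    svd.fit(doc_term)
    plt.plot(range(1, 501), svd.singular_values_)
    plt.xlabel("dimension")
    plt.ylabel("singular value")
    plt.savefig("singular_values.png")

    # Question 4: re-project to the chosen dimension, then TSNE the first 3K projects.
    k = 100   # placeholder: replace with your knee-point choice
    reduced = TruncatedSVD(n_components=k).fit_transform(doc_term)
    coords = TSNE(n_components=2, verbose=1).fit_transform(reduced[:3000])
    plt.figure()
    plt.scatter(coords[:, 0], coords[:, 1], s=2)
    plt.savefig("tsne_scatter.png")
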
