
Large-Scale Data Mining: Models and Algorithms ECE 219
Project 1: End-to-End Pipeline to Classify News Articles

Overview
Statistical classification broadly refers to the task of learning to identify the subset of categories that pertain to a data point (a sample of text, an image, a video clip, a time signal, etc.) from a predefined (generally human-curated) larger set of categories. The model attempts to master the task given a training dataset (kept separate from an evaluation set) in which each data point is pre-labeled with its "correct" category membership(s). In this project, we deal with the classification of text data. The project consists of building an end-to-end pipeline to classify samples of news articles and involves the following ML components:
1. Feature Extraction: Construction of TF-IDF representations of textual data;
2. Dimensionality Reduction: Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF), generally necessary for classical ML methods;
3. Application of Simple Classification Models: Applying common classification methods, such as logistic/linear classification and Support Vector Machines, to the extracted features;
4. Evaluation of the Pipeline: Evaluating and diagnosing classification results using GridSearch and cross-validation;
5. Replacing corpus-level features with pretrained features: Applying pre-training to a downstream classification task and evaluating comparative performance.
Getting familiar with the dataset
Please access the dataset at this link. We are using a custom dataset that was designed specifically for this quarter.¹ Note: Do not attempt to open the downloaded file in Excel - the formatting of the file when visualized in Excel might suggest the data is corrupted, but this is not true. Consider exploring the dataset using Pandas. You might find the following Pandas functions helpful (in no specific order): read_csv, head, hist, shape.
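For reference, a minimal exploration sketch along these lines (the file path "dataset.csv" is an assumption; the column names full_text, leaf_label and root_label are taken from the rest of this handout):

import pandas as pd

df = pd.read_csv("dataset.csv")               # path is an assumption
print(df.shape)                               # (number of samples, number of features)
print(df.head())

# alpha-numeric character counts per sample, for the histogram in Question 1(a)
alnum_counts = df["full_text"].str.count(r"[A-Za-z0-9]")
alnum_counts.hist(bins=50)

# class frequencies for Questions 1(b) and 1(c)
df["leaf_label"].value_counts().plot(kind="bar")
df["root_label"].value_counts().plot(kind="bar")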
QUESTION 1: Provide answers to the following questions:
• Overview: How many rows (samples) and columns (features) are present in the dataset?
¹This dataset was extracted from the search feature in The GDELT Project and a recursive crawler that traverses resulting news links.
• Histograms: Plot 3 histograms: (a) the total number of alpha-numeric characters per data point (row) in the feature full_text, i.e. count on the x-axis and frequency on the y-axis; (b) the column leaf_label - class on the x-axis; (c) the column root_label - class on the x-axis.
• Interpret Plots: Provide qualitative interpretations of the histograms.
The two sets of labels, leaf_label and root_label, are hierarchically arranged as follows:

root_label   sports                                     climate
leaf_label   chess, cricket, hockey, soccer, football   %22forest%20fire%22, flood, earthquake, drought
Binary Classification
For the first part of the project, we will be using only the full_text column as the raw features per sample (row) and the root_label column as the label for each sample. The root labels are well separated. Before continuing, please set the random seeds as follows to ensure consistency:
import numpy as np
import random
np.random.seed(42)
random.seed(42)
1 Splitting the entire dataset into training and testing data
In order to measure the performance of our binary classification model, we split the dataset into a training set and a testing set. The model is trained on the training set and evaluated on the testing set. Note: Do not train on the testing set. We create the sets from a Pandas dataframe df that contains the entire dataset. Please make sure that the random seeds are set and that the fraction of the test set is 0.2:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df[["full_text","root_label"]], test_size=0.2)
train and test are dataframes containing the rows pertaining to the training data and testing data, respectively.
QUESTION 2: Report the number of training and testing samples.
2 Feature Extraction
The primary step in classifying a corpus of text is choosing a good representation of each data point. Since the full_text column contains the raw features describing each data point, we seek a feature extraction module that encodes the raw text into processed, computationally compatible features.
A good representation should retain enough class-discriminating information post-processing to perform classification competitively, yet at the same time be concise enough to avoid computational intractability and overfitting.
A first model: The Bag-of-Words (BOW): One feature extraction technique is the "Bag of Words" model, where a document - in this case the full_text feature of one data point - is represented as a histogram of word frequencies, or other statistics, within a fixed vocabulary of words. The vocabulary is aggregated from the training set (the process is described below).
Compiling the Vocabulary: (a) Each raw feature (text segment) is split into sentences; (b) Each sentence is split into words; (c) The list of words per sentence is passed jointly into a part-of-speech tagger to identify the nouns, verbs, adjectives, etc.; (d) These tags are used to lemmatize/stem² each word; (e) Words are filtered: very rare or very frequent words (measured by the number of documents they occur in, or the number of times they occur within a document) are removed, digit- and punctuation-dominant words are removed, and stopwords - words contained in a database of common words - are removed; (f) The remaining words are added to the vocabulary.
Say that the selected set of words that compose the vocabulary forms a set W. For a dataset D, the processed features can be collectively represented as a data matrix X ∈ R^(|D|×|W|). Each row captures the count of each word (the histogram) per data point.
An Example: If a test data point’s raw feature contains the text,
“On Saturday, the NHL hockey team went to the school to volunteer and educate.
Outreach is required for hockey, which is dying in popularity in recent years.”
and W = ["hockey", "volunteer", "sport"], the row in X corresponding to this data point would be [2, 1, 0] because "hockey" appears twice, "volunteer" appears once, and "sport" does not appear at all (though it might appear in another data point; remember the vocabulary is aggregated across the training set).
During Testing: Each text sample in the testing set is processed similarly to those during training; however, words are no longer added to W - the vocabulary was fixed during training. Instead, if a word that exists in the vocabulary occurs among the processed words from a testing sample, its count is incremented in the resulting feature matrix X.
To avoid adding to the vocabulary during testing, and to avoid training on the test set in general, please note as a rule: For most of this project you will be using the NLTK and scikit-learn (sklearn) libraries for text processing and classifier design. In sklearn in particular, only use the functions fit_transform and fit on the training set, and only use transform on the testing set.
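For example, with sklearn's CountVectorizer this rule looks as follows (train and test are the dataframes from the split above; min_df=3 anticipates Question 3):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=3, stop_words="english")
X_train_counts = vectorizer.fit_transform(train["full_text"])  # learns the vocabulary from training data only
X_test_counts = vectorizer.transform(test["full_text"])        # reuses the fixed training vocabulary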
The better model: The Term Frequency-Inverse Document Frequency (TF-IDF) Model ("document" and "data point" are used interchangeably): While the Bag-of-Words model continues to be used in many feature extraction pipelines, a normalized count vector that not only counts the number of times a word occurs in a document (the term frequency) but also scales the resulting count by the number of documents the word appears in (the document frequency) might provide a more valuable feature extraction approach.
The focus on the document frequency (and, more correctly, its inverse) encourages the feature vector to discriminate between documents as well as represent a document. TF-IDF does not only ask "What is the frequency of different words in a document?" but rather "What is the frequency of words in a document that is specific to that document and differentiates it from other documents?" A human reading a particular news article in the sports section will usually ignore contextually dominant words such as "sport", "competition", and "player", despite these words being frequent in every news article in the sports section.

²Lemmatization: Lemmatization uses a pre-learned vocabulary and morphological information of words in sentence context, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Stemming: Stemming is a crude heuristic that removes the rightmost characters of words in the hope of mapping multiple word derivatives to the same root word.
If confronted with the word "saw", stemming might return just "s", whereas lemmatization would attempt to return either "see" or "saw" depending on whether the use of the word was as a verb or a noun.
Such a context-based conditioning of information is widely observed. The human perception
system usually applies a saturating function (such as a logarithm or square-root) to the actual
input values into the vision model before passing it on to the neuronal network in the brain.
This makes sure that a contextually dominant signal does not overwhelm the decision-making
processes in the brain. The TF-IDF functions draw their inspiration from such neuronal systems. Here we define the TF-IDF score to be

TF-IDF(d, t) = TF(d, t) × IDF(t)

where TF(d, t) represents the frequency of the (processed, lemmatized, otherwise filtered) word t in document d, and the inverse document frequency is defined as

IDF(t) = log(n / DF(t)) + 1

where n is the total number of documents, and DF(t) is the document frequency, i.e. the number of documents that contain the word t.
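As a quick, made-up numerical illustration of this definition: if a term t occurs 3 times in a document d, and 10 of the n = 1000 training documents contain t, then TF-IDF(d, t) = 3 × (log(1000/10) + 1) ≈ 3 × 5.61 ≈ 16.8 (using the natural logarithm). Note that sklearn's TfidfTransformer additionally smooths the document counts and L2-normalizes each row by default, so the values it produces will differ slightly from this formula.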
import re

def clean(text):
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    texter = re.sub(r"<br />", " ", text)
    texter = re.sub(r"&quot;", "\"", texter)
    texter = re.sub('&#39;', "\"", texter)
    texter = re.sub('\n', " ", texter)
    texter = re.sub(' u ', " you ", texter)
    texter = re.sub('`', "", texter)
    texter = re.sub(' +', ' ', texter)
    texter = re.sub(r"(!)\1+", r"!", texter)
    texter = re.sub(r"(\?)\1+", r"?", texter)
    texter = re.sub('&amp;', 'and', texter)
    texter = re.sub('\r', ' ', texter)
    clean = re.compile('<.*?>')
    texter = texter.encode('ascii', 'ignore').decode('ascii')
    texter = re.sub(clean, '', texter)
    if texter == "":
        texter = ""
    return texter
QUESTION 3: Use the following specs to extract features from the textual data:
• Before doing anything, please clean each data sample using the code block provided above.
This function helps remove many but not all HTML artefacts from the crawler’s output. You
can also build your own cleaning module if you find this function to be ineffective.
• Use the "english" stopwords of the CountVectorizer
• Exclude terms that are numbers (e.g. "123", "-45", "6.7", etc.)
• Perform lemmatization with nltk.stem.wordnet.WordNetLemmatizer and pos_tag
• Use min_df=3
Please answer the following questions:
• What are the pros and cons of lemmatization versus stemming? How do these processes affect the dictionary size?
• min_df means minimum document frequency. How does varying min_df change the TF-IDF matrix?
• Should I remove stopwords before or after lemmatizing? Should I remove punctuation before or after lemmatizing? Should I remove numbers before or after lemmatizing? Hint: Recall that the full sentence is input into the lemmatizer, and the lemmatizer tags every word based on the sentence structure.
• Report the shape of the TF-IDF-processed train and test matrices. The number of rows should match the results of Question 2. The number of columns should roughly be on the order of a few thousand (k × 10³). This dimension will vary depending on your exact method of cleaning and lemmatizing, and that is okay.
The following sklearn classes will be useful: CountVectorizer, TfidfTransformer, and, for the daring, Pipeline. Please refer to the discussion section notebooks for more guidance.
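One possible way to assemble these pieces is sketched below. It assumes train and test are the dataframes from Question 2 and clean is the function given above; the POS-tag mapping helper and tokenizer names are illustrative, not required:

import re
import nltk
from nltk import pos_tag, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

def penn_to_wordnet(tag):
    # map Penn Treebank POS tags to the coarse tags WordNetLemmatizer expects
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def lemmatize_tokens(doc):
    tagged = pos_tag(word_tokenize(doc))
    lemmas = [lemmatizer.lemmatize(w, penn_to_wordnet(t)) for w, t in tagged]
    # drop tokens that are pure numbers, e.g. "123", "-45", "6.7"
    return [w for w in lemmas if not re.fullmatch(r"-?\d+(\.\d+)?", w)]

count_vect = CountVectorizer(min_df=3, stop_words="english", tokenizer=lemmatize_tokens)
tfidf = TfidfTransformer()

train_text = train["full_text"].apply(clean)
test_text = test["full_text"].apply(clean)

X_train_tfidf = tfidf.fit_transform(count_vect.fit_transform(train_text))  # fit on training data only
X_test_tfidf = tfidf.transform(count_vect.transform(test_text))            # transform only on test data
print(X_train_tfidf.shape, X_test_tfidf.shape)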
3 Dimensionality Reduction
After applying the above operations, the dimensionality of the representation vectors (TF-IDF
vectors) is large. Classical learning algorithms, like the ones required in this section, however,
may perform poorly with such high-dimensional data. Since the TF-IDF matrix is sparse and
low-rank, as a remedy, one can project the points from the larger dimensional space to a lower
dimension.
In this project, we use two dimensionality reduction methods: Latent Semantic Indexing (LSI)
and Non-negative Matrix Factorization (NMF), both of which minimize the Mean Squared
residual Error (MSE) between the original TF-IDF data matrix and a reconstruction of the
matrix from its low-dimensional approximation. Recall that our data is the term-document TF-IDF matrix, whose rows correspond to the TF-IDF representations of the documents, i.e.

      [ tfidf(d1, t1)  · · ·  tfidf(d1, tm) ]
  X = [ tfidf(d2, t1)  · · ·  tfidf(d2, tm) ]
      [       ⋮          ⋱          ⋮       ]
      [ tfidf(dn, t1)  · · ·  tfidf(dn, tm) ]
Latent Semantic Indexing (LSI): The LSI representation is obtained by computing the left and right singular vectors corresponding to the top k largest singular values of the term-document TF-IDF matrix X.
We perform Singular Value Decomposition (SVD) on the matrix X, resulting in X = UΣVᵀ, where U and V are orthogonal. Let the singular values in Σ be sorted in descending order; then the first k columns of U and V are called Uk and Vk respectively. Vk consists of the principal components of the matrix X in the feature space.
Then we use XVk (which is also equal to UkΣk) as the dimension-reduced data matrix, where rows still correspond to documents, only now each data point is represented in a (far) lower-dimensional space. In this way, the number of features is reduced. LSI is similar to Principal Component Analysis (PCA); see the lecture notes for their relationship.
Having obtained U and V, to reduce the test data, we ONLY multiply the test TF-IDF matrix Xt by Vk, i.e. Xt,reduced = XtVk. By doing so, we project the test TF-IDF vectors onto the principal components previously learned from training, and use the projections as the dimensionality-reduced data.
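A minimal LSI sketch with sklearn, assuming X_train_tfidf and X_test_tfidf are the matrices from Question 3:

from sklearn.decomposition import TruncatedSVD

lsi = TruncatedSVD(n_components=50, random_state=42)
X_train_lsi = lsi.fit_transform(X_train_tfidf)      # equals X Vk (= Uk Σk) for the training data
X_test_lsi = lsi.transform(X_test_tfidf)            # projects test data onto the learned components
print(lsi.explained_variance_ratio_.sum())          # total explained variance ratio for this k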
Non-negative Matrix Factorization (NMF): NMF tries to approximate the data matrix X ∈ R^(n×m) (n = |D| docs and m = |W| terms) with WH (W ∈ R^(n×r), H ∈ R^(r×m)). Concretely, it finds the non-negative matrices W and H such that ∥X − WH∥²_F is minimized, where ∥A∥_F ≡ √(Σi,j A²ij) is the Frobenius norm.
Then we use W as the dim-reduced data matrix, and in the fit step we calculate both W and H. The intuition behind this is that we are trying to describe the documents (the rows in X) as a (non-negative) linear combination of r topics:

      [ x1ᵀ ]          [ w11  w12  · · ·  w1r ] [ h1ᵀ ]
  X = [  ⋮  ]  ≈  WH = [  ⋮    ⋮    ⋱     ⋮   ] [  ⋮  ]
      [ xnᵀ ]          [ wn1  wn2  · · ·  wnr ] [ hrᵀ ]

Here we see h1ᵀ, . . . , hrᵀ as r "topics", each of which consists of m scores indicating how important each term is in the topic. Then

  xiᵀ ≈ wi1 h1ᵀ + wi2 h2ᵀ + · · · + wir hrᵀ,   i = 1, . . . , n.
Now how do we calculate the dim-reduced test data matrix? Again, we try to describe the document vectors (rows, by our convention here) in the test data (call it Xt) with (non-negative) linear combinations of the "topics" we learned in the fit step. The "topics", again, are the rows of the H matrix, h1ᵀ, . . . , hrᵀ. How do we do that? Just solve the optimization problem

  min_{Wt ≥ 0}  ∥Xt − WtH∥²_F

where H is fixed as the H matrix we learned in the fit step. Then Wt is used as the dim-reduced version of Xt.
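A corresponding NMF sketch with sklearn, under the same assumptions about X_train_tfidf and X_test_tfidf (converting to dense for the residual is optional and only advisable at a modest dataset scale):

import numpy as np
from sklearn.decomposition import NMF

nmf = NMF(n_components=50, init="random", random_state=42)
W_train = nmf.fit_transform(X_train_tfidf)   # the dimension-reduced training data
H = nmf.components_                          # the r "topics" (rows of H)
W_test = nmf.transform(X_test_tfidf)         # solves min_{Wt >= 0} ||Xt - Wt H||_F^2 with H fixed

# squared Frobenius reconstruction residual for Question 4
residual = np.linalg.norm(X_train_tfidf.toarray() - W_train @ H) ** 2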
QUESTION 4: Reduce the dimensionality of the data using the methods above:
• Plot the explained variance ratio across multiple different k = [1, 10, 50, 100, 200, 500, 1000, 2000] for LSI, and for the next few sections choose k = 50. What does the explained variance ratio plot look like? What does the plot's concavity suggest?
• With k = 50 found in the previous sections, calculate the reconstruction residual MSE error when using LSI and NMF - they both should use the same k = 50. Which one is larger, ∥X − WH∥²_F in NMF or ∥X − UkΣkVkᵀ∥²_F in LSI, and why?
4 Classification Algorithms
In this part, you are asked to use the dimensionality-reduced training data from LSI with your choice of k to train (different types of) classifiers, and evaluate the trained classifiers with test data. Your task is to classify the documents into two classes (for now a binary classification task): sports versus climate.
Classification Measures: Classification quality can be evaluated using different measures
such as precision, recall, F-score, etc. Refer to the discussion material to find their definition.
Depending on the application, the true positive rate (TPR) and the false positive rate (FPR) have different levels of significance. In order to characterize the trade-off between the two quantities, we plot the receiver operating characteristic (ROC) curve. For binary classification, the curve is created by plotting the true positive rate against the false positive rate at various threshold settings on the probabilities assigned to each class (let us assume probability p for class 0 and 1 − p for class 1). In particular, a threshold t is applied to the value of p to select between the two classes. The value of the threshold t is swept from 0 to 1, and a pair of (TPR, FPR) values is obtained for each value of t. The ROC is the curve of TPR plotted against FPR.
Support Vector Machines (SVM): Linear Support Vector Machines are efficient when dealing with sparse, high-dimensional datasets, including textual data. They have been shown to have good generalization and test accuracy, while having low computational complexity.
These models learn a vector of feature weights, w, and an intercept, b, given the training dataset. Once the weights are learned, the label of a data point is determined by thresholding wᵀx + b with 0, i.e. sign(wᵀx + b). Alternatively, one can produce the probability that the data point belongs to either class by applying a logistic function instead of hard thresholding, i.e. calculating σ(wᵀx + b).
The learning process of the parameters w and b involves solving the following optimization problem:
  min_{w,b}  (1/2) ∥w∥²_2 + γ Σ_{i=1}^{n} ξi
  s.t.       yi(wᵀxi + b) ≥ 1 − ξi,
             ξi ≥ 0,   ∀i ∈ {1, . . . , n}

where xi is the ith data point, and yi ∈ {−1, 1} is its class label.
Minimizing the sum of the slack variables corresponds to minimizing the loss function on the
training data. On the other hand, minimizing the first term, which is basically a regularization
term, corresponds to maximizing the margin between the two classes. Note that in the objective
function, each slack variable represents the amount of error that the classifier can tolerate
for a given data sample. The trade-off parameter γ controls the relative importance of the two
components of the objective function. For instance, when γ ≫ 1, misclassification of individual
points is highly penalized. This is called “Hard Margin SVM”. In contrast, a “Soft Margin
SVM”, which is the case when γ ≪ 1, is very lenient towards misclassification of a few individual
points as long as most data points are well separated.
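A minimal sketch of training and evaluating one such SVM (the trade-off parameter γ above corresponds to sklearn's C; X_train_lsi/X_test_lsi are the LSI features from the previous section, and y_train/y_test are assumed to be 0/1-encoded root labels):

import matplotlib.pyplot as plt
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, accuracy_score, roc_curve

svm_hard = LinearSVC(C=1000).fit(X_train_lsi, y_train)          # "hard margin"
y_pred = svm_hard.predict(X_test_lsi)
print(confusion_matrix(y_test, y_pred), accuracy_score(y_test, y_pred))

scores = svm_hard.decision_function(X_test_lsi)                 # signed distances used for thresholding
fpr, tpr, _ = roc_curve(y_test, scores)
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate")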
QUESTION 5: Compare and contrast hard-margin and soft-margin linear SVMs:
• Train two linear SVMs:
– Train one SVM with γ = 1000 (hard margin), another with γ = 0.0001 (soft margin).
– Plot the ROC curve, report the confusion matrix and calculate the accuracy, recall,
precision and F-1 score of both SVM classifiers on the testing set. Which one performs
better? What about for γ = 100000?
– What happens for the soft margin SVM? Why is this the case? Analyze in terms of the confusion matrix.
∗ Does the ROC curve reflect the performance of the soft-margin SVM? Why?
• Use cross-validation to choose γ (use average validation³ accuracy to compare): Using 5-fold cross-validation, find the best value of the parameter γ in the range {10^k | −3 ≤ k ≤ 6, k ∈ Z}. Again, plot the ROC curve and report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of this best SVM (a cross-validation sketch follows below).
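A minimal cross-validation sketch for this sweep (same assumptions as above; cross_val_score uses only the training data):

from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

best_gamma, best_acc = None, -1.0
for k in range(-3, 7):                       # gamma = 10^k for -3 <= k <= 6
    acc = cross_val_score(LinearSVC(C=10.0 ** k), X_train_lsi, y_train,
                          cv=5, scoring="accuracy").mean()
    if acc > best_acc:
        best_gamma, best_acc = 10.0 ** k, acc
print(best_gamma, best_acc)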
Logistic Regression: Logistic regression is a probability model that can be used for binary classification. In logistic regression, a logistic function (σ(ϕ) = 1/(1 + exp(−ϕ))) acting on a linear function of the features (ϕ(x) = wᵀx + b) is used to calculate the probability that a data point belongs to a particular binary class, and during the training process, parameters w and b that maximize the likelihood of predicting the correct labels on the training data are learned.
One can also add a regularization term to the objective function, so that the goal of the training process is not only to maximize the likelihood, but also to minimize the regularization term, which is often some norm of the parameter vector w. Adding regularization helps prevent ill-conditioned results and over-fitting, and facilitates generalization. A coefficient is used to control the trade-off between maximizing the likelihood and minimizing the regularization term.
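A minimal sketch of the three variants Question 6 asks about (a very large C approximates "no regularization" in sklearn, since C is the inverse regularization strength; the data variables are the same assumptions as before):

from sklearn.linear_model import LogisticRegression

logreg_none = LogisticRegression(C=1e10, max_iter=1000).fit(X_train_lsi, y_train)   # ~ no regularization
logreg_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X_train_lsi, y_train)
logreg_l2 = LogisticRegression(penalty="l2", C=1.0).fit(X_train_lsi, y_train)
print(logreg_none.score(X_test_lsi, y_test), logreg_l1.score(X_test_lsi, y_test),
      logreg_l2.score(X_test_lsi, y_test))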
QUESTION 6: Evaluate a logistic classifier:
• Train a logistic classifier without regularization (you may need to come up with some way to
approximate this if you use sklearn.linear model.LogisticRegression); plot the ROC
curve and report the confusion matrix and calculate the accuracy, recall precision and F-1
score of this classifier on the testing set.
• Find the optimal regularization coefficient:
– Using 5-fold cross-validation on the dimension-reduced-by-SVD training data, find the optimal regularization strength in the range {10^k | −5 ≤ k ≤ 5, k ∈ Z} for logistic regression with L1 regularization and logistic regression with L2 regularization, respectively.
– Compare the performance (accuracy, precision, recall and F-1 score) of 3 logistic classifiers: w/o regularization, w/ L1 regularization and w/ L2 regularization (with the best
parameters you found from the part above), using test data.
– How does the regularization parameter affect the test error? How are the learnt coefficients affected? Why might one be interested in each type of regularization?
– Both logistic regression and linear SVM are trying to classify data points using a linear
decision boundary. What is the difference between their ways to find this boundary? Why
do their performances differ? Is this difference statistically significant?
Naïve Bayes Model: Scikit-learn provides a suite of Naïve Bayesian classifiers including MultinomialNB, BernoulliNB, and GaussianNB. Naïve Bayes classifiers use the assumption that features are statistically independent of each other when conditioned on the class the data point belongs to, in order to simplify the calculation for the Maximum a posteriori (MAP)
estimation of the labels. That is,

  P(xi | y, x1, . . . , xi−1, xi+1, . . . , xm) = P(xi | y),   i ∈ {1, . . . , m}

where the xi's are features and y is the label of the data point. MultinomialNB, BernoulliNB, and GaussianNB use different underlying probability models.
³Validation is done on a subset of the previously obtained training data; NEVER on the testing data. k-fold CV conventionally means that the k folds are chosen without replacement, so each data point has a chance to be in the validation set. The same train_test_split function used to splice the larger dataset can be used to splice the training dataset into a training set and a small validation set for each fold.
QUESTION 7: Evaluate and profile a Naïve Bayes classifier: Train a GaussianNB classifier; plot the ROC curve and report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of this classifier on the testing set.
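A minimal sketch of the kind of code Question 7 asks for (same assumed variables as in the previous sketches; the LSI features are dense, which is what GaussianNB expects):

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB().fit(X_train_lsi, y_train)
y_pred = gnb.predict(X_test_lsi)
y_prob = gnb.predict_proba(X_test_lsi)[:, 1]   # probability of the class gnb.classes_[1], for the ROC curve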
Grid Search of Parameters
Now we have gone through the complete process of training and testing individual parts of an
end-to-end pipeline that performs binary classification.
Optimizing the hyperparameters for individual parts, however, does not imply that serially combining these optimized parts creates the optimal end-to-end pipeline. In this section, we combine everything we have done in previous sections into a GridSearch task.
In a "grid search", each hyperparameter to be tuned is placed along one axis of a grid, and the validation accuracy is compared for each combination of hyperparameters - essentially filling out the grid.
Note: This part of the project takes a long time to run because the number of options at each
stage of the classification pipeline increases the number of possibilities combinatorially. Please
give yourself and your computer time to complete this task.
QUESTION 8: In this part, you will attempt to find the best model for binary classification.
• Construct a Pipeline that performs feature extraction, dimensionality reduction and classification;
• The evaluation of each combination is performed with 5-fold cross-validation (use the average
validation set accuracy across folds).
• In addition to any other hyperparameters you choose, your gridsearch must at least include:
Table 1: Minimum set of hyperparameters to consider for pipeline comparison

Module                     Options
Loading Data               Clean the data
Feature Extraction         min_df = 3 vs 5 while constructing the vocabulary; AND
                           use Lemmatization vs Stemming as a compression module
Dimensionality Reduction   LSI (k = [5, 30, 80]) vs NMF (k = [5, 30, 80])
Classifier                 SVM with the best γ previously found
                           vs
                           Logistic Regression: L1 regularization vs L2 regularization,
                           with the best regularization strength previously found
                           vs
                           GaussianNB
                           (Note: You can once again find the optimal hyperparameters
                           for each classifier, but this is not required.)
Other options              Use default
• What are the 5 best combinations? Report their performances on the testing set.
Hint: see these links for useful libraries: Examples and some more examples.
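A skeleton of such a grid search is sketched below, under the assumption that train_text, y_train, lemmatize_tokens and the best hyperparameters from earlier questions are available; the grid entries shown are illustrative and should be extended to cover Table 1:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD, NMF
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import GaussianNB

pipe = Pipeline([
    ("vect", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("reduce", TruncatedSVD(random_state=42)),
    ("clf", LinearSVC()),
])

param_grid = {
    "vect__min_df": [3, 5],
    # lemmatization vs stemming can be compared by swapping "vect__tokenizer" callables here
    "reduce": [TruncatedSVD(random_state=42), NMF(random_state=42)],
    "reduce__n_components": [5, 30, 80],
    "clf": [LinearSVC(C=1000), GaussianNB()],   # substitute the best classifiers/settings found earlier
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
search.fit(train_text, y_train)
print(search.best_params_, search.best_score_)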
Multiclass Classification
We have so far been dealing with classifying the data points into two classes. In this part, we
explore multiclass classification techniques through different algorithms. You will need to use the column of labels marked leaf_label for this section.
Some classifiers perform multiclass classification inherently. As such, the naïve Bayes algorithm finds the class with the maximum likelihood given the data, regardless of the number of classes. In fact, if the probability of each class label is computed in the usual way, then the class with the highest probability is picked; that is,

  ĉ = arg max_{c ∈ C} P(c | x)

where c denotes a class to be chosen, and ĉ denotes the optimal class.
For SVM, however, one needs to extend the binary classification techniques when there are multiple classes. A natural way to do so is to perform a one-versus-one classification on all (|C| choose 2) pairs of classes, and, given a document, the class is assigned by majority vote. In case there is more than one class with the highest vote, the class with the highest total classification confidence level across the binary classifiers is picked.
An alternative strategy would be to fit one classifier per class, which reduces the number of
classifiers to be learned to |C|. For each classifier, the class is fitted against the union of all
the other classes. Note that in this case, the unbalanced number of documents in each class
should be handled. By learning a single classifier for each class, one can get insights on the
interpretation of the classes based on the features.
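Both strategies are available in sklearn; a minimal sketch (y_train_leaf/y_test_leaf are the assumed leaf_label targets, and class_weight="balanced" is one possible way to address the class imbalance in the one-vs-rest setting):

from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

ovo = OneVsOneClassifier(LinearSVC()).fit(X_train_lsi, y_train_leaf)
ovr = OneVsRestClassifier(LinearSVC(class_weight="balanced")).fit(X_train_lsi, y_train_leaf)
print(ovo.score(X_test_lsi, y_test_leaf), ovr.score(X_test_lsi, y_test_leaf))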
QUESTION 9: In this part, we aim to learn classifiers on the documents belonging to unique classes in the column leaf_label.
Perform Naïve Bayes classification and multiclass SVM classification (with both One VS One and One VS the rest methods described above) and report the confusion matrix and calculate the accuracy, recall, precision and F-1 score of your classifiers. How did you resolve the class imbalance issue in the One VS the rest model?
In addition, answer the following questions:
• In the confusion matrix you should have a 9 × 9 matrix, where 9 is the number of unique labels in the column leaf_label. Please make sure that the order of these labels is as follows:

map_row_to_class = {0:"chess", 1:"cricket", 2:"hockey", 3:"soccer",
    4:"football", 5:"%22forest%20fire%22", 6:"flood", 7:"earthquake",
    8:"drought"}
Do you observe any structure in the confusion matrix? Are there distinct visible blocks on the
major diagonal? What does this mean?
• Based on your observation from the previous part, suggest a subset of labels that should be
merged into a new larger label and recompute the accuracy and plot the confusion matrix.
How did the accuracy change in One VS One and One VS the rest?
• Does class imbalance impact the performance of the classification once some classes are merged? Provide a resolution for the class imbalance, recompute the accuracy, and plot the confusion matrix for One VS One and One VS the rest.
Word Embedding
In the previous parts of this project we relied on the TF-IDF vector representation of text for our feature extraction. There are two disadvantages of such a representation:
1. TF-IDF vectors are not learned from a larger, universal set of documents: the fixed vocabulary is generated from the training corpus, so the representation generation process does not transfer well to other tasks or datasets. As humans, we do not learn to identify critical features of a document from a limited set of documents; rather, we have a more general corpus from which we model our feature extractors and apply them to a specific document.
2. TF-IDF does not attempt to condition the representation of a word with respect to its
context (words around it) within a document. It calculates the scores based on relative
occurrences of words in the text.
In this section we extract word features using a framework called GLoVE: Global Vectors for
Word Representation. These features are trained on a much larger text corpus to take into
account the broader contexts of words while forming word representations, making the final
vectors informative and useful in a downstream task without any supervision or knowledge of
the data used in the task.
A word representation is most often a vector of fixed dimension that captures the meaning of
a target word by relating its numerical vector to other vectors in an estimated vector space.
GLoVE, in particular, measures the similarity between a pair of words by the Euclidean distance between their representations in the vector space of representations.
QUESTION 10: Read the paper about GLoVE embeddings - found here and answer the following
subquestions:
(a) Why are GLoVE embeddings trained on the ratio of co-occurrence probabilities rather than
the probabilities themselves?
(b) In the two sentences: “James is running in the park.” and “James is running for the
presidency.”, would GLoVE embeddings return the same vector for the word running in both
cases? Why or why not?
(c) What do you expect for the values of,
||GLoVE["queen"] - GLoVE["king"] - GLoVE["wife"] + GLoVE["husband"]||2,
||GLoVE["queen"] - GLoVE["king"]||2 and ||GLoVE["wife"] - GLoVE["husband"]||2 ?
Compare these values.
(d) Given a word, would you rather stem or lemmatize the word before mapping it to its GLoVE
embedding?
Please download the glove.6B.zip embedding file from https://nlp.stanford.edu/projects/glove/. Note that the zip file contains 4 different versions of GLoVE. Use glove.6B.300d.txt for this project. The code below shows how to use the GLoVE embeddings:
embeddings_dict = {}
dimension_of_glove = 300
with open("glove/glove.6B.300d.txt", 'r') as f:  # if 'r' fails with a unicode error, please use 'rb'
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector
A GLoVE embedding is a representation for each word, while a TF-IDF vector is computed per text segment, and a text segment consists of many words. Simple concatenation of all of the embeddings of the words in a segment would result in high-dimensional vector representations of varying sizes. Therefore, we need to design a feature construction procedure to make use of the word embeddings. The goals include:
1. Representations of text segments of varying lengths - we plotted the histogram in Question 1 - should be mapped into the same dimension to facilitate downstream applications;
2. The representation should be a sufficient description of the content of a document, so as to support the classification.
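One of many possible aggregation schemes consistent with these goals is to average the per-word GLoVE vectors of a document and normalize the result; a sketch (embeddings_dict comes from the loading code above, and the simple whitespace split is a placeholder for your own tokenization):

import numpy as np

def glove_doc_vector(text, embeddings_dict, dim=300):
    words = text.lower().split()
    vecs = [embeddings_dict[w] for w in words if w in embeddings_dict]
    if not vecs:
        return np.zeros(dim)
    v = np.mean(vecs, axis=0)                   # average the word vectors
    return v / (np.linalg.norm(v) + 1e-8)       # normalize to unit length

X_train_glove = np.vstack([glove_doc_vector(t, embeddings_dict) for t in train["full_text"]])
X_test_glove = np.vstack([glove_doc_vector(t, embeddings_dict) for t in test["full_text"]])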
QUESTION 11: For the binary classification task distinguishing the “sports” class and “climate”
class:
(a) Describe a feature engineering process that uses GLoVE word embeddings to represent each
document. You have to abide by the following rules:
• A representation of a text segment needs to have a vector dimension that CANNOT
exceed the dimension of the GLoVE embedding used per word of the segment.
• You cannot use TF-IDF scores (or any measure that requires looking at the complete
dataset) as a pre-processing routine.
• Important: In this section, feel free to use raw features from any column in the original data file, not just full_text. The column keywords might be useful... or not.
• To aggregate these word embeddings into a single vector, consider normalizing the vectors and averaging across them.
(b) Select a classifier model, train and evaluate it with your GLoVE-based feature. If you are doing
any cross-validation, please make sure to use a limited set of options so that your code finishes
running in a reasonable amount of time.
If your model is performing poorly, try to improve your feature engineering pipeline and also
revisit your hyperparameters.
QUESTION 12: Plot the relationship between the dimension of the pre-trained GLoVE embedding
and the resulting accuracy of the model in the classification task. Describe the observed trend. Is
this trend expected? Why or why not? In this part use the different sets of GLoVE vectors from the
link.
Visualize the set of normalized GLoVE-based embeddings of the documents, with their binary labels, in a 2D plane using the UMAP library. Similarly, generate a set of normalized random vectors of the same dimension as GLoVE and visualize these in a 2D plane with UMAP.
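A minimal UMAP sketch (the umap-learn package provides the umap module; X_train_glove and the 0/1-encoded labels y_train are assumptions carried over from the earlier sketches):

import numpy as np
import matplotlib.pyplot as plt
import umap    # from the umap-learn package

proj = umap.UMAP(n_components=2, random_state=42).fit_transform(X_train_glove)
plt.scatter(proj[:, 0], proj[:, 1], c=y_train, s=5)

# repeat with normalized random vectors of the same shape for comparison
rand = np.random.normal(size=X_train_glove.shape)
rand /= np.linalg.norm(rand, axis=1, keepdims=True)
proj_rand = umap.UMAP(n_components=2, random_state=42).fit_transform(rand)
plt.figure()
plt.scatter(proj_rand[:, 0], proj_rand[:, 1], c=y_train, s=5)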
QUESTION 13: Compare and contrast the two visualizations. Are there clusters formed in either
or both of the plots? We will pursue the clustering aspect further in the next project.
Submission
Your submission should be made to both of the two places: BruinLearn and Gradescope within
BruinLearn.
BruinLearn Please submit a zip file containing your report, and your code with a readme file on how to run it, to BruinLearn. The zip file should be named
“Project1 UID1 UID2 ... UIDn.zip”
where UIDx’s are student ID numbers of the team members. Only one submission per
team is required. If you have any questions, please ask on Piazza or through email.
Piazza is preferred.
Gradescope Please submit your report to Gradescope as well. Please specify your group
members in Gradescope. It is very important that you assign each part of your report to
the question number provided in the Gradescope template.
FAQ
• My NMF objective function is not converging to a minimum fast enough / it is timing out. Help!: If the iterative algorithm is taking too long to converge, please increase the default tolerance parameter and/or reduce the maximum number of iterations that the algorithm can run for.
• My GridSearch is taking too long. Help!: If the GridSearch takes too long, you are welcome to use a randomized search instead, which samples the parameters in the grid space instead of running the experiments exhaustively. Note that we are not responsible for the correctness of the randomized search; i.e., if the number of sampled settings is too low, you are not likely to get a set of parameters that are truly optimal.
• In Q9, how many classes should we merge? Scan your confusion matrices from the
previous sections to see if any pair of classes stand out as being similar. Of course the
combined label needs to make semantic sense as well; there is no use in combining all the
classes into 1 for example, and claiming that the performance of your classifier is optimal.
I recommend finding a pair of labels within a binary class that can be merged.
• I am finding it difficult to run experiments on my local machine. You are highly
encouraged to learn to use Google Colab for your projects in this course. While Project 1 is
doable on your local machine, GridSearch is expensive and will take on average 4 hours to
complete (anywhere from 1/2 hour to 11 hours have been reported). Furthermore, future
projects will occasionally use Neural Networks which are optimized to run on GPUs.
These are provided for free on Google Colab.