CPSC 477
HW3
Problem Grade
QUESTION 1 (Language Modeling) /10
QUESTION 2 (Recurrent Neural Networks) /10
QUESTION 3 (Morphology) /10
QUESTION 4 (Text Similarity) /10
QUESTION 5 (Dimensionality Reduction and SVD) /10
QUESTION 6 (Naïve Bayes Classifier) /10
QUESTION 7 (Training Neural Networks) /10
QUESTION 8 (Word Embeddings) /10
TOTAL /80
Submission Instructions
1. Submit each solution on a separate sheet.
2. Include your name, netID and problem number on each page.
QUESTION 1 (Language Modeling)
You are in a noisy coffee shop, diligently studying for your midterm, and your friend is
trying to get your attention, using only a two-word vocabulary. She has said a sentence
but you couldn’t hear one of the words:
(w1 = 'hi', w2 = 'yo', w3 = ???, w4 = 'yo')
Assume that your friend was generating words from this first-order Markov model:
p('hi' | 'hi') = 0.7    p('yo' | 'hi') = 0.3
p('hi' | 'yo') = 0.5    p('yo' | 'yo') = 0.5
Given these parameters, what is the posterior probability of whether the missing word is
‘hi’ or ‘yo’? Show your work.
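To make the required computation concrete, the sketch below shows the general Bayes-rule calculation for a missing middle word in a first-order Markov chain, in Python. The transition probabilities in the table are arbitrary placeholders, not the values given above; substitute the problem's numbers when you work out your answer.

# Posterior over a missing middle word w3, given its neighbors w2 and w4, in a
# first-order Markov chain: p(w3 = v | w2, w4) is proportional to p(v | w2) * p(w4 | v).
# The transition probabilities below are PLACEHOLDER values, not the ones in the problem.
trans = {
    ('hi', 'hi'): 0.6, ('hi', 'yo'): 0.4,   # p(next | prev = 'hi')
    ('yo', 'hi'): 0.2, ('yo', 'yo'): 0.8,   # p(next | prev = 'yo')
}

def posterior_missing(w2, w4, vocab=('hi', 'yo')):
    # Unnormalized score for each candidate value of the missing word, then normalize.
    scores = {v: trans[(w2, v)] * trans[(v, w4)] for v in vocab}
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()}

print(posterior_missing('yo', 'yo'))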
QUESTION 2 (Recurrent Neural Networks)
Your goal is to set up weights for a simple recurrent network (SRN, also known as an
Elman network) to solve two simple sequence tasks. The network reads in a sequence
x1, ..., xt and produces outputs y1, ..., yt. We consider the final output yt to be the
answer given by the network.
We will assume a threshold activation function:
f(x) = 1 if x > 0; otherwise f(x) = 0.
The network has one simple recurrent layer followed by a linear layer (shown in the
diagram below). The dimension of both the input to the network and its output should be 1.
1. Construct an SRN with 1 hidden unit that reads a sequence of 1s and 0s and
outputs 1 if any of 𝑥1, .., 𝑥𝑡 had the value 1. Justify the correctness of your network.
2. Construct an SRN with 2 hidden units to solve the parity task (read in a sequence
of 1s and 0s and return the sum mod 2, shown in the below table). Justify the
correctness of your network.
𝑥 𝑦
0 0
1 1
010 1
0110 0
10101 1
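If you want to sanity-check a proposed construction, the short Python sketch below simulates an SRN of exactly this form (threshold activation in the recurrent layer, linear output layer). The weight values at the bottom are arbitrary placeholders, not a solution to either task.

import numpy as np

def threshold(z):
    # f(z) = 1 where z > 0, else 0, applied componentwise.
    return (z > 0).astype(float)

def run_srn(xs, W_xh, W_hh, b_h, W_hy, b_y, h0):
    # h_t = f(W_xh * x_t + W_hh h_{t-1} + b_h);  y_t = W_hy h_t + b_y.
    h = np.array(h0, dtype=float)
    y = None
    for x in xs:
        h = threshold(W_xh * x + W_hh @ h + b_h)
        y = float(W_hy @ h + b_y)
    return y  # final output y_t

# Placeholder weights for a 1-hidden-unit SRN (NOT a solution to either task).
W_xh = np.array([0.5]); W_hh = np.array([[-0.25]]); b_h = np.array([0.1])
W_hy = np.array([2.0]); b_y = -1.0
print(run_srn([0, 1, 1, 0], W_xh, W_hh, b_h, W_hy, b_y, h0=[0.0]))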
QUESTION 3 (Morphology)
Refer to the slides for lecture #143 for background on this question.
1. Consider the English verb drink. Give two inflected forms of drink, and two
derivational forms of drink.
2. Identify all the morphemes in infallibilities. For each morpheme, say whether it is a
prefix, suffix, or root.
3. Consider the word unnegotiable. Draw and label two morphological trees for this
word, one which corresponds to the meaning ‘not able to be negotiated’, and
another which corresponds to ‘able to be unnegotiated’.
QUESTION 4 (Text Similarity)
We first need to introduce a function that compares the similarity of two sentences.
Recall the definition for the cosine distance between two vectors 𝑥 and 𝑦:
cosine(x, y) = ( Σ_{i=1}^{n} x_i y_i ) / ( sqrt(Σ_{i=1}^{n} x_i^2) · sqrt(Σ_{i=1}^{n} y_i^2) )
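As a quick reference, a direct implementation of this formula might look like the following minimal sketch (it assumes the two vectors have the same length n and are not both zero):

import math

def cosine(x, y):
    # cosine(x, y) = sum_i x_i*y_i / (sqrt(sum_i x_i^2) * sqrt(sum_i y_i^2))
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

print(cosine([1, 0, 2], [2, 1, 1]))  # toy example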
Part A: How would you define a mapping from sentences s to vectors 𝑓(𝑠) such that the
cosine distance 𝑐𝑜𝑠𝑖𝑛𝑒(𝑓(𝑠1), 𝑓(𝑠2)) is a measure of the similarity between two
sentences 𝑠1 and 𝑠2? We will give full marks for any reasonable mapping.
Part B: Now, we will define a dynamic programming algorithm that computes sentence
alignment of two translated documents. The alignment score is the sum of the similarity
scores of the aligned sentences. Our goal is to find an alignment with the highest score.
We will consider alignments of the following form:
● a sentence can be aligned to an empty sentence. This happens when one of the
translators omits a sentence.
● a sentence can be aligned to exactly one sentence.
● a sentence can be aligned to two sentences. This happens when one of the
translators either breaks or joins sentences.
Our sentence alignment algorithm recursively computes the alignment matrix 𝐹 indexed
by 𝑖 and 𝑗, one index for each sentence. The value stored in 𝐹(𝑖,𝑗) is the score of the
best alignment between the first i sentences of x and the first j sentences of y. s(x_i, y_j)
is the similarity between sentence x_i and sentence y_j.
Based on the above information, please:
1. Define F(0, 0), F(i, 0), and F(0, j).
2. Define F(i, j) for i > 0 and j > 0.
Part C: Next, we will modify the alignment score to be the same as before, but to include
a fixed penalty p each time a sentence is aligned to an empty sentence. Here p is a
parameter with p ≥ 0, chosen by the user of the algorithm. Describe a modified dynamic
programming method that takes this new penalty into account.
QUESTION 5 (Dimensionality Reduction and SVD)
(a) Prove that the left singular vectors of a matrix A are the right singular vectors of A^T.
(b) Calculate the covariance matrix for the matrix 𝑀 shown below.
𝑀 = [
1 1 0 1
0 0 0 1
1 1 0 0
]
(c) Find the singular values of the matrix 𝑀
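If you want to check your hand computations for parts (b) and (c), a short numpy sketch such as the one below will do it. Note that the covariance call here treats each row of M as a variable and each column as an observation (numpy's default convention); adjust rowvar and the normalization to match whatever convention was used in lecture.

import numpy as np

M = np.array([[1, 1, 0, 1],
              [0, 0, 0, 1],
              [1, 1, 0, 0]], dtype=float)

# Covariance matrix (rows as variables, columns as observations, 1/n normalization).
print(np.cov(M, bias=True))

# Singular values of M, largest first.
print(np.linalg.svd(M, compute_uv=False))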
QUESTION 6 (Naïve Bayes Classifier)
Suppose you are given the following set of data with three Boolean input variables 𝑎, 𝑏,
and 𝑐, and a single Boolean output variable 𝐾.
𝑎 𝑏 𝑐 𝑲
1 0 1 0
1 1 1 1
0 1 1 0
1 1 0 0
1 0 1 0
0 0 0 1
0 0 0 1
0 0 1 0
(For parts (a) and (b), assume we are using a Naïve Bayes classifier to predict the
value of K from the values of the other variables.)
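As a reminder (a standard statement of the model, not part of the answer), the Naive Bayes assumption factorizes the posterior as

P(K | a, b, c) ∝ P(K) · P(a | K) · P(b | K) · P(c | K),

with each factor estimated from the counts in the table above.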
(a) According to the naive Bayes Classifier, what is 𝑝(𝐾 = 1|𝑎 = 1, 𝑏 = 1, 𝑐 = 0) ?
(b) According to the naive Bayes Classifier, what is 𝑝(𝐾 = 0|𝑎 = 1, 𝑏 = 1) ?
(c) Given 𝑃(𝑍|𝑋) = 0.7 and 𝑃(𝑍|𝑌) = 0.4.
Do you have enough information to compute 𝑃(𝑍|𝑋, 𝑌)? If not, write “not enough info”.
If so, compute the value of 𝑃(𝑍|𝑋, 𝑌) from the above information
(e) Given 𝑃(𝑍, 𝑋) = 0.2, 𝑃(𝑋) = 0.3, 𝑃(𝑌) = 1
Do you now have enough information to compute 𝑃(𝑍|𝑋, 𝑌)? If not, write “not enough
info”. If so, compute the value of 𝑃(𝑍|𝑋, 𝑌) from the above information.
QUESTION 7 (Training Neural Networks)
This question concerns training a three-layer neural network for a classification task. Let
H1 and H2 denote the numbers of hidden units in the two hidden layers, let D be the
dimension of the input X, and let K be the number of classes. The three-layer network
has the following form:
h1 = ReLU(W1 X + b1)
h2 = ReLU(W2 h1 + b2)
p = Softmax(W3 h2 + b3)
where the parameters of the network are W1 ∈ R^(H1×D), b1 ∈ R^(H1), W2 ∈ R^(H2×H1),
b2 ∈ R^(H2), W3 ∈ R^(K×H2), b3 ∈ R^(K).
𝑅𝑒𝐿𝑈 is the “rectified linear unit” 𝑅𝑒𝐿𝑈(𝑥) = 𝑚𝑎𝑥{0, 𝑥} applied componentwise, and
Softmax maps a K-vector to the probability simplex according to
Softmax(v)_k = exp(v_k) / Σ_{j=1}^{K} exp(v_j).
For a given input X, this specifies how to calculate the probabilities P(Y = k | X) for
k = 1, 2, ..., K.
For a given training set {(X_i, Y_i)}, i = 1, ..., n, consider the loss function:
L = (1/n) Σ_{i=1}^{n} [ − log p(Y = Y_i | X_i) ] + (λ/2) ( ||W1||^2 + ||W2||^2 + ||W3||^2 )
where ||A||^2 means the sum of the squares of the entries of the matrix A.
Give formulas for the derivatives ∂L/∂Wj and ∂L/∂bj, for each layer j = 1, 2, 3 of the
network.
Hints:
i. You should ignore the fact that ReLU(x) is not differentiable at x = 0.
ii. You might try to first do the calculations with 𝑅𝑒𝐿𝑈() replaced by the identity function.
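The small numpy sketch below implements the forward pass and the loss exactly as defined above; it is not part of the required answer, but you can use it to check your derivative formulas numerically with finite differences. The dimensions and data are arbitrary placeholders.

import numpy as np

def forward_loss(X, Y, params, lam):
    # X: (n, D) inputs; Y: (n,) integer labels in {0, ..., K-1}; lam is lambda.
    W1, b1, W2, b2, W3, b3 = params
    h1 = np.maximum(0, X @ W1.T + b1)            # ReLU(W1 X + b1), row-wise
    h2 = np.maximum(0, h1 @ W2.T + b2)           # ReLU(W2 h1 + b2)
    logits = h2 @ W3.T + b3                      # W3 h2 + b3
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    nll = -np.log(p[np.arange(len(Y)), Y]).mean()
    reg = 0.5 * lam * sum((W ** 2).sum() for W in (W1, W2, W3))
    return nll + reg

# Placeholder dimensions and random data, just to exercise the function.
rng = np.random.default_rng(0)
n, D, H1, H2, K = 5, 4, 3, 3, 2
params = (rng.normal(size=(H1, D)), np.zeros(H1),
          rng.normal(size=(H2, H1)), np.zeros(H2),
          rng.normal(size=(K, H2)), np.zeros(K))
X = rng.normal(size=(n, D))
Y = rng.integers(0, K, size=n)
print(forward_loss(X, Y, params, lam=0.1))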
QUESTION 8 (Word Embeddings)
Word2Vec represents a family of word-embedding algorithms that are commonly used in
a variety of contexts.
(a) Let’s consider a recommender system for an online music-streaming service. For a
set of songs, the service records how frequently users play them together
(e.g., song X is commonly played together with song Y). Explain how you would use
ideas similar to Word2Vec to recommend the next song to a user who has played
some songs from this list.
(b) Although pre-trained word vectors work quite well in many applications, it is
sometimes better to "re-train" (i.e., continue to learn) the word vectors as parameters
of our neural network. Explain why retraining the word vectors may hurt the model
performance if the dataset for the specific task is too small.
(c) Give two examples of how we can evaluate word-embedding vectors.
(d) A simple way to produce word embeddings for a vocabulary of words is to create one-hot vectors for each word - a vector with 0s everywhere and only a single 1 in the
position corresponding to the word’s index in the vocabulary. Give at least two reasons
why word2vec-produced word embeddings are better than these one-hot vectors.