
CS4803/7643: Deep Learning

Problem Set 3

Instructions
1. We will be using Gradescope to collect your assignments. Please read the following instructions
for submitting to Gradescope carefully!
• Each subproblem must be submitted on a separate page.
• When submitting to Gradescope, make sure to mark which page(s) correspond to each problem/sub-problem. ALSO remember to append your notebook PDFs to the solutions for the problem set, as described in the instructions!
• For the coding problem, please use the provided collect_submission.sh script and upload
cs7643_hw3.zip to the HW3 Code assignment on Gradescope. While we will not be
explicitly grading your code, you are still required to submit it. Please make sure you
have saved the most recent version of your jupyter notebook before running this script.
• Note: This is a large class and Gradescope’s assignment segmentation features are essential. Failure to follow these instructions may result in parts of your assignment not
being graded. We will not entertain regrading requests for failure to follow instructions.
Please read https://stats200.stanford.edu/gradescope_tips.pdf for additional information on submitting to Gradescope.
2. LaTeX'd solutions are strongly encouraged (solution template available at
cc.gatech.edu/classes/AY2020/cs7643_fall/assets/sol2.tex), but scanned handwritten copies
are acceptable. Hard copies are not accepted.

1 Recurrent Neural Network
1. [Vanilla RNN for parity function: 4 points]
Let us define a sequence parity function as a function that takes in a sequence of binary inputs and returns a sequence indicating the parity of the number of 1's seen so far; specifically, at time $t$ it returns 1 if the number of 1's in the input so far is odd, and 0 if it is even. For example, given the input sequence [0, 1, 0, 1, 1, 0], the parity sequence is [0, 1, 1, 0, 1, 1]. Let $x_t$ denote the input at time $t$ and $y_t$ the boolean output of this parity function at time $t$.
Design a vanilla recurrent neural network to implement the parity function. Your implementation should include the equations of your hidden units and the details of the activations for each input $x_t$ (0 or 1). (A plain-Python sketch of the target parity sequence follows this question, for reference.)
Hint: Recall that we have implemented AND and OR functions in problem set 2, which may
be useful here.
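To make the target behavior concrete, here is a minimal plain-Python sketch (not an RNN, and not part of the assignment) of the parity sequence the network is expected to reproduce; the function name `parity_sequence` is an illustrative choice.

```python
def parity_sequence(bits):
    """Return the running-parity sequence: y_t = 1 iff the number of 1's
    seen in bits[0..t] is odd (i.e., the XOR of all bits so far)."""
    out, parity = [], 0
    for b in bits:
        parity ^= b          # running XOR tracks the odd/even count of 1's
        out.append(parity)
    return out

# Example from the problem statement:
assert parity_sequence([0, 1, 0, 1, 1, 0]) == [0, 1, 1, 0, 1, 1]
```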
2. [LSTM for parity function: 4 points]
Let us recall the different gates in an LSTM network. The first gate is the "forget gate layer":
\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{1} \]
where $f_t$ is the output of the forget gate, $W_f$ is the weight matrix, $h_{t-1}$ is the hidden state at step $t-1$, $x_t$ is the current input, and $b_f$ is the bias value.
Next we have the "input gate layer":
\[ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{2} \]
\[ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{3} \]
where $i_t$ decides which values we will update and $\tilde{C}_t$ are the new candidate values that could be added to the cell state. Next we have the new cell state:
\[ C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \tag{4} \]
Finally, we have the output gate and the hidden state:
\[ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{5} \]
\[ h_t = o_t * \tanh(C_t) \tag{6} \]
Design an LSTM network for the bit parity problem mentioned in Question 1. Specifically, provide values for $W_f$, $b_f$, $W_i$, $b_i$, $W_C$, $b_C$, $W_o$ and $b_o$ such that the cell state $C_t$ stores the parity of the bit string (a small numerical sketch of one LSTM step is given after this question). Please mention any assumptions you make. For this problem, you may assume the following thresholded forms of the sigmoid and tanh functions:
\[ \sigma(x) = \begin{cases} 1, & \text{if } x \ge 0 \\ 0, & \text{otherwise} \end{cases} \tag{7} \]
\[ \tanh(x) = \begin{cases} 1, & \text{if } x > 0 \\ 0, & \text{if } x = 0 \\ -1, & \text{if } x < 0 \end{cases} \tag{8} \]
Hint: Recall that the XOR of $x$ and $y$ can be represented as $(x \wedge \bar{y}) \vee (\bar{x} \wedge y)$. Think about how you can leverage this information for equation (4).
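For intuition (and to sanity-check candidate weight values), here is a minimal NumPy sketch of a single LSTM step following equations (1)-(6), using the thresholded sigmoid and tanh from (7)-(8). It is not provided assignment code; the helper names (`hard_sigmoid`, `hard_tanh`, `lstm_step`) are illustrative.

```python
import numpy as np

def hard_sigmoid(x):
    # Thresholded sigmoid from equation (7): 1 if x >= 0, else 0.
    return (x >= 0).astype(float)

def hard_tanh(x):
    # Thresholded tanh from equation (8): sign(x), i.e. one of {-1, 0, 1}.
    return np.sign(x)

def lstm_step(x_t, h_prev, C_prev, Wf, bf, Wi, bi, WC, bC, Wo, bo):
    """One LSTM step following equations (1)-(6); each W* multiplies the
    concatenation [h_{t-1}, x_t]."""
    z = np.concatenate([h_prev, x_t])
    f_t     = hard_sigmoid(Wf @ z + bf)       # (1) forget gate
    i_t     = hard_sigmoid(Wi @ z + bi)       # (2) input gate
    C_tilde = hard_tanh(WC @ z + bC)          # (3) candidate cell values
    C_t     = f_t * C_prev + i_t * C_tilde    # (4) new cell state
    o_t     = hard_sigmoid(Wo @ z + bo)       # (5) output gate
    h_t     = o_t * hard_tanh(C_t)            # (6) new hidden state
    return h_t, C_t
```

Feeding a bit string through `lstm_step` one bit at a time lets you check whether your chosen weights indeed keep the parity in $C_t$.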
3. [When to stop in beam search?: 4 points]
Beam search is a technique widely used to improve the output text quality of neural text generation. But it is difficult to decide when to stop beam search so as to obtain optimality, because hypotheses can finish at different steps. Assume an input $x$ from which we generate an output sentence $y$ that is a completed hypothesis:
\[ y^{*} = \operatorname*{argmax}_{y:\,comp(y)} \ \prod_{i \le |y|} p(y_i \mid x, y_{<i}) \tag{9} \]
where $y_{<i}$ is a shorthand notation for $y_0 y_1 \ldots y_{i-1}$ and $comp(y)$ is shorthand for a completed hypothesis. We say that a hypothesis $y$ is completed ($comp(y)$) if its last word is </s>, i.e.,
\[ comp(y) \overset{\text{def}}{=} (y_{|y|} = \text{</s>}) \]
in which case it will not be further expanded.
Given that $B_{i-1}$ is an ordered list of length $b$, consisting of pairs $\langle y, s \rangle$ of a current prefix $y$ and its accumulated score (product of probabilities) $s$, we use beam search to approximately find the best output $y^{*}$. Beam search can be defined as an operator $\mathrm{top}^{b}$ that selects the top $b$ scoring items:
\[ B_i = \operatorname*{top}^{b} \left\{ \langle y' \circ y_i,\; s \cdot p(y_i \mid x, y') \rangle \;\middle|\; \langle y', s \rangle \in B_{i-1} \right\} \tag{10} \]
Let us define a beam search algorithm that will always return the optimal-score completed hypothesis (modulo beam size) and finish as soon as optimality is reached. Let $\mathrm{best}_{\le i}$ be the best completed hypothesis so far up to step $i$, i.e.,
\[ \mathrm{best}_{\le i} \overset{\Delta}{=} \max \left\{ y \in \cup_{j \le i} B_j \;\middle|\; comp(y) \right\} \tag{11} \]
We update it every time we find a completed hypothesis (if there is none yet, it remains undefined). Prove that, at any step $i$, if $\mathrm{best}_{\le i}$ is defined and the highest-scoring item in the current beam $B_i$ scores worse than or equal to $\mathrm{best}_{\le i}$, then the current best completed hypothesis (i.e., $\mathrm{best}_{\le i}$) is the overall highest-probability completed hypothesis, and no future step can produce a better completed hypothesis. (An illustrative sketch of this stopping rule follows the question.)
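As a purely illustrative rendering of the setup above, the sketch below runs beam search per equation (10) and stops according to the rule in the question: once a completed hypothesis exists and scores at least as well as everything left in the beam. The interface `next_token_probs(x, prefix)`, returning a token-to-probability dictionary, is an assumption, not part of the assignment, and removing finished hypotheses from the beam is one possible reading of "not further expanded".

```python
def beam_search_optimal_stop(x, next_token_probs, b, eos="</s>", max_len=100):
    """Beam search (eq. 10) with the optimality-based stopping rule:
    return best_<=i as soon as it scores >= the top item still in the beam."""
    beam = [((), 1.0)]        # B_0: a single empty prefix with score 1
    best = None               # best_<=i: best completed hypothesis so far
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            for tok, p in next_token_probs(x, prefix).items():
                candidates.append((prefix + (tok,), score * p))
        # keep the top-b scoring extensions, as in equation (10)
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:b]
        # completed hypotheses leave the beam (they are not expanded further)
        finished = [c for c in beam if c[0][-1] == eos]
        beam = [c for c in beam if c[0][-1] != eos]
        for c in finished:
            if best is None or c[1] > best[1]:
                best = c
        # stop: scores only shrink as more probabilities (<= 1) are multiplied in,
        # so no unfinished prefix can overtake the best finished hypothesis
        if best is not None and (not beam or beam[0][1] <= best[1]):
            return best
    return best
```

The test in the final lines is exactly the condition whose correctness you are asked to prove.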
4. [Exploding Gradients: 4 points; Extra credit for 4803]
Learning long-term dependencies in recurrent networks suffers from a particular numerical challenge: gradients propagated over many time steps tend to either 'vanish' (i.e., converge to 0; frequently) or 'explode' (i.e., diverge to infinity; rarely, but with more damage to the optimization). To study this problem in a simple setting, consider the following recurrence relation without any nonlinear activation function or input $x$:
\[ h_t = W h_{t-1} \tag{12} \]
where $W$ is a weight matrix shared across the recurrence at every time step $t$. Let $\lambda_1, \ldots, \lambda_n$ be the eigenvalues of the weight matrix $W \in \mathbb{C}^{n \times n}$. Its spectral radius $\rho(W)$ is defined as:
\[ \rho(W) = \max\{|\lambda_1|, \ldots, |\lambda_n|\} \tag{13} \]
Assuming the initial hidden state is $h_0$, write the relation between $h_T$ and $h_0$, and explain the role of the eigenvalues of $W$ in determining the 'vanishing' or 'exploding' behavior as $T \to \infty$. (A brief numerical illustration follows the question.)
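As a quick numerical illustration (not required for the written answer), the NumPy sketch below iterates equation (12) and shows $\lVert h_T \rVert$ shrinking or blowing up depending on whether the spectral radius of $W$ is below or above 1. Constructing $W$ with a chosen spectral radius via a random orthogonal basis is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def hidden_norm_after(W, h0, T):
    """Iterate h_t = W h_{t-1} (equation (12)) for T steps and return ||h_T||."""
    h = h0
    for _ in range(T):
        h = W @ h
    return np.linalg.norm(h)

n, T = 4, 50
h0 = rng.standard_normal(n)
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]     # random orthogonal basis

for rho in (0.9, 1.0, 1.1):
    # eigenvalues between rho/2 and rho, so the spectral radius is exactly rho
    W = Q @ np.diag(np.linspace(rho / 2, rho, n)) @ Q.T
    print(f"spectral radius {rho}: ||h_T|| = {hidden_norm_after(W, h0, T):.3e}")
# rho < 1: the norm decays toward 0 (vanishing); rho > 1: it grows without bound (exploding).
```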
2 Coding: Sequence models for image captioning [55 regular points
for CS7643, 45 regular points for CS4803]
The coding part of this assignment consists of implementing sequence models for captioning images. To get started, go to https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/8cc2zXixdvAocq4v4upA9A/hw3/