
CS539 Natural Language Processing with Deep Learning – Homework 4
Attention Mechanisms in Sequence-to-Sequence Models
Overview and Objectives. In this homework, we'll build some intuition for scaled dot-product attention and implement a simple attention mechanism for an RNN-based sequence-to-sequence model on a small-scale machine translation
task. If you finish early, work on your projects!
How to Do This Assignment. The assignment has a few math questions and then walks you through completing
the provided skeleton code and analyzing some of the results. Anything requiring you to do something is marked as a
"Task" and has associated points listed with it. You are expected to turn in both your code and a write-up answering
any task that requested written responses. Submit a zip file containing your completed skeleton code and a PDF of
your write-up to Canvas.
Advice. Start early. Students will need to become familiar with PyTorch for this and future assignments. Extra time
may be needed to get used to working remotely on the GPU cluster here. You can also use GPU-enabled runtimes in
Colab (colab.research.google.com).
1 Scaled Dot-Product Attention [8 pts]
To warm up and start getting more familiar with scaled dot-product attention mechanisms, we'll do some exploratory math first.¹ Recall from the lecture the definition of a single-query scaled dot-product attention mechanism. Given a query $q \in \mathbb{R}^{1 \times d}$, a set of candidates represented by keys $k_1, \ldots, k_m \in \mathbb{R}^{1 \times d}$ and values $v_1, \ldots, v_m \in \mathbb{R}^{1 \times d_v}$, we compute the scaled dot-product attention as:

$$\alpha_i = \frac{\exp\left(q k_i^T / \sqrt{d}\right)}{\sum_{j=1}^{m} \exp\left(q k_j^T / \sqrt{d}\right)} \tag{1}$$

$$a = \sum_{j=1}^{m} \alpha_j v_j \tag{2}$$
where the $\alpha_i$ are referred to as attention values (or collectively as an attention distribution). The goal of this section
is to get a feeling for what is easy for attention to compute and why we might want something like multi-headed
attention to make some computations easier.
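As a quick sanity check of Eqs. (1) and (2), here is a minimal PyTorch sketch of single-query scaled dot-product attention for one example. The function and variable names are illustrative only; they are not part of the assignment code.

import torch

def single_query_attention(q, K, V):
    # q: [1, d], K: [m, d], V: [m, d_v]
    d = q.shape[-1]
    scores = (q @ K.T) / (d ** 0.5)        # [1, m], scaled dot products from Eq. (1)
    alpha = torch.softmax(scores, dim=-1)  # [1, m], attention distribution
    a = alpha @ V                          # [1, d_v], weighted sum from Eq. (2)
    return a, alpha

q = torch.randn(1, 4)
K = torch.randn(3, 4)
V = torch.randn(3, 8)
a, alpha = single_query_attention(q, K, V)  # a: [1, 8]; alpha sums to 1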
▶ TASK 1.1 Copying [2pts] Describe (in one or two sentences) what properties of the keys and queries would result in the output $a$ being equal to one of the input values $v_j$. Specifically, what must be true about the query $q$ and the keys $k_1, \ldots, k_m$ such that $a \approx v_j$? (We assume all values are unique: $v_i \neq v_j$, $\forall i \neq j$.)
▶ TASK 1.2 Average of Two [2pts] Consider a set of key vectors $\{k_1, \ldots, k_m\}$ where all keys are orthogonal unit vectors; that is, $k_i^T k_j = 0$, $\forall i \neq j$ and $\|k_i\| = 1$, $\forall i$. Let $v_a, v_b \in \{v_1, \ldots, v_m\}$ be two value vectors. Give an expression for a query vector $q$ such that the output $a$ is approximately equal to the average of $v_a$ and $v_b$, that is, $a \approx \frac{1}{2}(v_a + v_b)$. You can reference the key vectors corresponding to $v_a$ and $v_b$ as $k_a$ and $k_b$ respectively. Note that due to the softmax in Eq. 1, the output won't ever actually reach this value, but you can make it arbitrarily close by adding a scaling constant to your solution.
In the previous task, we saw that it is possible for single-headed attention to focus equally on two values. This can
easily be extended to any subset of values. In the next question we'll see why it may not be a practical solution.
¹These questions are adapted from CS224n at Stanford because I just liked the originals too much not to use them.
▶ TASK 1.3 Noisy Average [2pts] Now consider a set of key vectors $\{k_1, \ldots, k_m\}$ where keys are randomly scaled such that $k_i = \mu_i \lambda_i$, where $\lambda_i \sim \mathcal{N}(1, \alpha)$ is a randomly sampled scalar multiplier. Assume the unscaled vectors $\mu_1, \ldots, \mu_m$ are orthogonal unit vectors. If you use the same strategy to construct the query $q$ as you did in Task 1.2, what would be the outcome here? Specifically, derive $q k_a^T$ and $q k_b^T$ in terms of the $\mu$'s and $\lambda$'s. Qualitatively describe how the output $a$ would vary over multiple resamplings of $\lambda_1, \ldots, \lambda_m$.
As we just saw in 1.3, for certain types of noise that either scale (shown here) or change the orientation of the
keys, single-head attention may have difficulty combining multiple values consistently. Multi-head attention can help.
▶ TASK 1.4 Noisy Average with Multi-head Attention [2pts] Let's now consider a simple version of multi-head attention that averages the attended features resulting from two different queries. Here, two queries are defined ($q_1$ and $q_2$), leading to two different attended features ($a_1$ and $a_2$). The output of this computation will be $a = \frac{1}{2}(a_1 + a_2)$. Assuming keys like those in Task 1.3, design queries $q_1$ and $q_2$ such that $a \approx \frac{1}{2}(v_a + v_b)$.
2 Attention in German-to-English Machine Translation [12 pts]
In this part of the homework, we are going to get some experience implementing attention mechanisms on top of
familiar components. In HW2 we used bidirectional encoders for Part-of-Speech tagging and in HW3 we decoded
unconditional language models. Here we’ll combine these into a sequence-to-sequence model that translates German
sentences to English. The skeleton code already provides the data loading (using torchtext and spacy), training /
evaluation infrastructure, and encoder/decoder model structure.
To keep training time short (∼5-10 minutes), we are using a small-scale translation dataset called Multi30k [1]
that contains 31,014 bitext sentence pairs describing common visual scenes in both German and English (split across
train, val, and test). It is intended to support multimodal translation (i.e. utilizing an image of the described scene to
make translation easier), but we will just use it as a text problem for simplicity.
Students can look through the provided code for the implementation details; however, the computation our model
performs is summarized in the following equations. Consider a training example consisting of a source-language
sentence $w_1, \ldots, w_T$ and a target-language sentence $m_1, \ldots, m_L$. Let $w_t$ and $m_i$ be one-hot word encodings.
Encoder. Our encoder is a simple bidirectional LSTM. While we write the forward and backward networks
separately, PyTorch implements this as a single API.
$$z_t = W_e w_t \qquad \text{(Word Embedding)} \tag{3}$$

$$\overrightarrow{h}^{(e)}_t, \overrightarrow{c}^{(e)}_t = \overrightarrow{\mathrm{LSTM}}\left(z_t, \overrightarrow{h}^{(e)}_{t-1}, \overrightarrow{c}^{(e)}_{t-1}\right) \qquad \text{(Forward LSTM)} \tag{4}$$

$$\overleftarrow{h}^{(e)}_t, \overleftarrow{c}^{(e)}_t = \overleftarrow{\mathrm{LSTM}}\left(z_t, \overleftarrow{h}^{(e)}_{t+1}, \overleftarrow{c}^{(e)}_{t+1}\right) \qquad \text{(Backward LSTM)} \tag{5}$$

$$h^{(e)}_t = \left[\overrightarrow{h}^{(e)}_t, \overleftarrow{h}^{(e)}_t\right] \;\; \forall t \qquad \text{(Word Representations)} \tag{6}$$

$$h^{(e)}_{\text{sent.}} = \mathrm{ReLU}\left(W_e\left[\overrightarrow{h}^{(e)}_T, \overleftarrow{h}^{(e)}_0\right] + b_e\right) \qquad \text{(Sentence Representation)} \tag{7}$$
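For reference, the following is a minimal PyTorch sketch of an encoder matching Eqs. (3)-(7). The class and variable names and dimensions are illustrative assumptions, not the provided skeleton code.

import torch
import torch.nn as nn

class SketchEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, enc_hid_dim, dec_hid_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)            # Eq. (3)
        self.rnn = nn.LSTM(emb_dim, enc_hid_dim, bidirectional=True)  # Eqs. (4)-(5)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)             # Eq. (7)

    def forward(self, src):                # src: [src_len, batch]
        z = self.embedding(src)            # [src_len, batch, emb_dim]
        outputs, (h_n, c_n) = self.rnn(z)  # outputs: [src_len, batch, 2 * enc_hid_dim], Eq. (6)
        # h_n[-2] is the forward pass's final state; h_n[-1] is the backward pass's
        # final state (i.e., the backward state at the first source position)
        sent = torch.relu(self.fc(torch.cat((h_n[-2], h_n[-1]), dim=1)))  # Eq. (7)
        return outputs, sent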
Decoder. Our decoder is a unidirectional LSTM that performs an attention operation over the encoder word representations at each time step (Eq. 12). Notice that we initialize the decoder hidden state with the overall sentence
encoding from the encoder.
$$h^{(d)}_0 = h^{(e)}_{\text{sent.}} \qquad \text{(Initialize Decoder)} \tag{9}$$

$$b_i = W_d m_i \qquad \text{(Word Embedding)} \tag{10}$$

$$h^{(d)}_i, c^{(d)}_i = \mathrm{LSTM}\left(b_i, h^{(d)}_{i-1}, c^{(d)}_{i-1}\right) \qquad \text{(Forward LSTM)} \tag{11}$$

$$a_i = \mathrm{Attn}\left(h^{(d)}_i, h^{(e)}_1, \ldots, h^{(e)}_T\right) \qquad \text{(Attention)} \tag{12}$$

$$P(m_{i+1} \mid m_{\leq i}, w_1, \ldots, w_T) = \mathrm{softmax}\left(W_d\left[a_i, h^{(d)}_i\right] + b_d\right) \qquad \text{(Prob. of Next Word)} \tag{13}$$
Our explorations in this assignment involve implementing and contrasting different choices for the $\mathrm{Attn}(q, c_1, \ldots, c_T)$
module above. All other elements of the encoder-decoder architecture have already been implemented.
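To make the data flow concrete, here is a hedged sketch of a single decoding step that consumes such an Attn module, following Eqs. (9)-(13). The names and dimensions are illustrative assumptions rather than the provided skeleton.

import torch
import torch.nn as nn

class SketchDecoderStep(nn.Module):
    def __init__(self, vocab_size, emb_dim, enc_hid_dim, dec_hid_dim, attn):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)               # Eq. (10)
        self.rnn = nn.LSTM(emb_dim, dec_hid_dim)                         # Eq. (11)
        self.attn = attn                                                 # Eq. (12)
        self.out = nn.Linear(enc_hid_dim * 2 + dec_hid_dim, vocab_size)  # Eq. (13)

    def forward(self, token, hidden, cell, encoder_outputs):
        # token: [batch]; hidden, cell: [1, batch, dec_hid_dim]
        # encoder_outputs: [src_len, batch, 2 * enc_hid_dim]
        b = self.embedding(token).unsqueeze(0)                       # [1, batch, emb_dim]
        _, (hidden, cell) = self.rnn(b, (hidden, cell))              # Eq. (11)
        a, alpha = self.attn(hidden.squeeze(0), encoder_outputs)     # Eq. (12)
        logits = self.out(torch.cat((a, hidden.squeeze(0)), dim=1))  # Eq. (13); softmax is applied in the loss
        return logits, hidden, cell, alpha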
▶ TASK 2.1 Scaled Dot-Product Attention [8pts] Implement single-query scaled dot-product attention as defined in equations (1) and (2) by completing the SingleQueryScaledDotProductAttention class in mt_driver.py. The skeleton code is below:

class SingleQueryScaledDotProductAttention(nn.Module):
    # kq_dim is the dimension of keys and queries. Linear layers should be used
    # to project inputs to these dimensions.
    def __init__(self, enc_hid_dim, dec_hid_dim, kq_dim=512):
        super().__init__()
        ...

    # hidden is h_i^{(d)} from Eq. (11) and has dim => [batch_size, dec_hid_dim]
    # encoder_outputs is the word representations from Eq. (6)
    # and has dim => [src_len, batch_size, enc_hid_dim * 2]
    def forward(self, hidden, encoder_outputs):
        ...

        assert attended_val.shape == (hidden.shape[0], encoder_outputs.shape[2])
        assert alpha.shape == (hidden.shape[0], encoder_outputs.shape[0])
        return attended_val, alpha
The forward function takes two inputs, hidden and encoder_outputs, corresponding to the decoder hidden state $h^{(d)}_i$ and the encoder word representations $h^{(e)}_t, \forall t$. These should be converted to queries, keys, and values:

$$q = W_q h^{(d)}_i \tag{14}$$

$$k_t = W_k h^{(e)}_t \tag{15}$$

$$v_t = h^{(e)}_t \tag{16}$$

The outputs, attended_val and alpha, correspond to the attended vector $a_i$ and the vector of attention values ($\alpha$). The expected dimensions are asserted in the skeleton. Note that this is intended to be a batched operation while the equations above are written for a single instance; torch.bmm can be very useful here (see the sketch after this task for one way to set up the batched operation).
Train this model by executing python mt_driver.py. Record the perplexity and BLEU score on the test
set. These are automatically generated in the script and printed after training.
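As a reference for the batching, here is a minimal sketch of batched single-query scaled dot-product attention using torch.bmm. The function and tensor names are hypothetical; the projections producing q and K stand in for whatever linear layers you define in __init__.

import torch

def batched_single_query_attention(q, K, V):
    # q: [batch, kq_dim]           -- one projected query per example
    # K: [batch, src_len, kq_dim]  -- projected encoder keys
    # V: [batch, src_len, val_dim] -- encoder values (unprojected)
    kq_dim = q.shape[-1]
    scores = torch.bmm(K, q.unsqueeze(2)).squeeze(2) / (kq_dim ** 0.5)  # [batch, src_len]
    alpha = torch.softmax(scores, dim=1)                                # attention distribution
    attended = torch.bmm(alpha.unsqueeze(1), V).squeeze(1)              # [batch, val_dim]
    return attended, alpha

Note that encoder_outputs in the skeleton arrives as [src_len, batch, enc_hid_dim * 2], so it would need to be permuted to batch-first (and projected to keys) before a call like this.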
After implementing the scaled dot-product attention mechanism, running python mt_driver.py will train the model with your attention mechanism. The code saves the checkpoint with the lowest validation loss seen during training.² Afterwards, it will report BLEU on the test set and produce a number of examples (printed in the console) with attention diagrams (saved in the examples folder) like those shown below, where brighter blocks indicate higher attention values over source word encodings (columns) used when generating translated words (rows). Remember that these encodings also carry information about the rest of the sentence, as they come from the bidirectional encoder.

²The code can be run with --eval to load an existing checkpoint and run inference.
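If you want to render your own attention diagrams outside the provided script, a minimal matplotlib sketch looks like the following; the function and variable names are hypothetical.

import matplotlib.pyplot as plt

def plot_attention(alpha, src_tokens, trg_tokens, path="attention.png"):
    # alpha: [trg_len, src_len] array of attention weights for one sentence
    fig, ax = plt.subplots()
    ax.imshow(alpha, cmap="gray")  # brighter = higher attention
    ax.set_xticks(range(len(src_tokens)))
    ax.set_xticklabels(src_tokens, rotation=45)
    ax.set_yticks(range(len(trg_tokens)))
    ax.set_yticklabels(trg_tokens)
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)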
▶ TASK 2.2 Attention Diagrams [1pt] Search through the attention diagrams produced by your model. Include a few examples in your report and characterize common patterns you observe. Note that German is (mostly) a Subject-Object-Verb language, so you may find attention patterns that indicate inversion of word order when translating to Subject-Verb-Object English, as in the 2nd example above.
The code also implements two baseline options for the attention mechanism: a Dummy attention that just outputs zero tensors, which effectively attends to no words, and a MeanPool attention that just averages the source word encodings, which effectively attends equally to all words. The code will use these if run with the --attn none and --attn mean arguments, respectively.
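For intuition, a MeanPool-style baseline could look roughly like the sketch below; this is illustrative only, and the assignment's actual baseline classes live in the provided code.

import torch
import torch.nn as nn

class MeanPoolSketch(nn.Module):
    # Attends equally to every source position: alpha_t = 1 / src_len for all t.
    def forward(self, hidden, encoder_outputs):
        # hidden: [batch, dec_hid_dim]; encoder_outputs: [src_len, batch, enc_hid_dim * 2]
        src_len, batch_size, _ = encoder_outputs.shape
        attended_val = encoder_outputs.mean(dim=0)          # [batch, enc_hid_dim * 2]
        alpha = torch.full((batch_size, src_len), 1.0 / src_len,
                           device=encoder_outputs.device)   # uniform attention weights
        return attended_val, alpha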
▶ TASK 2.3 Comparison [3pts] Train and evaluate models with the Dummy and MeanPool 'attention' mechanisms. Report the mean and variance over three runs for these baselines and for your implementation of scaled dot-product attention. Discuss the observed trends.
▶ TASK 2.EC Beam Search and BLEU [2pts] This is an extra credit question and is optional. In the previous homework, we implemented several decoding algorithms; however, in this assignment we just use greedy top-1 decoding in the translate_sentence function. Adapt your implementation of beam search from HW3 to work on this model by augmenting translate_sentence (which is used when computing BLEU). Report BLEU scores on the test set for the scaled dot-product attention model with B = 5, 10, 20, and 50.
References
[1] D. Elliott, S. Frank, K. Sima'an, and L. Specia, "Multi30k: Multilingual English-German Image Descriptions," in Proceedings of the 5th Workshop on Vision and Language, pp. 70–74, 2016.