Assignment 3: k-layer networks

Course: DD2424 - Assignment 3
In this assignment you will train and test k-layer networks with multiple
outputs to classify images (once again) from the CIFAR-10 dataset. You
will upgrade your code from Assignment 2 in two significant ways:
1. Generalize your code so that you can train and test k-layer networks.
2. Incorporate batch normalization into the k-layer network both for
training and testing.
The overall structure of your code for this assignment should mimic that
from Assignment 2. You will mainly just have to modify the functions that
implement the forward and backward passes. As in Assignment 2 you will
train your network with mini-batch gradient descent and cyclical learning
rates. Before the explicit instructions for the assignment, we present the
mathematical details that you will need to complete the assignment. As in
the previous assignment we will train our networks by minimizing a cost
function, a weighted sum of the cross-entropy loss on the labelled training
data and an L2 regularization of the weight matrices (see equation (20) for
the general form), using mini-batch gradient descent.
Background 1: k-layer network
The mathematical details of the first network you will implement are as
follows. Given an input vector, x, of size $d \times 1$, our classifier outputs a vector
of probabilities, p (of size $K \times 1$), for each possible output label.
for $l = 1, 2, \ldots, k-1$
$$s^{(l)} = W_l\, x^{(l-1)} + b_l \qquad (1)$$
$$x^{(l)} = \max(0, s^{(l)}) \qquad (2)$$
and then finally
$$s = W_k\, x^{(k-1)} + b_k \qquad (3)$$
$$p = \mathrm{SOFTMAX}(s) \qquad (4)$$
The equations for the gradient computations of the back-propagation algorithm for a k-layer network are given in Lecture 4. I suggest you implement
the efficient version of the backward pass, as it will make your computations
much faster.
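
To make the recursion in equations (1)-(4) concrete, here is a minimal Matlab sketch of the forward pass. The function name and the choice of cell arrays W and b for the parameters are my own, not prescribed by the assignment, and the code relies on implicit expansion (Matlab R2016b or later) when adding the bias vectors.

    function [P, Xs] = ForwardPass(X, W, b)
    % Minimal sketch of equations (1)-(4).
    % X: d x n input matrix (one column per example).
    % W{l}, b{l}: parameters of layer l = 1, ..., k (cell arrays).
    % Returns P (K x n class probabilities) and Xs, where Xs{l} holds the
    % input to layer l (so Xs{1} = X corresponds to x^(0)).
    k = numel(W);
    Xs = cell(k, 1);
    Xs{1} = X;
    for l = 1:k-1
        s = W{l} * Xs{l} + b{l};          % equation (1)
        Xs{l+1} = max(0, s);              % equation (2), ReLU
    end
    s = W{k} * Xs{k} + b{k};              % equation (3)
    s = s - max(s, [], 1);                % for numerical stability of the softmax
    P = exp(s) ./ sum(exp(s), 1);         % equation (4)
    end
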
Background 2: k-layer network with Batch Normalization
You will discover that training a network becomes tricky as its number of layers
increases. A proper initialization of the weights is key, and $\eta_{\max}$ in the cyclical
learning rate approach may have to be kept relatively small, which makes training slow.
The second part of the assignment is therefore devoted to implementing
batch normalization to overcome these limitations and also to get a feel for
the effect of this process on training. We now give the explicit mathematical
details for batch normalization for a k-layer network.
At test time it is assumed that you have pre-computed
• $\mu^{(l)}$ - an estimated mean for the un-normalized scores $s^{(l)}$ at layer $l$ (has the same size as $s^{(l)}$),
• $v^{(l)}$ - the vector containing the estimated variance for each dimension of $s^{(l)}$.
It is also assumed that you have learnt during training extra parameters
$\gamma_1, \ldots, \gamma_{k-1}$ and $\beta_1, \ldots, \beta_{k-1}$ to scale and shift each entry of the normalized
activations at each layer. Batch normalization, followed by a scale and shift,
is then implemented at test time with these equations:
for $l = 1, 2, \ldots, k-1$
$$s^{(l)} = W_l\, x^{(l-1)} + b_l \qquad (5)$$
$$\hat{s}^{(l)} = \mathrm{BatchNormalize}(s^{(l)}, \mu^{(l)}, v^{(l)}) \qquad (6)$$
$$\tilde{s}^{(l)} = \gamma_l \odot \hat{s}^{(l)} + \beta_l \qquad (7)$$
$$x^{(l)} = \max(0, \tilde{s}^{(l)}) \qquad (8)$$
and then finally
$$s = W_k\, x^{(k-1)} + b_k \qquad (9)$$
$$p = \mathrm{SOFTMAX}(s) \qquad (10)$$
where
$$\mathrm{BatchNormalize}(s^{(l)}, \mu^{(l)}, v^{(l)}) = \mathrm{diag}\left(v^{(l)} + \epsilon\right)^{-\frac{1}{2}} \left( s^{(l)} - \mu^{(l)} \right) \qquad (11)$$
and $\epsilon > 0$ is a small number, of the order of magnitude of Matlab's eps
constant, to ensure you don't divide by 0.
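
Since $\mathrm{diag}(v^{(l)}+\epsilon)^{-1/2}$ only rescales each row, equation (11) can be implemented element-wise without forming the diagonal matrix. A minimal sketch (the exact function signature is my choice; thanks to implicit expansion it also works when s is an $m \times n$ matrix holding a whole mini-batch):

    function s_hat = BatchNormalize(s, mu, v)
    % s: m x 1 un-normalized scores (or m x n for a whole mini-batch),
    % mu, v: m x 1 mean and variance vectors.  Implements equation (11).
    s_hat = (s - mu) ./ sqrt(v + eps);    % eps: Matlab's machine epsilon
    end
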
Forward pass of BN for back-propagation training
During the forward pass of BN training for each mini-batch you also normalize the scores at each layer, but you compute the mean and variances of
the un-normalized scores from the data in the mini-batch. In more detail, assume that we have a mini-batch of data $\mathcal{B} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. At
each layer $1 \le l \le k-1$ you must make the following computations. Compute the un-normalized scores at the current layer $l$ for each example in the
mini-batch
$$s_i^{(l)} = W_l\, x_i^{(l-1)} + b_l \quad \text{for } i = 1, \ldots, n \qquad (12)$$
Then compute the mean and variances of these un-normalized scores
$$\mu^{(l)} = \frac{1}{n} \sum_{i=1}^{n} s_i^{(l)} \qquad (13)$$
$$v_j^{(l)} = \frac{1}{n} \sum_{i=1}^{n} \left( s_{ij}^{(l)} - \mu_j^{(l)} \right)^2 \quad \text{for } j = 1, \ldots, m_l \qquad (14)$$
where $m_l$ is the dimension of the scores at layer $l$. Given the computed
mean and variances we can now normalize the scores for the mini-batch and
subsequently apply ReLU. So for $i = 1, \ldots, n$:
$$\hat{s}_i^{(l)} = \mathrm{BatchNormalize}(s_i^{(l)}, \mu^{(l)}, v^{(l)}) \qquad (15)$$
$$\tilde{s}_i^{(l)} = \gamma_l \odot \hat{s}_i^{(l)} + \beta_l \qquad (16)$$
$$x_i^{(l)} = \max(0, \tilde{s}_i^{(l)}) \qquad (17)$$
The final layer is then applied as usual, for $i = 1, \ldots, n$:
$$s_i = W_k\, x_i^{(k-1)} + b_k \qquad (18)$$
$$p_i = \mathrm{SOFTMAX}(s_i) \qquad (19)$$
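
A sketch of how equations (12)-(17) could be computed at one hidden layer during training, using the BatchNormalize sketch above. The function name, argument order, and the convention that the batch statistics are returned for later use are my own choices:

    function [X_out, S, S_hat, mu, v] = BNLayerForward(X_in, Wl, bl, gammal, betal)
    % X_in: m_{l-1} x n inputs to layer l for the mini-batch,
    % gammal, betal: m_l x 1 scale and shift parameters of layer l.
    S  = Wl * X_in + bl;                       % equation (12)
    n  = size(S, 2);
    mu = mean(S, 2);                           % equation (13)
    v  = var(S, 0, 2) * (n - 1) / n;           % equation (14): divide by n, not n-1
    S_hat = BatchNormalize(S, mu, v);          % equation (15)
    X_out = max(0, gammal .* S_hat + betal);   % equations (16)-(17)
    end
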
Backward pass of BN for back-propagation training
As we have applied score normalization, plus a scaling and a shifting, during
the forward pass we have to compensate for these in the backward pass of the
back-propagation algorithm. As per usual let J represent the cost function
for the mini-batch, that is
$$J(\mathcal{B}, \lambda, \Theta) = \frac{1}{n} \sum_{i=1}^{n} \ell_{\mathrm{cross}}(x_i, y_i, \Theta) + \lambda \sum_{i=1}^{k} \|W_i\|^2 \qquad (20)$$
From the forward pass of the back-prop algorithm you should store
$$X^{(l)}_{\mathrm{batch}} = \left[ x^{(l)}_1, x^{(l)}_2, \ldots, x^{(l)}_n \right], \quad S^{(l)}_{\mathrm{batch}} = \left[ s^{(l)}_1, s^{(l)}_2, \ldots, s^{(l)}_n \right], \quad \hat{S}^{(l)}_{\mathrm{batch}} = \left[ \hat{s}^{(l)}_1, \hat{s}^{(l)}_2, \ldots, \hat{s}^{(l)}_n \right]$$
and $\mu^{(l)}$, $v^{(l)}$ for the intermediary layers $l = 1, \ldots, k-1$ and then also the
final probability vectors output for each example in the batch:
$$P_{\mathrm{batch}} = \left[ p_1, p_2, \ldots, p_n \right]$$

Given these quantities it is then possible to compute the gradient for all the
parameters that have to be learnt in the network:
• Propagate the gradient through the loss and softmax operations
$$G_{\mathrm{batch}} = -\left(Y_{\mathrm{batch}} - P_{\mathrm{batch}}\right) \qquad (21)$$
• The gradients of $J$ w.r.t. bias vector $b_k$ and $W_k$
$$\frac{\partial J}{\partial W_k} = \frac{1}{n}\, G_{\mathrm{batch}} \left( X^{(k-1)}_{\mathrm{batch}} \right)^T + 2\lambda W_k, \qquad \frac{\partial J}{\partial b_k} = \frac{1}{n}\, G_{\mathrm{batch}}\, \mathbf{1}_n \qquad (22)$$
• Propagate $G_{\mathrm{batch}}$ to the previous layer
$$G_{\mathrm{batch}} = W_k^T G_{\mathrm{batch}} \qquad (23)$$
$$G_{\mathrm{batch}} = G_{\mathrm{batch}} \odot \mathrm{Ind}\left( X^{(k-1)}_{\mathrm{batch}} > 0 \right) \qquad (24)$$
• For $l = k-1, k-2, \ldots, 1$
1. Compute gradient for the scale and offset parameters for layer $l$:
$$\frac{\partial J}{\partial \gamma_l} = \frac{1}{n} \left( G_{\mathrm{batch}} \odot \hat{S}^{(l)}_{\mathrm{batch}} \right) \mathbf{1}_n, \qquad \frac{\partial J}{\partial \beta_l} = \frac{1}{n}\, G_{\mathrm{batch}}\, \mathbf{1}_n \qquad (25)$$
2. Propagate the gradients through the scale and shift
$$G_{\mathrm{batch}} = G_{\mathrm{batch}} \odot \left( \gamma_l \mathbf{1}_n^T \right) \qquad (26)$$
3. Propagate $G_{\mathrm{batch}}$ through the batch normalization
$$G_{\mathrm{batch}} = \mathrm{BatchNormBackPass}\left( G_{\mathrm{batch}}, S^{(l)}_{\mathrm{batch}}, \mu^{(l)}, v^{(l)} \right) \qquad (27)$$
4. The gradients of $J$ w.r.t. bias vector $b_l$ and $W_l$
$$\frac{\partial J}{\partial W_l} = \frac{1}{n}\, G_{\mathrm{batch}} \left( X^{(l-1)}_{\mathrm{batch}} \right)^T + 2\lambda W_l, \qquad \frac{\partial J}{\partial b_l} = \frac{1}{n}\, G_{\mathrm{batch}}\, \mathbf{1}_n \qquad (28)$$
5. If $l > 1$ propagate $G_{\mathrm{batch}}$ to the previous layer
$$G_{\mathrm{batch}} = W_l^T G_{\mathrm{batch}} \qquad (29)$$
$$G_{\mathrm{batch}} = G_{\mathrm{batch}} \odot \mathrm{Ind}\left( X^{(l-1)}_{\mathrm{batch}} > 0 \right) \qquad (30)$$
where the function $\mathrm{BatchNormBackPass}\left( G_{\mathrm{batch}}, S^{(l)}_{\mathrm{batch}}, \mu^{(l)}, v^{(l)} \right)$ corresponds
to the following steps:
$$\sigma_1 = \left( (v_1^{(l)} + \epsilon)^{-.5}, \ldots, (v_m^{(l)} + \epsilon)^{-.5} \right)^T \qquad (31)$$
$$\sigma_2 = \left( (v_1^{(l)} + \epsilon)^{-1.5}, \ldots, (v_m^{(l)} + \epsilon)^{-1.5} \right)^T \qquad (32)$$
$$G_1 = G_{\mathrm{batch}} \odot \left( \sigma_1 \mathbf{1}_n^T \right) \qquad (33)$$
$$G_2 = G_{\mathrm{batch}} \odot \left( \sigma_2 \mathbf{1}_n^T \right) \qquad (34)$$
$$D = S^{(l)}_{\mathrm{batch}} - \mu^{(l)} \mathbf{1}_n^T \qquad (35)$$
$$c = \left( G_2 \odot D \right) \mathbf{1}_n \qquad (36)$$
$$G_{\mathrm{batch}} = G_1 - \frac{1}{n} \left( G_1 \mathbf{1}_n \right) \mathbf{1}_n^T - \frac{1}{n}\, D \odot \left( c\, \mathbf{1}_n^T \right) \qquad (37)$$
assuming $v^{(l)} = (v_1^{(l)}, \ldots, v_m^{(l)})^T$. Remember that the $i$th column of $G_{\mathrm{batch}}$
sent into $\mathrm{BatchNormBackPass}$ represents $\frac{\partial J}{\partial \hat{s}_i^{(l)}}$ while the $i$th column of $G_{\mathrm{batch}}$
returned by $\mathrm{BatchNormBackPass}$ represents $\frac{\partial J}{\partial s_i^{(l)}}$.
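
Equations (31)-(37) translate almost line by line into Matlab. A sketch, again relying on implicit expansion over the columns (the function name follows the notation above, but the exact signature is my choice):

    function G = BatchNormBackPass(G, S, mu, v)
    % G:  m x n matrix whose columns are dJ/d(s_hat_i) for the mini-batch.
    % S:  m x n un-normalized scores, mu, v: m x 1 batch mean and variance.
    % Returns the m x n matrix whose columns are dJ/d(s_i).
    n      = size(S, 2);
    sigma1 = (v + eps) .^ (-0.5);                    % equation (31)
    sigma2 = (v + eps) .^ (-1.5);                    % equation (32)
    G1 = G .* sigma1;                                % equation (33)
    G2 = G .* sigma2;                                % equation (34)
    D  = S - mu;                                     % equation (35)
    c  = sum(G2 .* D, 2);                            % equation (36)
    G  = G1 - sum(G1, 2) / n - D .* c / n;           % equation (37)
    end
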
You should note that the network's bias parameters $b_l$ for $l = 1, \ldots, k-1$
are superfluous when using batch normalization as you will subtract away
these biases when you normalize. These bias parameters will be estimated
as effectively zero vectors when you train.
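
For reference, here is a sketch of the whole backward pass of equations (21)-(30). It assumes the forward pass stored the quantities listed above in cell arrays (with the convention that Xs{l} holds the input to layer l, i.e. X^(l-1)_batch) and uses the BatchNormBackPass sketch; all the names here are my own choices.

    function [gradW, gradb, gradgamma, gradbeta] = BackwardPass( ...
            Y, P, W, gammas, Xs, Ss, S_hats, mus, vs, lambda)
    % Y, P: K x n one-hot labels and predicted probabilities for the mini-batch.
    k = numel(W);
    n = size(Y, 2);
    gradW = cell(k, 1);  gradb = cell(k, 1);
    gradgamma = cell(k-1, 1);  gradbeta = cell(k-1, 1);

    G = -(Y - P);                                        % equation (21)
    gradW{k} = G * Xs{k}' / n + 2 * lambda * W{k};       % equation (22)
    gradb{k} = sum(G, 2) / n;
    G = W{k}' * G;                                       % equation (23)
    G = G .* (Xs{k} > 0);                                % equation (24)

    for l = k-1:-1:1
        gradgamma{l} = sum(G .* S_hats{l}, 2) / n;       % equation (25)
        gradbeta{l}  = sum(G, 2) / n;
        G = G .* gammas{l};                              % equation (26)
        G = BatchNormBackPass(G, Ss{l}, mus{l}, vs{l});  % equation (27)
        gradW{l} = G * Xs{l}' / n + 2 * lambda * W{l};   % equation (28)
        gradb{l} = sum(G, 2) / n;
        if l > 1                                         % equations (29)-(30)
            G = W{l}' * G;
            G = G .* (Xs{l} > 0);
        end
    end
    end
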
Exponential moving average for batch means and variances
When you train your network with batch normalization you should keep
an exponential moving average estimate of the mean and variances for the
un-normalized scores for each layer that will be used during test time. You
can achieve this, after each forward pass of the mini-batch gradient descent
algorithm (which generates a new $\mu^{(l)}$ and $v^{(l)}$), by setting:
for $l = 1, \ldots, k-1$
$$\mu^{(l)}_{\mathrm{av}} = \alpha\, \mu^{(l)}_{\mathrm{av}} + (1 - \alpha)\, \mu^{(l)} \qquad (38)$$
$$v^{(l)}_{\mathrm{av}} = \alpha\, v^{(l)}_{\mathrm{av}} + (1 - \alpha)\, v^{(l)} \qquad (39)$$
where $\alpha \in (0, 1)$ and typically $\alpha \approx .9$ (training in this assignment is shorter
than usual, which calls for a smaller value of $\alpha$). You can initialize $\mu^{(l)}_{\mathrm{av}}$ to be
equal to the $\mu^{(l)}$ obtained from the very first mini-batch update step, and
similarly for $v^{(l)}_{\mathrm{av}}$.
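
In code, the updates in equations (38)-(39) amount to a couple of lines per layer, run right after each forward pass on a training mini-batch. In this sketch mus{l} and vs{l} are the statistics just computed from the mini-batch, and mu_av, v_av are cell arrays (my names) holding the running estimates, initialized from the first mini-batch as described above:

    alpha = 0.9;                      % see the discussion of alpha above
    for l = 1:k-1
        mu_av{l} = alpha * mu_av{l} + (1 - alpha) * mus{l};   % equation (38)
        v_av{l}  = alpha * v_av{l}  + (1 - alpha) * vs{l};    % equation (39)
    end
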
Exercise 1: Upgrade Assignment 2 code to train & test k-layer networks
In Assignment 2 you wrote code to train and test a 2-layer neural network.
For the first part of this assignment you should upgrade your code from
Assignment 2 so that you can train and test a k-layer network. If you have
a decent architecture for your code, this should not involve too much coding.
You will need to refine the functions and data structures that you use
1. to store and initialize the parameters of your network (see the initialization sketch after this list),
2. to apply the network to input vectors and keep a record of the intermediary scores when you apply the network (the $x^{(l)}$'s in equation (2))
(forward pass),
3. to compute the gradient of the cost function for a mini-batch relative
to the parameters of the network using the gradient equations in the
lecture notes (backward pass).
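
For item 1, one possible layout (a sketch only; the variable names d, hidden_sizes and K and the use of cell arrays are my choices) is to keep one weight matrix and bias vector per layer and apply He initialization:

    % hidden_sizes, e.g. [50 50]; d = input dimension; K = number of classes
    dims = [d, hidden_sizes, K];
    k = numel(dims) - 1;                              % number of layers
    W = cell(k, 1);  b = cell(k, 1);
    for l = 1:k
        W{l} = randn(dims(l+1), dims(l)) * sqrt(2 / dims(l));   % He initialization
        b{l} = zeros(dims(l+1), 1);
    end
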
When you have upgraded your code you should debug the gradient computations and check them numerically as previously. As before please only do
numerical checks on networks with a small number of nodes in each layer
and input data of much reduced dimensionality (d ≈ 10) to avoid numerical
precision issues. You should start with a 2-layer network, then a 3-layer
network and then finally a 4-layer network. You’ll probably notice that the
discrepancy between the analytic and the numerical gradients increases for
the earlier layers as the gradient is back-propagated through the network.
You can re-read the relevant section of the Additional material for lecture
3 from Stanford's course Convolutional Neural Networks for Visual
Recognition for tips on the potential issues. But remember to make
your checks initially with lambda=0. Once you have convinced yourself that
your analytic gradient computations are bug free then you should continue
with the assignment.
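
When comparing the analytic and numerical gradients, a relative-error check along the following lines is usually more informative than the absolute difference (the threshold of about 1e-6 is a rule of thumb, not part of the assignment):

    % grad_a, grad_n: analytic and numerical gradients for one parameter matrix
    rel_err = abs(grad_a - grad_n) ./ max(eps, abs(grad_a) + abs(grad_n));
    fprintf('max relative error: %e\n', max(rel_err(:)));
    % values around 1e-6 or smaller usually indicate the analytic gradient is fine
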
Exercise 2: Can I train multi-layer networks?
First check, with the new version of your code, that you can replicate the (default)
results you achieved in Assignment 2 with a 2-layer network with 50 nodes
in the hidden layer using mini-batch gradient descent with a cyclic learning
rate. If the answer is yes, then your next task is to train a 3-layer network
with 50 and 50 nodes in the first and second hidden layer respectively with
the same learning parameters. You can use the following hyper-parameter
setting: n_batch=100, eta_min = 1e-5, eta_max = 1e-1, lambda=.005, two
cycles of training and n_s = 5 * 45,000 / n_batch. You should use a careful initialization such as Xavier or He initialization. With these settings my
trained network after two cycles got a test accuracy of ∼52%. I also randomly shuffle the order of the training data after each epoch. Now consider
a 9-layer network whose number of nodes at the hidden layers are [50, 30, 20,
20, 10, 10, 10, 10]. This network has the same number of weight parameters
as the earlier network. Train the network with the same hyper-parameter
settings as before and see what happens to performance.
For the deeper network, performance drops by quite a bit. Generally,
as a network becomes deeper it becomes harder to train with
variants of mini-batch gradient descent and a more standard decay of
the learning rate. The technique of batch normalization is a way to overcome
this difficulty.
Exercise 3: Implement batch normalization
You have seen that training networks with many layers can be tricky. In the
lecture notes I told you batch normalization overcomes this problem. Now
it’s your turn to implement it.
First, consider the forward pass where you apply the network to the input
data in a mini-batch. You will have written, for the first part of this assignment, a function that evaluates the network on a mini-batch of input data
and returns the probability scores and the intermediary activations (for each
hidden layer) for each example in the mini-batch. For batch normalization
you will need to augment your code so that it implements equations (12) -
(19) (and returns the intermediary vectors needed by the backward pass).
In the first version of your new function you should write it assuming the
layer means and variances are computed from the mini-batch of data sent
into the function. You will, however, also call this function at test time and
in this case it is assumed that the un-normalized scores are normalized by
known pre-computed means and variances that have been estimated during
training. Thus you should write a final version of the function so that it
can take a variable number of inputs depending on whether you send it precomputed means and variances or not. You can do this in Matlab using the
varargin cell structure. Use the help command to get more details.
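
One way (of several) to handle the optional inputs is sketched below as a hypothetical helper that returns the mean and variance to normalize with at one layer; in your code the same varargin dispatch would sit inside the network evaluation function:

    function [mu, v] = LayerStats(S, varargin)
    % S: m x n un-normalized scores at one layer for a mini-batch.
    % With no extra arguments, compute the statistics from the mini-batch
    % (training).  If pre-computed mean and variance are passed in, use
    % those instead (test time).
    if numel(varargin) == 2
        mu = varargin{1};
        v  = varargin{2};
    else
        n  = size(S, 2);
        mu = mean(S, 2);
        v  = var(S, 0, 2) * (n - 1) / n;   % divide by n, as in the BN paper
    end
    end
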
Note: If you store your un-normalized scores for a batch at the $l$th layer
in the matrix scores (this would correspond to $S^{(l)}_{\mathrm{batch}}$ in the mathematical
description) of size $m \times n$, where $n$ is the number of examples in the mini-batch, then this Matlab code will compute the variance for each dimension:
var_scores = var(scores, 0, 2);
The Matlab function var computes the variances by dividing the relevant
sum-of-squares quantities by n-1; however, in the original batch normalization paper it is assumed the variance is computed by dividing by n instead.
The back-propagation equations in the lecture slides assume the latter, therefore you will have to compensate for this fact by applying:
var_scores = var_scores * (n-1) / n;
Next up is implementation of the backward pass. You should upgrade the
functions in the first part of the assignment to implement equations (21)-
(30). Once you have completed this then it is time to check your analytic
gradient computations as per usual. Just a couple of tips:
• When you compute the loss in the numerical calculation of the gradient
you have to apply the network function to the mini-batch data. When
you do this you have to apply batch normalization and you should, as
in your analytic gradient computations, compute the means and variances of the un-normalized scores from the mini-batch data.
• Make sure your mini-batch has size > 1. You want to make sure your
mean and variance computations are okay.
You should check with a 2-layer network (with 50 hidden nodes) and then
a 3-layer network (with 50 and 50 hidden nodes respectively). After you
have convinced yourself that your gradient computations are okay then you
should move on to training your network. (The numerical gradient computations from Assignment 2 have to be augmented with the numerical gradient
computations w.r.t. the parameters $\gamma_l$ and $\beta_l$.)
There is just one upgrade you need to make in the top level function implementing the mini-batch gradient descent learning algorithm. You need to
keep an exponential moving average of the batch mean and variances for the
un-normalized scores for each layer of your network as defined by equations
(38) and (39). You should use these moving averages when you compute the
cost and accuracy on the training and validation sets after each epoch.
You should train a 3-layer network with 50 and 50 nodes in the first and
second hidden layers respectively. You should train as in Assignment 2
with cyclic learning rate. I achieved quite good results (when using 45,000
training examples) with He initialization and hyper-parameter settings of
eta_min = 1e-5, eta_max = 1e-1, lambda=.005, two cycles of training and
n_s = 5 * 45,000 / n_batch. With this set-up I was able to achieve test
accuracies of ∼53.5%. (To reach this level of accuracy when using BN it
seems to be important that you shuffle the order of your training samples
after each epoch. I think the reason for this is that it ensures you have
different combinations of training examples in your batches over epochs and
this is good for regularization and estimating the mean and standard deviation of the activations at each layer.) You should perform a
coarse-to-fine search to find a good value for lambda. After you have found
a good setting for lambda, you should train a network for 3 cycles and see
what test accuracy this network can achieve.
Now reconsider the 9-layer network whose number of nodes at the hidden
layers are [50, 30, 20, 20, 10, 10, 10, 10] respectively. Train the network
with the same hyper-parameter settings as your 3-layer network and see
what happens to performance. Not bad! Hopefully this result will convince
you that batch normalization is a good thing!
The frequently stated pros of batch normalization are that training becomes
more stable, that higher learning rates can be used than when batch
normalization is not used, and that it acts as a form of regularization. I would like
you to explore whether you can get some experimental evidence that is consistent
with one of these stated pros. You will train your 3-layer network with 50
nodes at each hidden layer, and the basic hyper-parameter setting will be eta_min
= 1e-5, eta_max = 1e-1, lambda=.005, two cycles of training with n_s =
5 * 45,000 / n_batch.
Sensitivity to initialization
For each training regime, instead of using He initialization, initialize each weight parameter to be normally distributed
with sigma equal to the same value sig at each layer. For three runs
set sig=1e-1, 1e-3 and 1e-4 respectively, train the network with and
without BN, and see the effect on the final test accuracy. Use n_s = 2 *
45,000 / n_batch if training is slow on your machine and you want to
complete training with fewer update steps.
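
A sketch of the modified initialization, reusing the dims vector and layer count k from the earlier initialization sketch (sig is the standard deviation shared by all layers):

    sig = 1e-1;                                   % then repeat with 1e-3 and 1e-4
    for l = 1:k
        W{l} = sig * randn(dims(l+1), dims(l));   % same sigma at every layer
        b{l} = zeros(dims(l+1), 1);
    end
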
To complete the assignment:
To pass the assignment you need to upload to bilda:
1. The code for your k-layer network trained and tested with batch normalization assembled into one file.
2. A brief pdf report with the following content:
i) State how you checked your analytic gradient computations and
whether you think that your gradient computations are bug free
for your k-layer network with batch normalization.
ii) Include graphs of the evolution of the loss function when you
train the 3-layer network with and without batch normalization
with the given default parameter setting.
iii) Include graphs of the evolution of the loss function when you
train the 6-layer network with and without batch normalization
with the given default parameter setting.
iv) State the range of the values you searched for lambda when you
tried to optimize the performance of the 3-layer network trained
with batch normalization, and the lambda settings for your best
performing 3-layer network. Also state the test accuracy achieved
by this network.
v) Include the loss plots for the training with Batch Norm vs no
Batch Norm for the experiment related to Sensitivity to initialization, and comment on your experimental findings.
Exercise 4: Optional for bonus points
1. Optimize the performance of the network
It would be interesting to discover the best possible performance achievable by a k-layer fully connected network on CIFAR-10.
From a quick search of the web it seems the best reported performance of a fully
connected network on CIFAR-10 is 78%. The details of this network
are available in How far can we go without convolution: Improving
fully connected networks by Lin, Memisevic and Konda.
Here are some tricks/avenues you can explore to help bump up performance:
(a) Do a more exhaustive random search to find good values for the amount of
regularization.
(b) Do a more thorough search to find a good network architecture. Does making
the network deeper improve performance?
(c) It has been empirically reported in several works that you get better performance from the final network if you apply batch normalization to the scores
after the non-linear activation function has been applied. You could investigate whether this is the case. You will have to update your forward and
backward pass of the back-prop algorithm accordingly.
(d) Apply dropout to your training if you have a high number of hidden nodes
and you feel you need more regularization.
(e) Augment your training data by applying small random geometric and photometric jitter to the original training data. You can do this on the fly by
applying a random jitter to each image in the mini-batch before doing the
forward and backward pass.
Bonus Points Available: 1 bonus point for each non-trivial improvement (capped at 4 bonus points; you can follow my suggestions, think
of your own, or do some combination of the two).
To get the bonus point you must submit
(a) Your code.
(b) A pdf document reporting on your trained network with the best
test accuracy, what improvements you made and which ones brought
the largest gains (if any!).