ECBM E6040
INSTRUCTIONS: This homework contains two programming assignments. Submit your work through the Bitbucket repository created for each student. Your submission
should contain the following:
• All figures and discussions; document all parameters you used in the IPython
notebook file, hw3.ipynb, which is already included in the homework 3 repository.
• Commit and push all the changes you made to the skeleton code in the Python
files, hw3a.py and hw3b.py.
As the semester progresses, we are shifting our focus more and more towards programming.
In this homework, you will empirically study various regularization methods for neural
networks, and experiment with different convolutional neural network (CNN) configurations. You should start by going through the Deep Learning Tutorials Project,
especially the LeNet tutorial. The source code provided in the Homework 3 repository is excerpted from logistic_sgd.py, mlp.py, and convolutional_mlp.py.
As in the previous homework, you will be using the same Street View House Numbers (SVHN) dataset [1]. A recent investigation has achieved superior classification
results on the SVHN dataset with above 95% accuracy (by using a CNN with some
modifications) [2].
Instead of reproducing that testing accuracy, your task is to explore the CNN
framework from various points of view.
As in the previous homework, a Python routine called load_data is provided to you
for downloading and preprocessing the dataset. You should use it unless you have
a compelling reason not to. The first time you call load_data, it will take some time
to download the dataset (about 180 MB). If you already have the dataset on the EC2
volume, you should simply reuse it. Please be careful NOT TO commit the dataset
files into the repository. In addition to load_data, you are provided with various
skeleton functions.
Note that all the results, figures, and parameters should be placed inside the IPython
notebook file hw3.ipynb.
PROBLEM a (50 points)
In this problem, you are asked to empirically test several regularization methods for
neural networks, which are discussed in Chapter 7 of the textbook. To better see the
effect of regularization, you will be using a smaller training dataset down-sampled
from the original SVHN dataset (generated by load_data with an additional input
argument ds_rate). The testing dataset remains the same.
You will start by training a neural network model without any regularization, except
optionally with L1 or L2 regularization. The testing result of this model serves as a
baseline for comparison against different regularization methods. If you do use L1 or
L2 regularization for the baseline model, you should also include them with the same
parameters for other models with different regularization methods.
For the neural network, you could use either an MLP or a CNN (from Problem b). A myMLP
class has been provided to you.
i Implement an MLP or a CNN, and train it with the smaller dataset. Then,
train the same model again with the complete dataset. Document your choice of
parameters, and report the testing accuracy in both cases. For MLP, you could
reuse any sets of parameters that you implemented in the previous homework.
ii Noise injection is a common method for regularization when the dataset is
limited. For each example in the smaller dataset, generate several copies
and add a randomly sampled noise vector to each of them. A skeleton function
test_noise_inject_at_input is provided to you. Train the same model from
(i) with the new noisy dataset. Repeat the same procedure with another level of
noise. Document your choice of noise, discuss the testing accuracy, and compare
the result with those in (i).
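As a rough illustration of the procedure in (ii), the sketch below replicates each example and adds zero-mean Gaussian noise to every copy. It uses plain NumPy rather than the skeleton's Theano code. The function name inject_input_noise, the Gaussian noise model, and the parameters n_copies and sigma are assumptions made for this sketch and are not part of the provided skeleton.

```python
import numpy as np

def inject_input_noise(x, y, n_copies=3, sigma=0.05, seed=0):
    """Return n_copies noisy replicas of every row of x, with matching labels.

    x: (n_examples, n_features) float array; y: (n_examples,) label array.
    The Gaussian noise model and its scale sigma are illustrative choices.
    """
    rng = np.random.RandomState(seed)
    x_rep = np.repeat(x, n_copies, axis=0)   # each example repeated n_copies times
    y_rep = np.repeat(y, n_copies, axis=0)   # labels are copied unchanged
    noise = rng.normal(loc=0.0, scale=sigma, size=x_rep.shape)
    return x_rep + noise, y_rep

# Toy data standing in for the down-sampled training set.
x = np.zeros((4, 6))
y = np.array([0, 1, 2, 3])
x_noisy, y_noisy = inject_input_noise(x, y, n_copies=3, sigma=0.1)
```

Changing sigma gives the "another level of noise" asked for above; whether the noisy copies replace or extend the original examples is a design choice to document.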
iii Another way to inject noise is to add it to the weights of the affine transformations between layers. A skeleton function test_noise_inject_at_weight
is provided to you. Train the same model from (i) with the smaller dataset,
but inject noise into the weights after each of the updates (More specifically,
you need to modify the updates routine in the skeleton code). Document your
choice of noise, discuss the testing accuracy, and compare the result with those
in (i).
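The idea in (iii) can be sketched as an ordinary gradient step followed by a random perturbation of the updated weights. This is a NumPy stand-in, not the skeleton's updates routine; the function name sgd_update_with_weight_noise, the Gaussian noise model, and the parameter sigma are assumptions for illustration only.

```python
import numpy as np

def sgd_update_with_weight_noise(W, grad_W, learning_rate, sigma, rng):
    """One SGD step followed by additive Gaussian noise on the weights."""
    W_new = W - learning_rate * grad_W                      # ordinary gradient step
    W_new = W_new + rng.normal(0.0, sigma, size=W.shape)    # perturb after the update
    return W_new

rng = np.random.RandomState(0)
W = np.ones((3, 2))
grad_W = np.full((3, 2), 0.5)
W_plain = sgd_update_with_weight_noise(W, grad_W, 0.1, 0.0, rng)   # sigma = 0: plain SGD
W_noisy = sgd_update_with_weight_noise(W, grad_W, 0.1, 0.01, rng)
```

With sigma set to zero the step reduces to plain SGD, which gives a quick sanity check before comparing against the baseline from (i).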
iv Data augmentation is another way to overcome the limitation of small datasets.
It has been a particularly effective method for object recognition. You are asked
to synthesize new data to augment the smaller dataset, and then train the model
with the synthesized dataset. To do so, you create 4 new examples for each of
the examples in the dataset by translating the example by 1 pixel along four
different directions, and padding zeros to the missing part. If you have other
ideas about data augmentation, you could implement them instead of using the
one described here. A skeleton function test_data_augmentation is provided
to you. Train the same model from (i) with the new dataset. Document your
choice of augmentation, discuss the testing accuracy, and compare the result with those
in (i).
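The translation scheme described in (iv) can be sketched in NumPy as follows. The helper names translate_zero_pad and augment_with_shifts, and the assumed image layout (n_examples, channels, height, width), are hypothetical and may not match the layout used by the skeleton.

```python
import numpy as np

def translate_zero_pad(img, dy, dx):
    """Shift img (..., H, W) by dy rows and dx columns, filling gaps with zeros."""
    h, w = img.shape[-2:]
    out = np.zeros_like(img)
    # Source and destination slices for the region that stays inside the frame.
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    out[..., dst_rows, dst_cols] = img[..., src_rows, src_cols]
    return out

def augment_with_shifts(images):
    """Return the originals plus four 1-pixel translations of each image."""
    shifts = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    return np.concatenate(
        [translate_zero_pad(images, dy, dx) for dy, dx in shifts], axis=0)

img = np.arange(9).reshape(1, 1, 3, 3)
shifted_down = translate_zero_pad(img, 1, 0)
augmented = augment_with_shifts(img)
```

The labels must be repeated in the same order, for example with np.tile(y, 5), so that each translated image keeps the label of its source example.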
v Recent work has shown that one can fool a neural network with adversarial examples [4]. This phenomenon is also discussed in Section 7.13 of the textbook.
You are asked to test and reproduce this phenomenon. To do so, take any
models you trained in previous questions, and compute the gradient of the cost
function with respect to the input (please review Section 7.13 of the textbook).
Then, create an adversarial example by first picking an example with correct
classification and adding an imperceptibly small vector to the input whose elements are equal to the sign of the elements of the gradient. Use the trained
model to classify this adversarial example. Can you fool the model? Discuss
your results, and plot the original input along with the adversarial example and
a bar plot of the class-specific probabilities (output of the neural network) for
the original input and the adversarial example.
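The construction in (v) can be sketched on a small softmax-regression model, used here only as a stand-in for the trained network. The function names, the random weights, and the step size epsilon are assumptions for illustration. For a Theano model, the input gradient would instead come from symbolic differentiation of the cost with respect to the input variable.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(x, label, W, b):
    return -np.log(softmax(x @ W + b)[label])

def input_gradient(x, label, W, b):
    """Gradient of the cross-entropy cost with respect to the input x."""
    p = softmax(x @ W + b)
    p[label] -= 1.0                # dCost/dlogits = p - onehot(label)
    return W @ p                   # chain rule through logits = x W + b

def adversarial_example(x, label, W, b, epsilon=0.01):
    """Add epsilon times the sign of the input gradient to x."""
    return x + epsilon * np.sign(input_gradient(x, label, W, b))

rng = np.random.RandomState(0)
W = rng.normal(size=(8, 3))        # toy weights, not a trained model
b = np.zeros(3)
x = rng.normal(size=8)
label = int(np.argmax(x @ W + b))  # pick the label the model currently predicts
x_adv = adversarial_example(x, label, W, b, epsilon=0.1)
probs_before = softmax(x @ W + b)
probs_after = softmax(x_adv @ W + b)
```

The two probability vectors are what the requested bar plot would show; whether the predicted class actually flips depends on epsilon and on the model.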
PROBLEM b (50 points)
In this problem, you will experiment with convolutional neural networks. The CNN
model is similar to LeNet from the Deep Learning tutorial, with the exception that
it handles images with 3 color channels, whereas LeNet targets grayscale images.
i Implement a CNN with 2 convolutional hidden layers for multi-channel inputs.
First, go through the skeleton function test_lenet() in hw3b.py, and finish
the missing part. After finishing the function, experiment with the parameters,
in particular, the number of filters in hidden layers. Document at least three
different sets of parameters explicitly, and discuss the accuracy of your test
results.
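When experimenting with filter shapes, it helps to track the feature-map size after each layer, since the first fully connected layer must match it. The sketch below assumes a "valid" convolution followed by non-overlapping pooling that drops any border remainder; the function name and the layer settings are hypothetical.

```python
def conv_pool_output_size(input_size, filter_size, pool_size):
    """Side length after a 'valid' convolution and non-overlapping pooling."""
    conv_size = input_size - filter_size + 1   # 'valid' convolution
    return conv_size // pool_size              # pooling, ignoring any border remainder

size = 32                                        # SVHN images are 32x32
for filter_size, pool_size in [(5, 2), (5, 2)]:  # hypothetical layer settings
    size = conv_pool_output_size(size, filter_size, pool_size)
```

Under these assumptions the side length goes 32, 14, 5, so the flattened input to the fully connected layer has (number of second-layer filters) x 5 x 5 entries per image.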
ii Implement a multi-stage CNN, as shown in Figure 1. First, go through the
skeleton function test_convnet() in hw3b.py, and finish the missing part. After finishing the function, experiment with all the parameters, in particular, the
number of filters in hidden layers as well as the shape of filters. Document at
least three different sets of parameters explicitly, and discuss the accuracy of
your test results.
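The branching step of the multi-stage architecture can be sketched in NumPy: the first-stage output is subsampled once more, flattened, and concatenated with the flattened second-stage output. The helper names, the use of max pooling for the extra subsampling, and the tensor shapes are assumptions for illustration, not the skeleton's implementation.

```python
import numpy as np

def max_pool(features, pool):
    """Non-overlapping max pooling over the last two axes of (N, C, H, W)."""
    n, c, h, w = features.shape
    h2, w2 = h // pool, w // pool
    trimmed = features[:, :, :h2 * pool, :w2 * pool]
    return trimmed.reshape(n, c, h2, pool, w2, pool).max(axis=(3, 5))

def multi_stage_features(stage1_out, stage2_out, pool=2):
    """Concatenate downsampled stage-1 features with stage-2 features."""
    branch = max_pool(stage1_out, pool)                 # extra subsampling of the branch
    flat1 = branch.reshape(branch.shape[0], -1)
    flat2 = stage2_out.reshape(stage2_out.shape[0], -1)
    return np.concatenate([flat1, flat2], axis=1)       # input to the classifier

stage1 = np.zeros((2, 4, 14, 14))    # hypothetical stage-1 output
stage2 = np.zeros((2, 8, 5, 5))      # hypothetical stage-2 output
features = multi_stage_features(stage1, stage2)
```

Here the classifier input width is 4 x 7 x 7 + 8 x 5 x 5 = 396, which is the number the first classifier layer has to accept.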
Figure 1: Excerpted from [2]. A 2-stage CNN architecture where multi-stage features
(MS) are fed into a 2-layer classifier. The first stage features are branched out, downsampled again and then concatenated to second stage features.
iii The multi-stage CNN model you implemented in the previous question has a
nonstandard feed-forward structure, but the Theano package is still able to
compute the gradient of the cost function with respect to different parameters
via the back-propagation algorithm. Discuss why the back-propagation algorithm can be applied to this model. You might want to review the section about
the back-propagation algorithm in the textbook.
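One way to make the question concrete is a scalar toy network with the same branching pattern, where the first-stage activation feeds both the second stage and the classifier. The sketch below is an assumed miniature, not the homework model. It computes the gradient with the chain rule, summing the contributions of both paths, and compares it with a finite-difference estimate.

```python
import math

def forward(x, w1, w2, w3, w4):
    h1 = math.tanh(w1 * x)      # stage 1
    h2 = math.tanh(w2 * h1)     # stage 2
    z = w3 * h2 + w4 * h1       # classifier sees both stages
    return h1, h2, z

def loss(x, t, w1, w2, w3, w4):
    return 0.5 * (forward(x, w1, w2, w3, w4)[2] - t) ** 2

def grad_w1(x, t, w1, w2, w3, w4):
    """Back-propagated derivative of the loss with respect to w1."""
    h1, h2, z = forward(x, w1, w2, w3, w4)
    dz = z - t
    dh2 = dz * w3
    dh1 = dh2 * (1 - h2 ** 2) * w2 + dz * w4   # sum over both paths into h1
    return dh1 * (1 - h1 ** 2) * x

args = dict(x=0.7, t=0.3, w2=-1.2, w3=0.8, w4=0.4)
w1, eps = 0.5, 1e-6
analytic = grad_w1(w1=w1, **args)
numeric = (loss(w1=w1 + eps, **args) - loss(w1=w1 - eps, **args)) / (2 * eps)
```

The two values agree closely, which illustrates that a node with several outgoing paths simply accumulates one gradient term per path.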
iv State-of-the-art neural networks for object recognition usually implement
a CNN in cascade with an MLP [3]. Implement a network with two convolutional
layers in cascade with an MLP with 2 hidden layers. Train the model, and
document the testing accuracy. How does this model perform compared to
your implementation of the MLP with 4 hidden layers in Homework 2?
BONUS PROBLEM (25 points)
i A nice advantage of CNNs that separates them from other machine learning models
is that they are capable of learning features all the way from pixels to the classifier,
whereas other methods usually require multiple hand-crafted features. You are
asked to compare the performance of a CNN with hand-picked features versus
one with learned features. Specifically, use a CNN from the first question with
3 filter sets at the input layer (each set has 3 filters, one for each color channel),
and train the whole network. Then, use the same CNN model, but replace
the 3 filter sets of the input layer with your own design (e.g., Gaussian filters).
For each filter set, you can use the same filter for each color channel. Train
the model without updating the designed filters. Document and compare the
testing accuracy of both models, and plot the filters learned via training and
the filters you designed.
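A hand-designed filter bank of the kind mentioned above could be built as follows. This is a minimal NumPy sketch: the function names, the 5x5 filter size, and the three sigma values are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_filter(size=5, sigma=1.0):
    """Return a size x size Gaussian kernel normalized to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def fixed_filter_bank(sigmas=(0.5, 1.0, 2.0), n_channels=3, size=5):
    """Stack one Gaussian per filter set, reused across all color channels."""
    bank = [np.stack([gaussian_filter(size, s)] * n_channels) for s in sigmas]
    return np.stack(bank)   # shape: (n_filter_sets, n_channels, size, size)

bank = fixed_filter_bank()
kernel = gaussian_filter(5, 1.0)
```

To train without updating the designed filters, one option is to leave the first layer's weights out of the list of parameters that receive gradient updates.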
ii Another advantage of a CNN is that it greatly reduces the number of parameters
in the network. Having fewer parameters implies that a CNN can usually be trained
in a shorter time than an MLP with the same number of neurons and layers. You are
asked to compare a CNN and an MLP. In particular, implement a CNN with
two convolution hidden layers, and a fully-connected MLP with three hidden
layers (Note that in a CNN, convolution hidden layers are followed by a fully
connected perceptron and an output layer). For each layer of the MLP, use the
same number of neurons (activation functions) as the corresponding layer in the
CNN. You can reuse the CNN from (i). Document the number of parameters of
both models (total number of entries in all filters for CNN, and total number of
entries in all weight matrices for MLP). Discuss the run-time for both models,
and the testing accuracy.
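The parameter counts requested in (ii) follow directly from the layer shapes. The sketch below counts filter entries for convolution layers and weight-matrix entries for fully connected layers, ignoring biases as the problem statement does. The layer sizes are hypothetical and serve only to make the arithmetic concrete.

```python
def conv_layer_params(n_filters, n_in_channels, filter_h, filter_w):
    """Number of entries in all filters of one convolution layer."""
    return n_filters * n_in_channels * filter_h * filter_w

def dense_layer_params(n_in, n_out):
    """Number of entries in the weight matrix of one fully connected layer."""
    return n_in * n_out

# Hypothetical layer sizes, chosen only to make the arithmetic concrete.
cnn_filters = conv_layer_params(20, 3, 5, 5) + conv_layer_params(50, 20, 5, 5)
mlp_weights = dense_layer_params(3 * 32 * 32, 500) + dense_layer_params(500, 500)
```

With these example sizes the two convolution layers hold 26,500 filter entries, while the first two weight matrices of the MLP alone hold 1,786,000 entries.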
If you have any questions, you are advised to use the Piazza forum, which is accessible
through the Canvas system.
[1] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew
Y. Ng, “Reading Digits in Natural Images with Unsupervised Feature Learning,”
NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011.
[2] Pierre Sermanet, Soumith Chintala, Yann LeCun, “Convolutional Neural Networks Applied to House Numbers Digit Classification,” ICPR 2012.
[3] Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, Bhuvana Ramabhadran, “Deep convolutional neural networks for LVCSR,” ICASSP 2013.
[4] Anh Nguyen, Jason Yosinski, Jeff Clune, “Deep Neural Networks are Easily
Fooled: High Confidence Predictions for Unrecognizable Images,” IEEE CVPR