Programming Assignment 1
CSE 253: Deep Learning
Instructions
1. Please submit your assignment on Gradescope. The instructions for this will be coming soon. There are two
components to this assignment: mathematical solutions/proofs with English explanations (Part I: Problems
1 and 2), and a programming portion (Part II: Problems 1, 2, and 3), for which you will be writing a
report in a conference paper format, reporting your findings. All parts of the assignment
must be typeset, including figures. We strongly recommend that you use some dialect of TeX or LaTeX and
use the NIPS (one of the top ML/DL conferences) format, but this is not required. You may also use Word if you
so choose, and figures may be generated with Excel or Python, so long as they are computer generated. We
will not be accepting any handwritten work - this includes the “written part.” NIPS templates in
LaTeX and Word are available from the 2015 NIPS format site. The page limits mentioned there don’t apply.
2. For the group report, include an informative title, author list, and an abstract. The abstract should summarize
briefly what you did, and the best percent correct you got on each problem. The report should have clear
sections for the three programming parts, as well as a fourth section where each team member says
what their contribution to the project was. The report should be well-organized with an introduction,
background (if you review previous work), methods, results, and discussion for each programming part. Figures
should be near where they are referenced, there should be informative captions on figures, clearly specified
axes and figure keys, etc.
3. You are expected to use Python (usually with NumPy). You also need to submit all of the source code files
and a readme.txt file that includes detailed instructions on how to run your code.
You should write clean code with consistent format, as well as explanatory comments, as this code may be
reused in the future.
4. Using any off-the-shelf code is strictly prohibited.
5. If you end up dropping the class and your teammate does not, you are expected to help your
teammate anyway! Please don’t leave your teammate(s) without your assistance. Being on a team means
just that: teamwork! When you join a team, you have made a commitment. Please honor it.
6. Any form of copying, plagiarizing, grabbing code from the web, having someone else write your code for you,
etc., is cheating. We expect you all to do your own work, and when you are on a team, to pull your weight.
Team members who do not contribute will not receive the same scores as those who do. Discussions of course
materials and homework solutions are encouraged, but you should write the final solutions alone. Books,
notes, and Internet resources can be consulted, but not copied from. Working together on homework must
follow the spirit of the Gilligan’s Island Rule (Dymond, 1986): No notes can be made (or recording of
any kind) during a discussion, and you must watch one hour of Gilligan’s Island or something equally insipid
before writing anything down. Suspected cheating has been and will be reported to the UCSD Academic
Integrity office.
Part I
Problems to be solved and turned in individually
For this part we will not be accepting handwritten reports. Please use LaTeX or Word for your report. MathType is
a handy tool for equations in Word. The free version (MathType Lite) has everything you need.
1. Problems from Bishop (20 points)
Work problems 1-4 (5 points each) on pages 28-30 of Bishop. Note: In Equation 1.51, the argument of exp
should be (−?²/σ²). Correction in Q1.2: perpendicular to one of the faces, not the edge.
2. Logistic Regression (5 points)
Logistic regression is a binary classification method. Intuitively, logistic regression can be conceptualized as
a single neuron reading in a d-dimensional input vector x ∈ R^d and producing an output y between 0 and
1 that is the system’s estimate of the conditional probability that the input is in some target category. The
“neuron” is parameterized by a weight vector w ∈ R^{d+1}, where w_0 represents the bias term (a weight from a
unit that has a constant value of 1).
Consider the following model parametrized by w:

y = P(C_1 | x) = 1 / (1 + exp(−w^T x)) = g(w^T x)    (1)

P(C_0 | x) = 1 − P(C_1 | x) = 1 − y,    (2)
where we assume that x has been augmented by a leading 1 to represent the bias input. With the model so
defined, we now define the Cross-Entropy cost function, equation 3, the quantity we want to minimize over
our training examples:
E(w) = −∑_{n=1}^{N} { t^n ln(y^n) + (1 − t^n) ln(1 − y^n) }    (3)
Here, t^n ∈ {0, 1} is the label or teaching signal for example n (t^n = 1 represents x^n ∈ C_1). We minimize this
cost function via gradient descent.
To do so, we need to derive the gradient of the cost function with respect to the parameters w_j. Assuming
we use the logistic activation function g as in equation 1, prove that this gradient is:

−∂E(w)/∂w_j = ∑_{n=1}^{N} (t^n − y^n) x_j^n    (4)
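Once you have both the derivation and, later, an implementation, a quick way to validate them against each other is a finite-difference check. Here is a minimal sketch, assuming NumPy; the helper names are ours, not part of the assignment:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def loss(w, X, t):
    # Cross-entropy E(w) of equation 3; rows of X are bias-augmented inputs.
    y = sigmoid(X @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def grad(w, X, t):
    # Analytic gradient: dE/dw_j = -sum_n (t^n - y^n) x_j^n
    y = sigmoid(X @ w)
    return -X.T @ (t - y)

# Compare against centered finite differences on random data.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 3))])  # leading 1 = bias
t = rng.integers(0, 2, size=5).astype(float)
w = rng.normal(size=4)
num = np.zeros_like(w)
for j in range(w.size):
    e = np.zeros_like(w)
    e[j] = 1e-6
    num[j] = (loss(w + e, X, t) - loss(w - e, X, t)) / 2e-6
print(np.max(np.abs(num - grad(w, X, t))))  # should be ~1e-8 or smaller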
Part II
Programming Assignment
Requirements: Write your own Logistic Regression and Softmax Regression classifiers using Python, following
the instructions below.
You are allowed to use the following Python libraries and packages: NumPy, PIL, matplotlib, os, and random.
You are not allowed to use any SciPy implementations of logistic or softmax regression, or any other high-level
machine learning packages (including, but not limited to, TensorFlow, PyTorch, Keras, etc.).
Note: Yes, I may have said in class that this first programming assignment would be individual - but it is a fair amount
of work in 10 days. If you want to do it individually, that’s ok for this assignment, but for the second assignment,
you will be required to find a team to join. You may as well figure that out now!
For this assignment, a “team” is defined as two or three people.
1. Load and preprocess the data (5 points).
Dataset: In this problem, we will use logistic regression to separate faces of different emotions (for example,
discriminate happy faces from sad faces). We will use facial expressions from the Extended Cohn-Kanade
Dataset (CK+) paper. We are providing you with two sets of data from that dataset: resized and aligned.
The resized set is simply re-sized from the original. This is provided to show you what happens with PCA
when the data is not aligned. The aligned set has been preprocessed to align the faces. The six facial
expressions included in the portions of the dataset we are providing you are: anger, disgust, fear, sadness,
surprise, and happiness. A zip version of these datasets has been uploaded to Piazza under “Resources”, as
well as a basic dataloader: the output of the dataloader is stored as a Python dictionary with facial-expression
keys, whose corresponding values are lists of NumPy arrays for the images associated with each expression. Since
the number of images varies per expression, we have also provided a function “balanced_sampler” in
dataloader.py that samples a balanced dataset over a specified group of emotions. Calling this function
will ensure that you have nearly the same number of images for each facial expression. You don’t have to use
this function; you may write your own. (A minimal usage sketch follows.)
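For instance, loading a balanced two-emotion subset might look like the sketch below; the exact signatures in the provided dataloader.py may differ from what we assume here, so adapt it to the version posted on Piazza:

from dataloader import load_data, balanced_sampler

data_dir = './aligned'                  # or './resized'
emotions = ['happiness', 'anger']       # the subset for one experiment

# dataset: dict mapping each emotion to a list of NumPy image arrays
dataset, cnt = load_data(data_dir)
# images restricted to the requested emotions, balanced across them
balanced = balanced_sampler(dataset, cnt, emotions=emotions)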
2. Cross Validation Procedure (5 pts for a correct implementation).
We will use the method of k-fold cross-validation to estimate our model’s performance on unseen data. For
each problem below, you should divide the data into k mutually exclusive sets. Each set should contain a
representative sample with respect to the problem. For example, we will have you recognize happy and sad
faces. In this case, it would not be good to have one of the sets contain all happy faces or all sad faces.
Basically, you want to have roughly equal amounts of each category in each of the k sets. Don’t worry if they
don’t divide up perfectly equally; but make sure they are mutually exclusive! This is a common error - some
researchers have mistakenly included some of the training data in the test data! This is Very Bad. Don’t do
it!
Repeat k times: Choose two of the sets to be the holdout (or validation) and test sets. Then, train your model
on the remaining k − 2 sets. Each of the sets should be the test set once and the holdout set once.
Pseudocode for this is in the Training Procedure below.
We use the holdout set in a special way: it is used to check whether your model is overfitting to the training
set. The error on the holdout set should go down as you train the model, even though it is not being used to
change the weights. However, at some point, the holdout error should start to rise. This means the model is
overfitting, and you should stop training. This is called early stopping. At that point, you run the trained
model on the unseen test set, and record the performance. (For this problem, in our experiments, the holdout
error never rises - except on the unaligned data; see below - so perhaps this problem is too easy to display
overfitting.)
It is possible that the holdout set error never goes up over the M epochs. This will happen if the problem
is too easy or if the holdout set is too easy. So you should also have some limit on the number of training
steps, say 100 passes through the complete training data. One pass through all of the training data is called
an epoch. One way to build the mutually exclusive, roughly balanced sets described above is sketched below.
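Here is a minimal sketch of such a stratified split, assuming NumPy and integer class labels; the function name is ours, not part of the provided code:

import numpy as np

def stratified_folds(labels, k, seed=0):
    # Returns k mutually exclusive index arrays with roughly equal
    # counts of each class in every fold.
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        for i, chunk in enumerate(np.array_split(idx, k)):
            folds[i].extend(chunk.tolist())
    return [np.array(f) for f in folds]

# Mutually exclusive and collectively exhaustive - check it!
labels = np.array([0, 1] * 25)
folds = stratified_folds(labels, k=10)
assert sum(len(f) for f in folds) == len(labels)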
3. Principal Components Analysis (5 pts for a correct implementation).
Below, we ask you to perform dimensionality reduction on the images first, by using a linear dimensionality
reduction technique called Principal Components Analysis, or PCA to its friends. Note also, it is PrincipAL
Components Analysis, not PrincipLE Components Analysis. I personally hate that typo! You should have
realized that the role of the holdout set is to stand in for the unseen test set, to avoid overfitting. Hence,
the PCA in each case should only be performed on the training set. In real life, we would not have access
to the test set; we would have to apply the PCA computed on our training data to any new data to reduce its
dimensionality. So, we should treat the holdout set the same way: we will apply the PCA computed on
the training set to the holdout and test sets without change.
You will be doing PCA many times, so we should do it as efficiently as possible! It is ok to use
an eigenvalue/eigenvector routine, but see the Supplementary Section at the end to find out how to do this
efficiently, by using what my lab used to call the “Turk and Pentland trick,” because we learned it from a
paper by Turk and Pentland. The description of this trick is provided in the supplementary section at the
end of this document. Try keeping three different numbers of components (e.g., when training on all emotions,
try 10, 20, and 40; when training on only two emotions, try a much smaller number), and report your results.
4. What to report for each experiment (except where noted): Loss and Performance Plots (10 pts)
As described above, you need to divide the dataset into a training set, a validation (or hold-out) set, and a
test set, k times. (When you are just developing your model to see if it works, though, you can just use one
training/holdout/test split until you are sure your code is working.)
For each question in this assignment, we ask you to repeat your experiment with k = 10, so we will be
using 10-fold cross validation. During your training process, you want to update your model parameters with
training set, and save the parameters of the model with the lowest validation loss as your best model, which
you then use to measure performance on the test set. This is described in the “Training Procedure” below.
In order to give you insight into the process (and check that your code is working), we want you to plot the
average training and holdout error, and its standard deviation, at each epoch. The average is over the 10 runs
of cross-validation. Since the holdout error should be lowest at different points during training with different
folds of the data, pick some maximum number of epochs M for all of the runs and run all models for that
maximum number. Here we suggest M =50.
We also ask you to report your final performance in the text of your report. This should not be based on your
best performance on the test set! That’s called “Data Snooping”, and it is also Very Bad - don’t do it! What we
mean by final performance is the average over the 10 runs, where in each of the 10 runs, you record the
error on the test set at the point when the error on the holdout set is lowest. It is this test set error that should
be averaged when reporting your final performance. Here by “error”, we mean the loss, i.e.,
the cross-entropy. In order to normalize this over datasets with different numbers of examples and different
numbers of outputs, you should report the average loss over the number of examples, N, and the number of
outputs, c.
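In other words, with E the total cross-entropy from equation 3 (or equation 7 below), the number to report is:

E_reported = E / (N · c)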
The error isn’t a particularly intuitive measure of performance, except that we do want it to be low. Hence, we
also want plots of percent correct on each set. The percent correct is determined by recording how often
the model would choose the correct answer for an image.
To be clear, then, each experiment with one setting of the hyperparameters (learning rate and number of
principal components) will result in two plots: a plot of the average loss and a plot of the performance over
the 50 training epochs, for the training set and the holdout set. These training and holdout curves should be
easily distinguishable in your plots by using different colors or solid and dashed lines. We also want you to
plot the standard deviation of your result as error bars every 10 epochs (i.e., at 10, 20, ..., 50 epochs). See
this page for an example of a plot with error bars.
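A minimal matplotlib sketch of the kind of plot we mean; the loss arrays here are synthetic stand-ins with shape (10 runs, 50 epochs):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
epochs = np.arange(1, 51)
# Synthetic stand-ins; your arrays come from the 10 cross-validation runs.
train_loss = 0.7 * np.exp(-0.05 * epochs) + 0.02 * rng.random((10, 50))
holdout_loss = 0.75 * np.exp(-0.04 * epochs) + 0.03 * rng.random((10, 50))

marks = np.arange(9, 50, 10)  # indices of epochs 10, 20, ..., 50
for loss, style, name in [(train_loss, 'b-', 'training'),
                          (holdout_loss, 'r--', 'holdout')]:
    mean, std = loss.mean(axis=0), loss.std(axis=0)
    plt.plot(epochs, mean, style, label=name)
    plt.errorbar(epochs[marks], mean[marks], yerr=std[marks],
                 fmt='none', ecolor=style[0])
plt.xlabel('epoch')
plt.ylabel('average cross-entropy loss')
plt.legend()
plt.title('Loss averaged over the 10 folds')
plt.show()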
5. Logistic Regression (25 points)
Here, we build and evaluate classifiers on the data. We will experiment with discerning between two classes
of data, and with multi-class classification.
(a) Implement Logistic Regression via Gradient Descent. (5 points)
Now, without using any high-level machine learning libraries, implement logistic regression. Here, you’ll
be using batch gradient descent, and will only need one logistic output unit. (Think about why we only
need one if we’re classifying two classes.) The most points will be given for clean, well-documented code.
A sketch of both descent variants appears after Algorithm 1 below.
1: procedure Training Procedure
2:     folds ← k mutually exclusive splits of the data
3:     for fold = 1 to k do
4:         val, test ← folds[fold], folds[(fold+1) mod k]
5:         train ← the remaining folds
6:         perform PCA on train
7:         project all data onto the top p train PCs, scaled as in the text
8:         for epoch = 1 to M do
9:             train the model on train, and evaluate the performance on val
10:            record train and val loss and accuracy each epoch (for plotting)
11:            save the best model based on val performance
12:        use the best model to record loss and accuracy on test
13:    average test loss and accuracy over all k trials    ▷ Put in a table in your report
14:    plot the average and standard deviation of the training and validation curves
Algorithm 1 Two Approaches to Gradient Descent

1: procedure Batch Gradient Descent
2:     w ← 0
3:     for t = 1 to M do    ▷ Here, t is one epoch.
4:         w_{t+1} = w_t − α ∑_{n=1}^{N} ∇E^n(w_t)
5:     return w

1: procedure Stochastic Gradient Descent
2:     w ← 0
3:     for t = 1 to M do    ▷ Here, t is one epoch.
4:         randomize the order of the indices into the training set
5:         for n = 1, ..., N do
6:             w_{t+1} = w_t − α ∇E^n(w_t)
7:     return w
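To make the two procedures concrete for the logistic case, here is a minimal NumPy sketch; the function and variable names are ours, and X is assumed to be bias-augmented (a leading column of 1s):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_batch(X, t, lr=0.01, epochs=50):
    # Batch gradient descent: one weight update per epoch.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y = sigmoid(X @ w)
        w += lr * X.T @ (t - y)     # -dE/dw = sum_n (t^n - y^n) x^n
    return w

def train_sgd(X, t, lr=0.01, epochs=50, seed=0):
    # Stochastic gradient descent: one weight update per example.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(X.shape[0]):
            y = sigmoid(X[n] @ w)
            w += lr * (t[n] - y) * X[n]
    return w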
(b) Evaluate the model on Happiness vs Anger using the resized dataset. (5 points)
Here we ask you to perform logistic regression on the happy and angry faces from the resized dataset.
For this experiment only, you don’t need to do cross-validation! Just do one run with an
80/10/10 percent split of the data. Then plot the usual curves and report the percent correct on
your test set. Since there is only one run, you don’t need to plot standard deviation, as there won’t be
any! On this dataset, the holdout error should go down and then up again, quite early in training, so
be sure to save your best weights to use for the test set when the holdout set error is at its lowest point.
Your results should be poor. Why is that? To understand why, present a Figure showing the first 4
principal components plotted as images.
(c) Evaluate on Happiness vs Anger on the aligned dataset. (10 points)
Here, repeat the above experiment, using 10-fold cross-validation, on the aligned dataset.
i. As described above, plot the average loss curves over the 10 runs for the training and holdout sets,
including standard deviation error bars. If the error blows up, your learning rate is too high.
Use early stopping based on the holdout set error. I.e., as described above, save the weights any
time the holdout set error is less than it has been before. If it goes back up again, you still have the
best weights. However, we have found that it doesn’t go up again for this data.
For one of the principal components calculations, present a figure with the first four principal components
presented as images. These should look very different from the ones in the unaligned case!
Make sure that your graph is well-labeled (i.e., x and y axes labeled with what they are) with
a key, and with a figure caption that clearly states what is being shown in the graph. This should
be the case for any graph you plot.
ii. Report the test accuracy after training. Here, by “accuracy”, we mean the percent correct when
choosing the category according to the rule above (y > 0.5 ⇒ “Happy”). Again, this should be an
average over the 10 runs, with the standard deviation in parentheses (e.g., like this: 94.4% (1.2)).
iii. Repeat the above for 3 different learning rates (i.e., 3 total). Try and find one that’s too high, one
that’s too small, and one that’s just right. Plot the training set error (cross-entropy loss) for the
three different learning rates on one plot, including the standard deviation as above. For your report
above, use the best learning rate you found.
(d) Evaluate on Fear vs Surprise on the aligned dataset. (5 points)
i. Plot the cross-entropy loss over training epochs averaged over the ten runs. Use the best learning
rate you found above.
ii. Report the test accuracy averaged over the 10 runs.
iii. Does this differ from what we observed above? If so, why do you think that is?
6. Implement Softmax Regression via Gradient Descent. (25 points)
Softmax regression is the generalization of logistic regression for multiple (c) classes. Now, given an input
x^n, softmax regression will output a vector y^n, where each element y_k^n represents the probability that x^n is
in class k:

y_k^n = exp(a_k^n) / ∑_{k′} exp(a_{k′}^n)    (5)

a_k^n = w_k^T x^n    (6)

Here, a_k^n is called the net input to output unit y_k. Equation 5 is called the softmax activation function, and
it is a generalization of the logistic activation function. For softmax regression, we use a one-hot encoding of
the targets. That is, the targets form a c-dimensional vector, where the k-th element for example n (written t_k^n)
is 1 if the input is from category k, and 0 otherwise. Note that each output has its own weight vector w_k. With
our model defined, we now define the cross-entropy cost function for multiple categories in Equation 7:
E = −∑_n ∑_{k=1}^{c} t_k^n ln y_k^n    (7)
Again, taking the average of this over the number of training examples normalizes this error over different
training set sizes. Also averaging over the number of categories c makes it independent of the number of
categories. Please take the average over both when reporting results. Surprisingly, it turns out that the
learning rule for softmax regression is basically the same as the one for logistic regression! The gradient is:
−∂E^n(w)/∂w_{jk} = (t_k^n − y_k^n) x_j^n    (8)

where w_{jk} is the weight from the j-th input to the k-th output.
Now, we’ll modify our network to classify all six emotion categories (anger, disgust, fear, happiness,
sadness, and surprise). To achieve multi-class classification, we’ll need more output units, and use the
gradient derived for Softmax Regression. As a sanity check, make sure your outputs are all positive and sum
to 1. You will be using one-hot encoding of the targets here. When choosing the category for evaluating
percent correct, you just choose the maximum output (as in Lecture 1).
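For reference, here is a minimal NumPy sketch of the softmax activation and the gradient in equation 8; the helper names are ours, and subtracting the row maximum before exponentiating is a standard trick for numerical stability:

import numpy as np

def softmax(A):
    # A: N x c matrix of net inputs a_k^n; stable softmax, row by row.
    A = A - A.max(axis=1, keepdims=True)
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

def neg_gradient(X, T, W):
    # -dE/dW with X: N x (d+1), T: N x c one-hot targets, W: (d+1) x c.
    Y = softmax(X @ W)
    return X.T @ (T - Y)    # equation 8, summed over the examples

# Sanity check from the text: outputs all positive and each row sums to 1.
rng = np.random.default_rng(0)
Y = softmax(rng.normal(size=(4, 6)))
assert (Y > 0).all() and np.allclose(Y.sum(axis=1), 1.0)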
(a) Evaluate your network on all six emotions. (10 points)
Again, here, just choose one of the happy faces for each person so that there is the same amount of data
for each emotion category. What do you think would happen if we used all of the happy faces? (You
can try it if you like). Follow the same instructions as for Happiness vs. Anger - repeat this ten times
using a different subject for the test set each time, use early stopping with a hold-out set, etc., plotting
the standard deviation at 10, 20, 30, 40, and 50 epochs.
i. Plot the loss on the training set as well as the loss on the holdout set vs. the number of epochs of gradient
descent. In a case like this, we naturally find the maximum output, say y_i^n for input pattern x^n, and
choose the i-th category as the system’s choice. Again, plot the average over 10 runs.
ii. Create a table that is a 6 × 6 confusion matrix for this data based on the test set results. The
emotions should be listed along the top and left side of the matrix. The confusion matrix will have
36 entries C_ij, where i is the correct emotion, j is the emotion chosen by the network, and C_ij is
the percentage of the time that j was chosen for i. Hence the rows should add up to 100%. A perfect system will
have 100% down the diagonal and 0’s everywhere else. Your system will most likely not be perfect!
(b) Batch versus stochastic gradient descent. (5 pts)
i. Using your softmax network, implement the stochastic gradient descent algorithm above. Note that you
now have another for loop inside the "epoch" for loop. To randomize the patterns, use an indirect
indexing method, i.e., have an integer array of N indices, and permute the order of that array. Let’s
call this array P. So initially, P[i] = i. Then on each epoch, you simply permute the integers in P.
Let’s call the N × d matrix of input patterns Input. For each epoch, as i goes from 1 to N on the
innermost loop, we train on the input Input[P[i]] instead of Input[i] (see the sketch after this item).
ii. Plot the training set loss over training epochs, with one curve for the batch mode above, and one for
stochastic gradient descent. Is stochastic gradient descent faster, in terms of minimizing the error
in fewer epochs? Explain your result.
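The indirect indexing from item i, as a minimal sketch; Input, M, and the update step are stand-ins:

import numpy as np

rng = np.random.default_rng(0)
Input = rng.normal(size=(8, 3))    # stand-in for the N x d pattern matrix
M = 2                              # stand-in for the number of epochs

N = Input.shape[0]
P = np.arange(N)                   # initially, P[i] = i
for epoch in range(M):
    rng.shuffle(P)                 # permute the integers in P each epoch
    for i in range(N):
        x = Input[P[i]]            # train on Input[P[i]] instead of Input[i]
        # ... one stochastic gradient update using x goes here ...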
(c) Visualize the weights (5 points).
Since there are as many weights as there are pixels, we can visualize the weights. Using one of your
trained networks, transform the learned weights for each emotion into the range 0-255 (ignore the bias
weight). That is, simply linearly scale them so that the minimum weight for each emotion is 0 and the
maximum is 255. (This will be different for each emotion.) Then, visualize these weights as an image.
What do you see? Explain why you end up seeing this. Include the six images in your report as a
well-labeled figure.
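A minimal sketch of the scaling and display, assuming square images of side s and a weight matrix W of shape (d+1) x 6 with the bias in row 0; both names are ours:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
s = 32                                  # stand-in image side length
W = rng.normal(size=(s * s + 1, 6))     # stand-in trained weights

fig, axes = plt.subplots(1, 6, figsize=(12, 2))
for k, ax in enumerate(axes):
    w = W[1:, k]                        # drop the bias weight
    img = (w - w.min()) / (w.max() - w.min()) * 255.0
    ax.imshow(img.reshape(s, s), cmap='gray', vmin=0, vmax=255)
    ax.set_axis_off()
plt.show()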
(d) Extra Credit (5 points).
So far, we have asked you to use a balanced dataset for your model training. Can you perform a softmax
regression using the entire dataset using techniques that handle the imbalanced nature of the dataset?
7. Individual Contributions to the Project
Don’t forget to include a paragraph at the end of your report, one per team member, describing what that
team member’s contribution to the project was.
Supplementary Section: Principal Component Analysis
In this programming assignment, we are asking you to perform principal component
analysis (PCA) as a dimension reduction technique on your training images. Here
are a few more hints on how to implement your PCA. Have the Turk & Pentland
article (posted under resources - go to the bottom of “Readings” and click on “See
all 14 resources” and it will show up!) in hand, as I am going to refer to it a lot in
what follows.
On page 74 (page 4 of the pdf), lower left side, they start to explain PCA. They
explain it the “normal way” first (a method that results in a huge matrix - the way not
to do it), then they explain what we used to call “the Turk Pentland trick”, although
it’s been known to mathematicians for about 70 years. I didn’t explain this before
because most PCA routines automatically choose the smaller dimension (using the
notation in the article), either M, the number of images, or d, the dimension of
the images (number of pixels). In the example in the paper, they are assuming the
images are N × N, so d = N2
1. The first step in either method is to subtract the mean face from every face.
This is described in the first four lines of the paragraph starting with "Let the
training set of face images be...". The Φ_i’s are the centered data.
2. Equation 3 shows constructing the covariance matrix. In this equation, the Φ_n’s
are column vectors, so this is the outer product of the two vectors, resulting in an
N² × N² matrix. They then rewrite this as AA^T, where A is a matrix composed
of the normalized images as column vectors (although this is missing a factor of
1/M, you will get the same eigenvectors, just longer). It is the eigenvectors of
this matrix that are the principal component vectors.
3. Note that there can’t be more than M − 1 meaningful eigenvectors: one point in this high-
dimensional space (i.e., one image) can’t define a principal component, as it is
just a point, but 2 points can, etc. The “Turk and Pentland trick” is shown in
equation 4, where instead of AA^T, which is N² × N², they form the matrix A^T A,
which is M × M, a much smaller matrix. [A is d × M, so A^T is M × d (where
d is the number of pixels in the images), so A^T A is an M × d matrix times a
d × M matrix, i.e., A^T A is M × M.] So if you have 10 images, this matrix will be
10 × 10. They find the eigenvectors of that matrix (v_i), i.e., vectors such that
A^T A v_i = λ_i v_i, where λ_i is the eigenvalue.
4. Equation 5 shows that the A v_i’s are actually the eigenvectors of the original huge
matrix C. So, to find the eigenvectors of that huge matrix C, you never actually
form that matrix. Instead, you form the matrix A^T A, find the eigenvectors of
that, and then the A v_i’s are the eigenvectors of C. Now, A is d × M and v_i is M × 1,
so A v_i is d × 1, i.e., it is a column vector of the same dimension as the images
when they are treated as a long column vector. Note: You need to sort the
eigenvectors by their eigenvalues: the eigenvector with the largest eigenvalue is
the first principal component, the one with the next largest eigenvalue is the second
principal component, etc.
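Putting the steps above together, a minimal NumPy sketch of the trick might look like this; the function name and the row-major image layout are our assumptions:

import numpy as np

def pca_turk_pentland(images, p):
    # images: M x d array, one flattened training image per row.
    # Returns the mean face, the top-p principal components as the
    # columns of a d x p matrix, and their eigenvalues.
    mean = images.mean(axis=0)
    A = (images - mean).T                 # d x M, centered images as columns
    lam, V = np.linalg.eigh(A.T @ A)      # M x M instead of d x d
    order = np.argsort(lam)[::-1]         # largest eigenvalue first
    lam, V = lam[order], V[:, order]
    U = A @ V[:, :p]                      # the A v_i are eigenvectors of C
    U /= np.linalg.norm(U, axis=0)        # normalize each column
    return mean, U, lam[:p]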
Sanity Check:
1. Project each (centered) image onto the first principal component, i.e., compute
Φ_i^T A v_1 / ||A v_1||, which is a scalar, for each image Φ_i. (The "1" refers to the fact
that this is the eigenvector with the biggest eigenvalue, i.e., the first principal
component.)
2. If you do this for all M images used to compute the principal components, you
will get a bunch of scalars whose average is 0, and whose standard deviation is
λ_1^0.5. If you don’t get this, something is wrong.
3. Now, divide all of these numbers by λ_1^0.5, and you will have a set of numbers
whose mean is 0 and whose standard deviation is 1. Obviously, if you first
compute v_1* = A v_1 / (||A v_1|| λ_1^0.5), you just have to form the inner products of the
centered images with v_1*. This is the second sanity check.
4. Repeat the above for the first k principal components, and you will have a vector
of k small numbers representing each image, and this is what you use to train
the logistic and softmax regression models.
5. For the holdout and test sets, you should first subtract the mean of the training
set images, and then project them onto the v_i* vectors in the same manner. Now
the mean of the projections (the scalars you get out) over the test and holdout
sets will no longer be 0, but they should be close.
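Continuing the PCA sketch above, the projection might look like the following; whether the eigenvalues also need a 1/M factor depends on how you formed the covariance matrix, so verify against the check in item 2:

import numpy as np

def project(images, mean, U, lam):
    # Project images (rows) onto the scaled components v_i*; on the
    # training set, each column of the result should have mean ~0, std ~1.
    centered = images - mean              # always the TRAINING mean
    return (centered @ U) / np.sqrt(lam)  # divide by lambda_i^0.5

# e.g., Z_train = project(train_images, mean, U, lam)
#       Z_test  = project(test_images,  mean, U, lam)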