CSC413/2516 
Homework 3 - Version 1.3

Submission: You must submit your solutions as a PDF file through MarkUs (https://markus.teach.cs.toronto.edu/csc413-2020-01). You can produce
the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.
See the syllabus on the course website (https://csc413-2020.github.io/assets/misc/syllabus.pdf) for detailed policies. You may ask questions about the
assignment on Piazza (https://piazza.com/class/k58ktbdnt0h1wx?cid=1). Note that 10% of the homework mark (worth 1 pt) may be removed for a
lack of neatness.
The teaching assistants for this assignment are Cem Anil and Sheng Jia. Contact: csc413-2020-01-tas@cs.toronto.edu
Clarifications
All the modifications to the original assignment will be marked with red in the main body of the
assignment. A complete list of the modifications can be found below:
• Question 1: There was a missing square in the first term of the loss function. The missing
square was added.
• Question 1: The target variable t was referred to as Y once; this was corrected.
• Question 2: The y in the generalization error is supposed to be the noisy target rather than
y∗(x), but this should not affect how you solve the question, since the right-hand side (the
bias and variance terms) was correct.
• Question 3: The regression coefficients w1 and w2 were mistakenly referred to as alphas
in the hint of Q3.1.1. This was corrected.
• Question 4: A new hint was added.
1 Weight Decay
Here, we will develop further intuitions on how adding weight decay can influence the solution
space. For a refresher on generalization, please refer to: https://csc413-2020.github.io/assets/readings/L07.pdf.
Consider the following linear regression model with weight decay:
$$J(\hat{w}) = \frac{1}{2n} \|X\hat{w} - t\|_2^2 + \frac{\lambda}{2} \hat{w}^\top \hat{w}$$
where X ∈ R^{n×d}, t ∈ R^n, and ŵ ∈ R^d. Here n is the number of data points and d is the data
dimension. X is the design matrix from HW1.
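For reference (not part of the required solution, just a restatement of what gradient descent uses), the gradient of this objective is
$$\nabla_{\hat{w}} J(\hat{w}) = \frac{1}{n} X^\top (X\hat{w} - t) + \lambda \hat{w},$$
and each gradient descent step moves in the negative of this direction.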
1.1 Underparameterized Model [0pt]
First consider the underparameterized case, d ≤ n. Write down the solution obtained by gradient
descent, assuming training converges. Is the solution unique? If the solution involves inverting a
matrix, explain why that matrix is invertible.
1.2 Overparameterized Model
1.2.1 Warmup: Visualizing Weight Decay [1pt]
Now consider the overparameterized d > n case. We start with a 2D example from HW1, with a
single training example x1 = [2, 1] and t1 = 2. First, 1) draw the solution space of the squared
error on a 2D plane. Then, 2) draw the contour plot of the weight decay term (λ/2) ŵ^⊤ŵ.
Include the plot in the report. Also indicate on the plot where the gradient descent solutions
are with and without weight decay. (Precise drawings are not required for the full mark.)
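If you prefer to generate the figure programmatically rather than by hand, a minimal matplotlib sketch along the following lines would do (the grid range and the value of λ are arbitrary illustration choices):

import numpy as np
import matplotlib.pyplot as plt

lam = 0.5                                  # arbitrary weight-decay strength for the contours
w1 = np.linspace(-1.5, 2.5, 200)
w2 = np.linspace(-1.5, 2.5, 200)
W1, W2 = np.meshgrid(w1, w2)

# Solution space of the squared error for x1 = [2, 1], t1 = 2: all (w1, w2) with 2*w1 + w2 = 2.
plt.plot(w1, 2 - 2 * w1, label="zero training error: 2*w1 + w2 = 2")

# Contours of the weight decay term (lambda/2)*(w1^2 + w2^2): circles centred at the origin.
plt.contour(W1, W2, 0.5 * lam * (W1**2 + W2**2), levels=10)

plt.xlabel("w1"); plt.ylabel("w2"); plt.legend(); plt.axis("equal")
plt.show()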
1.2.2 Gradient Descent and Weight Decay [0pt]
Derive the solution obtained by gradient descent at convergence in the overparameterized case. Is
this the same solution as in Homework 1, Question 3.4.1?
1.3 Adaptive Optimizers and Weight Decay [1pt]
In HW2 Section 1.2, we saw that per-parameter adaptive methods, such as AdaGrad and Adam, do
not converge to the least-norm solution because they move out of the row space of our design matrix
X.
Assume AdaGrad converges to an optimum of the training objective. Does weight decay help
AdaGrad converge to a solution in the row space? Give a brief justification.
(Hint: build intuition from the 2-D toy example.)
2 Ensembles and Bias-variance Decomposition
In the prerequisite CSC311 https://amfarahmand.github.io/csc311/lectures/lec04.pdf, we
have seen the bias-variance decomposition. The following question uses the same notation as taught
in CSC311. As a reminder, the expected generalization error is the following:
$$\underbrace{\mathbb{E}\big[\,|h(x; D) - y|^2\,\big]}_{\text{generalization error}}
= \underbrace{\mathbb{E}\big[\,|\mathbb{E}[h(x; D) \mid x] - y^*(x)|^2\,\big]}_{\text{bias}}
+ \underbrace{\mathbb{E}\big[\,|h(x; D) - \mathbb{E}[h(x; D) \mid x]|^2\,\big]}_{\text{variance}}
+ \underbrace{\mathbb{E}\big[\,|y^*(x) - y|^2\,\big]}_{\text{irreducible error}}$$
where x, y represent the sampled test data and D represents the sampled training dataset. h(·; D)
is our prediction model learned using this dataset (i.e. h(·; D) is the learnt hypothesis given the
training examples). y∗(x) is the true model we want to learn, and E[y|x] = y∗(x). In the following
questions, we are interested in the generalization performance of ensembles.
2.1 Weight Average or Prediction Average?
2.1.1 [1pt]
Does an ensemble of linear models give the same expected generalization error whether it is constructed by weight averaging or by prediction averaging? Provide a mathematical justification.
2.1.2 [0pt]
Does an ensemble of (nonlinear) neural networks give the same expected generalization error whether
it is constructed by weight averaging or by prediction averaging? Provide a mathematical justification.
2.2 Bagging - Uncorrelated Models
One way to construct an ensemble is through bootstrap aggregation, or bagging, which takes a dataset
D and generates k new datasets by sampling from D with replacement. In this question, we assume each generated
dataset Di has the same size as D. We then train a model on each dataset Di, resulting in k models.
The ensemble model is the following:
$$\bar{h}(x; D) = \frac{1}{k} \sum_{i=1}^{k} h(x; D_i)$$
For this part, we will make a very unrealistic assumption that the predictions of the ensemble
members are uncorrelated. That is, Cov(h(x; Dj), h(x; Dk)) = 0 for all j ≠ k.
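As an informal illustration of the bagging construction described above (not required for the submission; the toy dataset, model class, and numbers are arbitrary choices), a minimal NumPy sketch might look like this:

import numpy as np

rng = np.random.default_rng(0)

# Toy dataset D (these particular numbers are for illustration only).
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

k = 10
models = []
for _ in range(k):
    # Bootstrap: sample a dataset D_i of the same size as D, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    X_i, t_i = X[idx], t[idx]
    w_i, *_ = np.linalg.lstsq(X_i, t_i, rcond=None)   # fit one ensemble member h(.; D_i)
    models.append(w_i)

# Ensemble prediction: average the k members' predictions at a test point x.
x_test = rng.normal(size=3)
h_bar = np.mean([x_test @ w for w in models])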
2.2.1 Bias with bagging [0pt]
Show that the ensemble does not change the bias term in the generalization error.
Show that
$$\text{bias} = \mathbb{E}\big[\,|\mathbb{E}[\bar{h}(x; D) \mid x] - y^*(x)|^2\,\big] = \mathbb{E}\big[\,|\mathbb{E}[h(x; D) \mid x] - y^*(x)|^2\,\big]$$
2.2.2 Variance with bagging [1pt]
Assume the variance of a single predictor is σ^2, i.e.
$$\mathbb{E}\big[\,|h(x; D) - \mathbb{E}[h(x; D) \mid x]|^2\,\big] = \sigma^2.$$
Derive the variance term of the ensemble in terms of σ under uncorrelated predictors.
2.3 Bagging - General Case
In practice, there will be correlations among the k predictions of the ensemble members, because the
sampled training datasets are very similar to each other. For simplicity, assume a constant non-zero
pairwise correlation ρ between the ensemble members. The variance of the predictor h(x; Dj) is σ_j^2:
$$\rho = \frac{\mathrm{Cov}\big(h(x; D_j), h(x; D_k)\big)}{\sigma_j \sigma_k} \qquad \forall\, j \neq k$$
2.3.1 Bias under Correlation [1pt]
Does the correlation change the bias term in the generalization error? If so, derive the new expression in terms of ρ. Provide your justification.
2.3.2 Variance under Correlation [0pt]
Assume the variance of a single predictor is σ^2. Derive the variance term of the ensemble in terms of
σ and ρ.
Show that
$$\text{variance} = \mathbb{E}\big[\,|\bar{h}(x; D) - \mathbb{E}[\bar{h}(x; D) \mid x]|^2\,\big] = \Big(\rho + \frac{1-\rho}{k}\Big)\sigma^2$$
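Since the target expression is stated above, a quick numerical sanity check can be helpful; this sketch (with arbitrary choices of ρ, σ, and k) draws correlated predictions and compares the empirical ensemble variance to the given formula:

import numpy as np

rng = np.random.default_rng(0)
k, sigma, rho, n_trials = 5, 2.0, 0.3, 200_000

# Covariance matrix with variance sigma^2 on the diagonal and rho*sigma^2 off-diagonal.
cov = sigma**2 * (rho * np.ones((k, k)) + (1 - rho) * np.eye(k))
preds = rng.multivariate_normal(mean=np.zeros(k), cov=cov, size=n_trials)

ensemble = preds.mean(axis=1)
print(np.var(ensemble))                      # empirical variance of the ensemble prediction
print((rho + (1 - rho) / k) * sigma**2)      # (rho + (1 - rho)/k) * sigma^2 from the formula above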
2.3.3 Intuitions on bagging [1pt]
Based on the derived variance, what happens to the variance when you increase the number of
ensemble models k? What do ρ = 0 and ρ = 1 represent, and what are their consequences for the variance?
3 Generalization and Dropout
In this question, we will investigate the effect of using dropout on generalization.
Let's say we generate triples (X1, X2, Y) according to the following model:
$$X_1 \leftarrow \text{Gaussian}(0, \sigma^2) \tag{3.1}$$
$$Y \leftarrow X_1 + \text{Gaussian}(0, \sigma^2) \tag{3.2}$$
$$X_2 \leftarrow Y + \text{Gaussian}(0, 1) \tag{3.3}$$
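For concreteness, sampling from this structural equation model looks like the following sketch (the sample size n, σ, and the seed are free parameters):

import numpy as np

def sample_dataset(n, sigma, seed=0):
    """Draw n triples (x1, x2, y) from the structural equation model (3.1)-(3.3)."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(0.0, sigma, size=n)       # X1 ~ Gaussian(0, sigma^2); the scale argument is the std, sigma
    y = x1 + rng.normal(0.0, sigma, size=n)   # Y  = X1 + Gaussian(0, sigma^2)
    x2 = y + rng.normal(0.0, 1.0, size=n)     # X2 = Y + Gaussian(0, 1)
    return x1, x2, y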
3.1 Regression Coefficients
We will try to predict Y from (X1, X2) using a linear least-squares predictor. To be specific, we'll
try to minimize
$$J = \frac{1}{2N} \sum_{i=1}^{N} \big(y^{(i)} - \hat{y}^{(i)}\big)^2,$$
where ŷ = w1 x1 + w2 x2 and N is the number of samples in our training set. In the remainder of the question, you can assume that we have infinitely many
samples; in this case, we can write the loss as
$$J = \mathbb{E}_{(x_1, x_2, y) \sim (X_1, X_2, Y)}\big[(y - \hat{y})^2\big].$$
3.1.1 Regress from X1 [0pt]
Find the value of w1 if we’re only using X1 to predict Y . Your answer might depend on σ.
Hints: You may consider taking the following steps:
• Step 1: Write out the error as an expectation over the random variables X1, X2 and Y.
• Step 2: Substitute the equations from the structural equation model into the error expression.
• Step 3: Expand the square and separate the expectations. Many terms cancel due to independence.
• Step 4: Differentiate the final expression with respect to w1 (and w2, where applicable), set it to 0,
and solve for the optimal values of w1 and w2 (a generic version of this step is sketched after this list).
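To illustrate Step 4 in the single-feature case (this is just the generic least-squares stationarity condition, not the worked answer): with ŷ = w1 x1, setting the derivative of the expected loss to zero gives
$$\frac{\partial}{\partial w_1}\,\mathbb{E}\big[(y - w_1 x_1)^2\big] = -2\,\mathbb{E}[x_1 y] + 2\, w_1\,\mathbb{E}[x_1^2] = 0 \;\;\Rightarrow\;\; w_1 = \frac{\mathbb{E}[x_1 y]}{\mathbb{E}[x_1^2]},$$
after which the model equations (3.1)-(3.2) let you express these expectations in terms of σ.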
3.1.2 Regress from X2 [1pt]
Find the value of w2 if we’re only using X2 to predict Y . Your answer might depend on σ.
3.1.3 Regress from (X1, X2) [1pt]
1) Find (w1, w2), in terms of σ, if we're using (X1, X2) to predict Y. 2) Comment on whether this maximum
likelihood solution will generalize well at test time if σ changes.
(Hint: think about the causal relationships of the random variables. What happens if σ becomes
smaller at test time?)
3.1.4 Different σs. [0pt]
Now assume half of the dataset was sampled using σ = σ1 and the other half was sampled using
σ = σ2. How do the coefficients computed in 3.1.3 change?
3.2 Dropout as Data-Dependent L2 Regularization
Let's say we now apply dropout to X1 and X2. That is, we now have ŷ = 2(m1 w1 x1 + m2 w2 x2),
where m1 and m2 are i.i.d. Bernoulli random variables with expectation 1/2. Note that since the dropout
mask is a random variable, the output of our model is also a random variable. You may find Slide 21
in Lecture 7 (https://csc413-2020.github.io/assets/readings/L07.pdf) helpful in answering
this question. Question 3.2.1 is a no-credit review question whose answer can be found directly in the slides.
3.2.1 Expectation and variance of predictions [0pt]
Show that E[ŷ] = w1 x1 + w2 x2 and Var[ŷ] = w1^2 x1^2 + w2^2 x2^2.
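A quick Monte Carlo check of these two identities can be run as follows (the particular values of w1, w2, x1, x2 are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
w1, w2, x1, x2 = 0.7, -1.3, 2.0, 0.5          # arbitrary fixed weights and inputs
m = rng.integers(0, 2, size=(1_000_000, 2))   # i.i.d. Bernoulli(1/2) dropout masks

y_hat = 2 * (m[:, 0] * w1 * x1 + m[:, 1] * w2 * x2)

print(y_hat.mean(), w1 * x1 + w2 * x2)                   # E[y_hat]   vs w1*x1 + w2*x2
print(y_hat.var(), (w1 * x1) ** 2 + (w2 * x2) ** 2)      # Var[y_hat] vs w1^2*x1^2 + w2^2*x2^2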
3.3 Effect of Dropout [1pt]
If we're using both (X1, X2) to predict Y, how does applying dropout change the optimal parameters? Will this solution generalize better than the one in 3.1.3? Why?
(Hint: try directly minimizing the expected loss under dropout; see the last equation on Slide 21 of
https://csc413-2020.github.io/assets/readings/L07.pdf.)
4 Hard-Coding Recurrent Neural Networks [1pt]
Consider solving the task of binary addition. We can train a feedforward net to do this, but there
are some obvious limitations in the MLP solution. Namely, 1) we must decide in advance the
maximum number of digits in each number, and 2) the processing applied to the beginning of a long
number does not generalize to the end of the long number because it uses different weights. As a
result, it is difficult for an MLP to learn to generalize beyond the fixed-length training examples in
the dataset.
In this problem, you need to find a set of weights and biases for a recurrent neural network
which adds two lists of numbers. At each time step t, the network receives two binary inputs,
x_1^t, x_2^t ∈ {0, 1}, and produces a sigmoid output at each time step. Please design a recurrent neural
network that correctly implements binary addition for any sequence length of binary inputs.
(Hint: You might want to search up ‘binary full adder’ in Google for inspiration.)
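The hint refers to the per-timestep logic of a binary full adder; a plain-Python reference of that logic (not the RNN weights you are asked to find) is sketched below. An RNN solving this task needs its hidden state to play the role of the carry:

def full_adder(x1, x2, carry):
    """One step of binary addition: returns (sum_bit, next_carry) for input bits x1, x2 and the incoming carry."""
    sum_bit = x1 ^ x2 ^ carry                              # output bit at this time step
    next_carry = (x1 & x2) | (x1 & carry) | (x2 & carry)   # majority of the three inputs
    return sum_bit, next_carry

def add_bit_sequences(a_bits, b_bits):
    """Add two equal-length bit lists, least-significant bit first, the way the RNN would process them."""
    carry, out = 0, []
    for x1, x2 in zip(a_bits, b_bits):
        s, carry = full_adder(x1, x2, carry)
        out.append(s)
    return out + [carry]

# Example: 3 (= [1, 1, 0] LSB-first) + 6 (= [0, 1, 1] LSB-first) = 9 (= [1, 0, 0, 1] LSB-first)
print(add_bit_sequences([1, 1, 0], [0, 1, 1]))   # [1, 0, 0, 1]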