CS7643: Deep Learning
Problem Set 0 (25 points)
Instructions
1. We will be using Gradescope to collect your assignments. Please read the following instructions
for submitting to Gradescope carefully! Failure to follow these instructions may result in parts
of your assignment not being graded. We will not entertain regrading requests for failure to
follow instructions.
• For Section 1: Multiple Choice Questions, it is mandatory to use the LaTeX template
provided on the class webpage (https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/assets/ps0.zip).
For every question, there is only one correct answer. To mark the correct answer,
change \choice to \CorrectChoice.
• For Section 2: Proofs, each problem/sub-problem is on its own page. This section has 5
problems/sub-problems in total, so you should have 5 pages corresponding to this section.
Your answer to each sub-problem should fit on its corresponding page.
• For Section 2, LaTeX'd solutions are strongly encouraged (solution template available at
https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/assets/ps0.zip), but scanned
handwritten copies are acceptable. If you scan handwritten copies, please make sure to
append them to the PDF generated by LaTeX for Section 1.
2. Hard copies are not accepted.
Exception to the collaboration policy: PS0 is meant to serve as a background preparation
test. You must NOT collaborate on PS0.
1 Multiple Choice Questions
1. (1 point) true/false We are machine learners with a slight gambling problem (very different
from gamblers with a machine learning problem!). Our friend, Bob, is proposing the following
payout on the roll of a die:

payout = \begin{cases} \$1 & x = 1 \\ -\$1/4 & x \neq 1 \end{cases}    (1)

where x ∈ {1, 2, 3, 4, 5, 6} is the outcome of the roll, (+) means payout to us and (−) means
payout to Bob. Is this a good bet, i.e., are we expected to make money?
True False
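Optional sanity check (not part of the graded answer): assuming a fair six-sided die, the expected payout can be computed directly from the definition of expectation, as in the short Python sketch below.

\begin{verbatim}
# Expected payout of Bob's bet, assuming a fair six-sided die
# (each outcome has probability 1/6). Positive means we expect to make money.
payout = {x: (1.0 if x == 1 else -0.25) for x in range(1, 7)}
expected = sum(p / 6.0 for p in payout.values())
print(f"expected payout per roll: {expected:+.4f} dollars")
\end{verbatim}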
2. (1 point) X is a continuous random variable with the probability density function:

p(x) = \begin{cases} 4x & 0 \le x \le 1/2 \\ -4x + 4 & 1/2 \le x \le 1 \end{cases}    (2)

Which of the following statements are true about the corresponding cumulative distribution
function (CDF) C(x)?
[Hint: Recall that the CDF is defined as C(x) = \Pr(X \le x).]
C(x) = 2x^2 for 0 \le x \le 1/2
C(x) = -2x^2 + 4x - 3/2 for 1/2 \le x \le 1
All of the above
None of the above
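Optional sanity check: one way to test a candidate CDF expression is to integrate the density numerically and compare at a few points. The sketch below assumes SciPy is available.

\begin{verbatim}
# Numerically integrate p(x) to get C(x) = Pr(X <= x), then compare against
# the closed-form candidates from the answer choices at a few sample points.
from scipy.integrate import quad

def p(x):
    return 4 * x if x <= 0.5 else -4 * x + 4

def cdf_numeric(x):
    if x <= 0.5:
        return quad(p, 0.0, x)[0]
    return quad(p, 0.0, 0.5)[0] + quad(p, 0.5, x)[0]   # split at the kink

for x in (0.25, 0.5, 0.75, 1.0):
    print(f"x={x:.2f}  C(x)~{cdf_numeric(x):.4f}  "
          f"2x^2={2 * x**2:.4f}  -2x^2+4x-3/2={-2 * x**2 + 4 * x - 1.5:.4f}")
\end{verbatim}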
3. (2 points) A random variable x with the standard normal distribution has the following
probability density:

p(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}    (3)

Evaluate the following integral:

\int_{-\infty}^{\infty} p(x)\,(ax^2 + bx + c)\,dx    (4)

[Hint: We are not sadistic (okay, we're a little sadistic, but not for this question). This is not
a calculus question.]
a + b + c
c
a + c
b + c
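Optional sanity check: the integral can be evaluated numerically for a few arbitrary choices of a, b, c and compared against the answer choices. The sketch below assumes NumPy and SciPy.

\begin{verbatim}
# Evaluate the integral of p(x)(ax^2 + bx + c) for the standard normal density
# numerically, for arbitrary coefficients, and compare with the choices.
import numpy as np
from scipy.integrate import quad

def integrand(x, a, b, c):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi) * (a * x**2 + b * x + c)

for a, b, c in [(1.0, 2.0, 3.0), (0.5, -1.0, 2.0)]:
    val = quad(integrand, -np.inf, np.inf, args=(a, b, c))[0]
    print(f"a={a}, b={b}, c={c}: integral={val:.4f}  "
          f"a+b+c={a + b + c}, c={c}, a+c={a + c}, b+c={b + c}")
\end{verbatim}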
4. (2 points) Consider the following function of x = (x_1, x_2, x_3, x_4, x_5, x_6):

f(x) = \sigma\left( \log\left( 5 \left( \frac{\max\{x_1, x_2\} \cdot x_3}{x_4} - (x_5 + x_6) \right) \right) + \frac{1}{2} \right)    (5)

where \sigma is the sigmoid function

\sigma(x) = \frac{1}{1 + e^{-x}}    (6)

Compute the gradient \nabla_x f(\cdot) and evaluate it at \hat{x} = (5, -1, 6, 12, 7, -5).
(0.157, 0.0, 0.131, -0.065, -0.846, -0.846)
(0.157, 0, 0.131, -0.065, -0.314, -0.314)
(0.031, 0, 0.026, -0.013, -0.062, -0.062)
(-0.468, 0, -0.390, 0.195, 0.937, 0.937)
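Optional sanity check: a central finite-difference approximation of the gradient at \hat{x} can be compared against a hand-computed answer. The sketch below assumes the reading of Eq. (5) given above; adjust f if your parsing of the expression differs.

\begin{verbatim}
# Finite-difference check of the gradient of f at x_hat = (5, -1, 6, 12, 7, -5).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(x):
    x1, x2, x3, x4, x5, x6 = x
    inner = max(x1, x2) * x3 / x4 - (x5 + x6)
    return sigmoid(np.log(5 * inner) + 0.5)

x_hat = np.array([5.0, -1.0, 6.0, 12.0, 7.0, -5.0])
eps = 1e-6
grad = np.zeros_like(x_hat)
for i in range(x_hat.size):
    e = np.zeros_like(x_hat)
    e[i] = eps
    grad[i] = (f(x_hat + e) - f(x_hat - e)) / (2 * eps)
print(np.round(grad, 3))
\end{verbatim}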
5. (2 points) Which of the following functions are convex?
\|x\|_{1/2}
\min_i a_i^T x for x \in \mathbb{R}^n
\log(1 + \exp(w^T x_i)) for w \in \mathbb{R}^d
All of the above
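Optional sanity check: the definition of convexity, f(tx + (1−t)y) ≤ tf(x) + (1−t)f(y), can be spot-checked numerically. A single violation disproves convexity; finding none is only suggestive, not a proof. The sketch below takes \|x\|_{1/2} to mean (\sum_i |x_i|^{1/2})^2 and fixes arbitrary a_i and x_i; both are assumptions made purely for illustration.

\begin{verbatim}
# Random spot-check of the convexity inequality for each candidate function.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 5))     # arbitrary a_i (rows) for min_i a_i^T x
xi = rng.normal(size=5)         # arbitrary fixed data point x_i

candidates = {
    "||x||_{1/2}": lambda x: np.sum(np.sqrt(np.abs(x))) ** 2,
    "min_i a_i^T x": lambda x: np.min(A @ x),
    "log(1 + exp(w^T x_i))": lambda w: np.log1p(np.exp(w @ xi)),
}

for name, f in candidates.items():
    violated = False
    for _ in range(10000):
        x, y = rng.normal(size=5), rng.normal(size=5)
        t = rng.uniform()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-9:
            violated = True
            break
    print(f"{name}: {'violation found (not convex)' if violated else 'no violation found'}")
\end{verbatim}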
6. (2 points) Suppose you want to predict an unknown value Y ∈ R, but you are only given a
sequence of noisy observations x_1, ..., x_n of Y with i.i.d. noise (x_i = Y + ε_i). If we assume
the noise is i.i.d. Gaussian (ε_i ∼ N(0, σ^2)), the maximum likelihood estimate for Y can be
given by:
A: \hat{y} = \arg\min_y \sum_{i=1}^{n} (y - x_i)^2
B: \hat{y} = \arg\min_y \sum_{i=1}^{n} |y - x_i|
C: \hat{y} = \frac{1}{n} \sum_{i=1}^{n} x_i
Both A & C
Both B & C
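Optional sanity check: for synthetic data, the minimizer in choice A can be found numerically and compared with the sample mean in choice C. The sketch below assumes SciPy; the true value 3.7 and the noise scale are arbitrary.

\begin{verbatim}
# Compare the numerical minimizer of sum_i (y - x_i)^2 with the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
Y_true = 3.7
x = Y_true + rng.normal(scale=0.5, size=100)   # x_i = Y + eps_i

res = minimize_scalar(lambda y: np.sum((y - x) ** 2))
print(f"argmin of squared error: {res.x:.4f}")
print(f"sample mean:             {x.mean():.4f}")
\end{verbatim}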
2 Proofs
7. (3 points) Prove that
\log_e x \le x - 1, \quad \forall x > 0    (7)
with equality if and only if x = 1.
[Hint: Consider differentiation of log(x) − (x − 1) and think about concavity/convexity and
second derivatives.]
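Optional sanity check, not a substitute for the proof: evaluating log(x) − (x − 1) on a grid of positive x suggests where its maximum lies.

\begin{verbatim}
# Evaluate g(x) = log(x) - (x - 1) on a grid of positive x; its maximum should
# be (numerically) 0, attained at x = 1.
import numpy as np

x = np.linspace(0.01, 10.0, 1000)   # grid includes x = 1
g = np.log(x) - (x - 1)
print(f"max of g(x): {g.max():.6f} at x = {x[np.argmax(g)]:.2f}")
\end{verbatim}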
8. (6 points) Consider two discrete probability distributions p and q over k outcomes:
\sum_{i=1}^{k} p_i = \sum_{i=1}^{k} q_i = 1    (8a)

p_i > 0, \; q_i > 0, \quad \forall i \in \{1, \ldots, k\}    (8b)
The Kullback-Leibler (KL) divergence (also known as the relative entropy) between these
distributions is given by:
KL(p, q) = \sum_{i=1}^{k} p_i \log\left( \frac{p_i}{q_i} \right)    (9)
It is common to refer to KL(p, q) as a measure of distance (even though it is not a proper metric). Many algorithms in machine learning are based on minimizing KL divergence between
two probability distributions. In this question, we will show why this might be a sensible
thing to do.
[Hint: This question doesn’t require you to know anything more than the definition of KL(p, q)
and the identity in Q7]
(a) Using the results from Q7, show that KL(p, q) is always non-negative.
(b) When is KL(p, q) = 0?
(c) Provide a counterexample to show that the KL divergence is not a symmetric function
of its arguments: KL(p, q) \neq KL(q, p).
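Optional sanity check for parts (a)–(c): computing KL on a few random distributions illustrates the sign, the p = q case, and the asymmetry. The sketch below draws Dirichlet samples so that all entries are strictly positive, as required by (8b).

\begin{verbatim}
# KL(p, q) for random distributions: check the sign, the value at p == q, and
# whether swapping the arguments changes the result.
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

k = 6
for _ in range(5):
    p, q = rng.dirichlet(np.ones(k)), rng.dirichlet(np.ones(k))
    print(f"KL(p,q)={kl(p, q):.4f}  KL(q,p)={kl(q, p):.4f}  KL(p,p)={kl(p, p):.4f}")
\end{verbatim}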
9. (6 points) In this question, you will prove that the cross-entropy loss for a softmax classifier
is convex in the model parameters, and thus gradient descent is guaranteed to find the optimal
parameters. Formally, consider a single training example (x, y). Simplifying the notation
slightly from the implementation writeup, let
z = Wx + b,    (10)

p_j = \frac{e^{z_j}}{\sum_k e^{z_k}},    (11)

L(W) = -\log(p_y)    (12)
Prove that L(·) is convex in W.
[Hint: One way of solving this problem is “brute force” with first principles and Hessians.
There are more elegant solutions.]
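Optional sanity check, not a substitute for the proof: the convexity inequality can be spot-checked for random parameter matrices on a random example. The dimensions, the label index, and the omission of the bias term below are arbitrary choices made for illustration.

\begin{verbatim}
# Spot-check L(t*W1 + (1-t)*W2) <= t*L(W1) + (1-t)*L(W2) for the softmax
# cross-entropy loss of a single example (x, y), with b omitted for brevity.
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_features = 4, 3
x = rng.normal(size=n_features)
y = 2                                    # arbitrary class label

def loss(W):
    z = W @ x
    z = z - z.max()                      # stabilize the softmax
    return -(z[y] - np.log(np.sum(np.exp(z))))

violations = 0
for _ in range(10000):
    W1, W2 = rng.normal(size=(2, n_classes, n_features))
    t = rng.uniform()
    if loss(t * W1 + (1 - t) * W2) > t * loss(W1) + (1 - t) * loss(W2) + 1e-9:
        violations += 1
print(f"convexity violations found: {violations}")
\end{verbatim}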