CSE 512: Midterm review
1. Nomenclature. Define the following terms:
(a) model vs loss function vs data
(b) prior vs posterior vs likelihood
(c) MLE vs MAP
(d) Expected risk vs empirical risk
(e) prediction vs forecasting
(f) PAC learning
(g) biased vs unbiased estimator
(h) Bayes risk, minimax risk
2. Linear models for classification
(a) Given the following datasets, would you propose a linear, generalized linear, or neither model to predict y
given x? Justify your answer.
i. Temperature (x): 10°F, 25°F, 50°F, 70°F, 100°F
Wear a sweater? (y): yes (+1), yes (+1), no (-1), no (-1), no (-1)
ii. Time of day (x): 7h00, 10h00, 12h00, 15h00, 20h00
Wear a sweater? (y): yes (+1), no (-1), no (-1), no (-1), yes (+1)
iii. Chance of earthquake (x): 0.01%, 0.1%, 1%, 10%, 100%
Wear a sweater? (y): yes (+1), no (-1), yes (+1), no (-1), no (-1)
iv. Commute speed (km/hr) (x): walking (1), jogging (5), biking (20), horse and buggy (30), driving (60)
Wear a sweater? (y): yes (+1), no (-1), yes (+1), no (-1), no (-1)
(b) For the task of predicting whether you should wear a sweater, using the features from the previous problem,
propose a way of using logistic regression to construct a model. Construct some fake data for yourself, train
the model, and use it to predict whether you should wear a sweater in the following scenarios (a starter sketch follows this list):
i. Warm day (70°F), early morning (8h00), no chance of earthquake (0%), bringing the horse and buggy
ii. Cold day (20°F), late afternoon (16h00), medium chance of earthquake (25%), jogging
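One possible starting point, as a minimal sketch: the training data below is fabricated for illustration, and scikit-learn's LogisticRegression is just one convenient way to fit the model (any logistic-loss minimizer would do). Since the time-of-day effect is non-monotone, a real solution would likely add engineered features, e.g. the distance of the hour from noon.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [temperature (°F), hour of day, earthquake chance (%), commute speed (km/hr)]
# All rows below are invented for illustration only.
X = np.array([
    [10,   7,   0.01,  1],
    [25,  10,   0.1,   5],
    [50,  12,   1.0,  20],
    [70,  15,  10.0,  30],
    [100, 20, 100.0,  60],
    [20,   8,   0.0,   5],
    [90,  13,   5.0,  40],
])
y = np.array([+1, +1, -1, -1, -1, +1, -1])  # +1 = wear a sweater

model = LogisticRegression(max_iter=1000).fit(X, y)

# The two scenarios from the problem:
scenarios = np.array([
    [70,  8,  0.0, 30],  # i. warm, early morning, no earthquake, horse and buggy
    [20, 16, 25.0,  5],  # ii. cold, late afternoon, medium earthquake risk, jogging
])
print(model.predict(scenarios))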
3. Linear regression
(a) Describe a convex loss function which, when minimized, returns the θ that maximizes the likelihood of the following
model:
$$y_i - x_i^T \theta \sim \mathcal{N}(\mu_1, \sigma_1), \qquad \theta \sim \mathcal{N}(\mu_2, \Sigma_2)$$
where $\mu_1, \sigma_1$ are scalars, $\mu_2$ is a vector, and $\Sigma_2$ is a PSD matrix; all four of these constants are known.
(b) A is a symmetric positive semidefinite matrix, and the condition number of A is $\bar{\kappa}$. The maximum eigenvalue
of A is $\lambda_{\max}$. Write an expression κ(ρ) that returns the condition number of A + ρI.
(c) $B = XX^T$ and the singular value decomposition of X is
$$X = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & 0 \\ -\tfrac{1}{\sqrt{2}} & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \\ -\tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \end{bmatrix}.$$
Write an expression κ(ρ) that returns the condition number of B + ρI. What happens when ρ = 0?
(d) $C = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$. Write an expression κ(ρ) that returns the condition number of C + ρI.
(e) Consider the linear regression problem
$$\underset{x}{\text{minimize}} \quad f(x) = \|Ax - b\|_2^2 + \rho \|x\|_2^2$$
i. First consider ρ = 0. What are the normal equations? Write them down.
ii. Now consider general ρ. Write down the normal equations again, and describe how they relate to the
gradient of the objective.
iii. Describe how you would solve for the solution x when
A. A is a wide matrix (more columns than rows) and ρ = 0
B. A is a tall matrix (more rows than columns) and has full column rank, and ρ = 0
iv. Is either case made easier if ρ > 0?
v. Write down the condition number for f(x).
(A small numpy sketch of parts i–iv appears below.)
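As a companion to parts i–iv, here is a minimal numpy sketch; A, b, and ρ are random stand-ins, and the three solve strategies shown are standard choices rather than the only valid ones.

import numpy as np

rng = np.random.default_rng(0)
A_tall = rng.standard_normal((10, 3))   # tall, full column rank
A_wide = rng.standard_normal((3, 10))   # wide (more columns than rows)
b_tall = rng.standard_normal(10)
b_wide = rng.standard_normal(3)
rho = 0.1

# Tall, rho = 0: solve the normal equations A^T A x = A^T b.
x_tall = np.linalg.solve(A_tall.T @ A_tall, A_tall.T @ b_tall)

# Wide, rho = 0: infinitely many solutions; the minimum-norm one
# is x = A^T (A A^T)^{-1} b.
x_wide = A_wide.T @ np.linalg.solve(A_wide @ A_wide.T, b_wide)

# General rho > 0: (A^T A + rho I) is always nonsingular, so both
# shapes can be handled the same way.
n = A_wide.shape[1]
x_ridge = np.linalg.solve(A_wide.T @ A_wide + rho * np.eye(n),
                          A_wide.T @ b_wide)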
(f) Suppose that I was given a set of feature/label pairs $x_i, y_i$ for $i = 1, \ldots, m$, and I wish to do linear regression
to find a model that, for a new vector x, well-approximates its corresponding value y.
But, for hardware reasons, I cannot hold onto the values $x_i$; instead, I hold onto the Fourier transform of $x_i$;
that is, I have access to $u_i = F x_i$ where F is the DFT matrix.
The DFT matrix is in general very efficient to apply (it doesn't require a full matrix-vector multiplication). Additionally, it is a unitary matrix, so that $F F^T = F^T F = I$.
i. Describe the inverse DFT matrix; e.g., what does $F^{-1}$ look like?
ii. Write down the least squares system we need to solve such that we retrieve $u = Fx$, where $Ax \approx b$. This
system should look like
$$\underset{u}{\text{minimize}} \quad \|\hat{A}u - \hat{b}\|_2^2.$$
What are $\hat{A}$ and $\hat{b}$?
iii. Given κ, the condition number of $A^T A$, what is the condition number of $\hat{A}^T \hat{A}$? (A numerical sanity check follows.)
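A quick numerical sanity check of the unitarity claim and its consequence for conditioning; this sketch assumes scipy's dft helper with scale='sqrtn' to build the unitary DFT, and uses a random A as a stand-in. Note the matrix is complex, so the conjugate transpose plays the role of $F^T$.

import numpy as np
from scipy.linalg import dft

n = 8
F = dft(n, scale='sqrtn')                      # unitary DFT matrix
print(np.allclose(F @ F.conj().T, np.eye(n)))  # True: F F^H = I

A = np.random.default_rng(1).standard_normal((20, n))
A_hat = A @ F.conj().T                         # if u = F x, then A x = (A F^{-1}) u
# A unitary change of variables preserves singular values, hence conditioning:
print(np.linalg.cond(A.T @ A))
print(np.linalg.cond(A_hat.conj().T @ A_hat))  # same value, up to roundoff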
4. Binary classification
(a) Consider the following dataset
x[1] x[2] y
-1.0 -1.0 +1
-0.25 2.0 -1
-2.0 -0.25 +1
-0.5 0.5 -1
0.5 -1.25 -1
0.25 2.0 +1
3.0 -0.25 -1
2.5 1.0 +1
i. In words, can you describe the rule being used to generate y from $x \in \mathbb{R}^2$? (Hint: start by plotting the
points; a plotting sketch appears after this problem.)
ii. Is this dataset linearly separable? Why or why not?
iii. Propose a generalized linear model using second-order polynomial features (highest degree 2) that can
separate this data. For this model, compute the margin for each datapoint and report the minimum
margin.
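For the plotting hint in part i, a minimal matplotlib sketch (the data is copied verbatim from the table above):

import matplotlib.pyplot as plt

pts = [(-1.0, -1.0, +1), (-0.25, 2.0, -1), (-2.0, -0.25, +1), (-0.5, 0.5, -1),
       (0.5, -1.25, -1), (0.25, 2.0, +1), (3.0, -0.25, -1), (2.5, 1.0, +1)]
for x1, x2, y in pts:
    # red plus for y = +1, blue circle for y = -1
    plt.scatter(x1, x2, c='r' if y > 0 else 'b', marker='+' if y > 0 else 'o')
plt.xlabel('x[1]')
plt.ylabel('x[2]')
plt.show()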
5. Gradient descent. Consider the following generalized loss function
$$f(\theta) = \frac{1}{m} \sum_{i=1}^{m} g(y_i x_i^T \theta)$$
where $g : \mathbb{R} \to \mathbb{R}$ is convex and differentiable everywhere, and $\theta \in \mathbb{R}^n$.
(a) What are the gradient and Hessian of f? What are their dimensions?
(b) Would f be convex if g were not convex? If g were concave? Why or why not?
(c) What would be good qualities to impose on g such that minimizing f produces a margin maximizing classifier?
(d) Now consider $F(\theta) = f(\theta) + \rho\|\theta\|_2^2$ for some ρ > 0. What is the condition number of F (in terms of properties
of $x_i$, $y_i$, and g)?
(e) Write out pseudocode implementing gradient descent for both the regularized and unregularized forms. Specifically, fill in the gaps, and include how you would go about computing a step size given $y_i$, $x_i$, and g. (One possible solution is sketched after the stubs below.)
def grad_desc_unreg(X, y):
    # <fill me in>
    return theta

def grad_desc_reg(X, y, rho):
    # <fill me in>
    return theta

Assume that you have access to functions on g, namely

def g(theta):
    # <computes z = g(theta)>
    return z

def g_grad(theta):
    # <computes zp = g'(theta)>
    return zp

def g_hess(theta):
    # <computes zpp = g''(theta)>
    return zpp
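One possible shape of a solution, as a hedged sketch: it assumes g_grad and g_hess accept numpy arrays elementwise, and it chooses the step size 1/L from the smoothness bound L = max_i g''(z_i) · λmax(XᵀX)/m (one standard choice, not the only one).

import numpy as np

def grad_desc_unreg(X, y, iters=500):
    m, n = X.shape
    theta = np.zeros(n)
    lam = np.linalg.norm(X, 2) ** 2 / m        # λmax(XᵀX)/m
    for _ in range(iters):
        z = y * (X @ theta)                    # z_i = y_i x_i^T θ
        grad = X.T @ (y * g_grad(z)) / m       # ∇f(θ) = (1/m) Σ g'(z_i) y_i x_i
        L = np.max(g_hess(z)) * lam            # bound on the Hessian norm
        theta = theta - grad / L
    return theta

def grad_desc_reg(X, y, rho, iters=500):
    m, n = X.shape
    theta = np.zeros(n)
    lam = np.linalg.norm(X, 2) ** 2 / m
    for _ in range(iters):
        z = y * (X @ theta)
        grad = X.T @ (y * g_grad(z)) / m + 2 * rho * theta
        L = np.max(g_hess(z)) * lam + 2 * rho  # regularizer adds 2ρ to curvature
        theta = theta - grad / L
    return theta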
6. Margins. I have 6 datapoints, plotted below.
[Figure: scatter plot of the six labeled points A–F.]
You should interpret each feature vector as
$$x_A = \begin{bmatrix} 0 \\ 3 \end{bmatrix}, \quad x_B = \begin{bmatrix} -3 \\ 1 \end{bmatrix}, \quad x_C = \begin{bmatrix} -2 \\ -1 \end{bmatrix}, \quad x_D = \begin{bmatrix} 3 \\ 1 \end{bmatrix}, \quad x_E = \begin{bmatrix} 2 \\ -1 \end{bmatrix}, \quad x_F = \begin{bmatrix} 3 \\ 0 \end{bmatrix}.$$
The labels correspond to blue points (−1) and red points (+1), e.g.
$$y_A = y_B = y_C = -1, \qquad y_D = y_E = y_F = +1.$$
(a) Draw some decision boundaries. On the plot above, draw a line (solid) corresponding to the set
$$S_1 = \{x : x^T \theta_1 = 0\} \quad \text{where} \quad \theta_1 = \begin{bmatrix} 2 \\ -1 \end{bmatrix}.$$
Also, draw a line (dashed) corresponding to the set
$$S_2 = \{x : x^T \theta_2 = 0\} \quad \text{where} \quad \theta_2 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}.$$
(b) Fill in the table below with each datapoint's margin, i.e. the distance from each feature vector to the decision boundary (a computational sketch follows the table):
    dist to S1    dist to S2
A
B
C
D
E
F
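A small numpy sketch for filling in the table, using the point-to-line distance $|x^T\theta| / \|\theta\|_2$ (dropping the absolute value would give the signed distance instead):

import numpy as np

points = {'A': (0, 3), 'B': (-3, 1), 'C': (-2, -1),
          'D': (3, 1), 'E': (2, -1), 'F': (3, 0)}
thetas = [np.array([2, -1]), np.array([1, -2])]  # θ1 defines S1, θ2 defines S2

for name, x in points.items():
    x = np.array(x)
    dists = [abs(x @ th) / np.linalg.norm(th) for th in thetas]
    print(name, [round(d, 3) for d in dists])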
(c) I use the usual linear predictor to deal with new points:
$$y = \operatorname{sign}(\theta^T x)$$
i. If I pick θ = θ1, which points (A, B, C, D, E, F) are my support vectors? What is this predictor's minimum
margin?
ii. If I pick θ = θ2, which points (A, B, C, D, E, F) are my support vectors? What is this predictor's minimum
margin?
iii. Which choice of θ maximizes the minimum margin?
(d) Argue that θ = θ1 is in fact the optimal margin-maximizing choice. Do this in two steps:
i. Argue that changing the norm $\|\theta\|_2$ does not affect the minimum margin.
ii. Argue that changing the rotation of θ (e.g. the ratio θ[1]/θ[2]) will always reduce the margin to one or more of the
support vectors of θ1.
7. Support vector machines. A common depiction of the SVM problem formulation is
$$\begin{aligned} \underset{\theta \in \mathbb{R}^n,\ s \in \mathbb{R}^m}{\text{minimize}} \quad & \|\theta\|_2^2 + \lambda \sum_{i=1}^{m} \max\{0, s_i\} \\ \text{subject to} \quad & y_i x_i^T \theta = 1 - s_i \end{aligned} \tag{1}$$
Suppose I solve (1) and I receive some optimal solution $\theta^*, s^*$.
(a) What is the margin of the classifier I received?
(b) Suppose m = 100 and n = 25. For my values of $s^*$, 34 of them are nonzero, and 66 of them are 0. What is an
upper bound on how many misclassified training points there are?
(c) Would you expect my number of misclassified points to increase or decrease if I increase/decrease λ?
(d) A cat walks across my keyboard and accidentally deletes $\theta^*$. Am I screwed? Can I recover $\theta^*$?
8. Convex sets. Decide if the following sets are convex. Either prove they are, or provide a counterexample to show
they are not.
(a) $\operatorname{null}(A) = \{x : Ax = 0\}$ for some matrix A
(b) The set of intervals $[a, b] = \{x : a \le x \le b\}$
(c) The set of vectors $\{x \in \mathbb{R}^n : \prod_i x[i] = 0\}$
9. Convex functions. Decide if the following functions are convex. Either prove they are, or provide a counterexample
to show they are not.
(a) $f(x) = x^p$ for $p = 1, 2, 3, \ldots$
(b) $f(x) = x^p$ for $p = 1, 2, 3, \ldots$ over the domain $x \ge 0$
(c) $f(x) = \sqrt{|x|}$
(d) $f(x) = \sigma(x)$ where $\sigma(x) = 1/(1 + e^{-x})$
(e) $f(x) = \log(x)$
(f) $f(x) = \log(\sigma(x))$ (for the same definition of σ)
(g) $f(x) = E_y[f(x, y)]$ where f is convex over x, for fixed y
10. Optimality. Classify $x^*$ as a local min, local max, global min, global max, saddle point, stationary point, or none.
(More than one may apply.) Justify your answer.
(a) $f(x) = x^2$ and $f'(x^*) = 0$
(b) $f(x)$ is convex and $f'(x^*) = 0$
(c) $-f(x)$ is convex and $f'(x^*) = 0$
11. Point estimation (use of a simple calculator permitted). Recall Hoeffding's inequality
$$\Pr\left( \frac{1}{m} \sum_{i=1}^{m} x_i - E[X] \ge \epsilon \right) \le \exp(-2m\epsilon^2).$$
Suppose I have a fair coin (50% chance of getting heads or tails) and I flip the coin m times.
(a) How many flips until I am 90% certain that between 25% and 75% of the flips are heads?
(b) If I flip the coin 100 times, how certain am I that between 45 and 55 flips are tails?
(c) My boss does not believe that the coin is fair, and tells me to keep flipping the coin until whatever I report
carries at most 1% uncertainty. I flip until I get carpal tunnel syndrome, which takes about 234 flips, of which 113 are heads.
What range of values for the heads/tails weighting can I report and still guarantee with 99% certainty that it contains the real weighting of the
coin? (A numerical sketch follows.)
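A hedged numerical sketch for grinding through these; it assumes the two-sided variant $\Pr(|\bar{x} - E[X]| \ge \epsilon) \le 2\exp(-2m\epsilon^2)$, obtained by applying the one-sided inequality above in both directions.

import numpy as np

def two_sided_bound(m, eps):
    # Pr(|empirical mean - true mean| >= eps) <= 2 exp(-2 m eps^2)
    return 2 * np.exp(-2 * m * eps ** 2)

# (a) smallest m with failure probability <= 0.1 at eps = 0.25
m = 1
while two_sided_bound(m, 0.25) > 0.1:
    m += 1
print(m)

# (b) m = 100, eps = 0.05: confidence the bound certifies
print(1 - two_sided_bound(100, 0.05))  # negative: the bound is vacuous at this m and eps

# (c) m = 234: solve 2 exp(-2 m eps^2) = 0.01 for eps, then report the interval
eps = np.sqrt(np.log(2 / 0.01) / (2 * 234))
print(113 / 234 - eps, 113 / 234 + eps)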
12. Biased or unbiased? In the previous homework, you worked with the exponential distribution, defined by
$$p_\lambda(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{else.} \end{cases}$$
In particular, you showed that, given samples $x_1, \ldots, x_m$ drawn i.i.d. from this distribution, $\hat{\theta} = \frac{1}{m} \sum_{i=1}^m x_i$
serves as both the maximum likelihood and an unbiased estimator of the true mean, $1/\lambda$.
(a) Derive the MLE for λ.
(b) Show that this MLE is biased. Hint: f(x) = 1/x is a strictly convex function whenever x > 0.
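As an empirical companion to part (b), a small Monte Carlo sketch (the choices λ = 2 and m = 5 are arbitrary) showing that the plug-in estimator $1/\bar{x}$ overshoots λ on average:

import numpy as np

rng = np.random.default_rng(0)
lam, m, trials = 2.0, 5, 200_000

samples = rng.exponential(scale=1 / lam, size=(trials, m))
lam_hat = 1.0 / samples.mean(axis=1)  # the MLE of lambda on each trial
print(lam_hat.mean())  # noticeably above lam = 2.0 (in fact E = m*lam/(m-1) = 2.5 here)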
13. Estimation and risk analysis
I am a stockbroker, and on my computer screen I see 100 stocks. On each given day, the stocks report a return
rate (e.g. if I had invested $x in stock i, my share of that stock at the end of that day would be worth $(1 + r_i)x$).
I record these rates over the past 100 days, and notice that they can be modeled well by a Gaussian distribution,
with mean $\mu_i$ and standard deviation $\sigma_i$ for the ith stock.
(a) Even investment. Let's assume that I decide to diversify my funds to the limit, and give an equal amount of
money (say, $10) to every stock.
i. What is the maximum likelihood estimate of my expected return that day?
ii. I decide to estimate the standard deviation of my return by using $\hat{\sigma} = \sqrt{\frac{1}{100} \sum_{i=1}^{100} \sigma_i^2}$. Is this a good idea?
A bad idea? Is it a maximum likelihood estimator? Of what? Is it biased or unbiased?
(b) Consulting. I have two clients, Alice and Bob. They both want me to recommend one stock where they will
put all their funds.
i. Using the rate of money earned each day as the reward function, what is the Bayes reward for picking
stock i?
ii. Using the rate of money earned each day as the reward function, what is the minimax reward for picking
stock i?
iii. Alice is representing a business with tons of insurance protection. Her job is to earn as much money as possible in the
long run. Should I recommend to her the fund that maximizes the Bayes reward or the minimax reward?
iv. Bob is investing his life savings. He really can't afford to lose a single cent. Should I recommend to him
the fund that maximizes the Bayes reward or the minimax reward?
(c) Diversification. Suppose that on the next computer screen there are 10 additional stocks. Each stock, on
each day, has a 25% chance of doubling your investment and a 75% chance of losing half your money. I have
$100 to invest.
i. What is your Bayes risk if you invest all your money in one stock?
ii. What is your Bayes risk if you invest your money evenly in every stock?
iii. Using Hoeffding's inequality, what are the chances that you will lose half or more of your money in one
day if you invest in 1 stock? 10 stocks? What if there were 100 such stocks?
Hint: it's helpful to first ask yourself if there is an implicit bound on how much money you can earn in
one day. (A simulation sketch follows.)
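A hedged Monte Carlo sketch for part iii, useful only as a sanity check on whatever bound you derive (the 25%/75% double-or-half model is taken straight from the problem statement):

import numpy as np

rng = np.random.default_rng(0)
trials = 100_000

for k in (1, 10, 100):
    # each of k stocks gets $100/k; each independently doubles (p = 0.25)
    # or halves (p = 0.75) your stake in it
    multipliers = rng.choice([2.0, 0.5], p=[0.25, 0.75], size=(trials, k))
    wealth = (100 / k) * multipliers.sum(axis=1)
    print(k, (wealth <= 50).mean())  # empirical chance of losing half or more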
14. Nearest neighbors and Bayes for regression. Suppose I want to sell my house. It's a beautiful 2-story blue
house made primarily of wood, built in 1980, and located in Martha's Vineyard. I want to price this house reasonably,
but I'm not really sure what a good price will be. I look around and see the following other houses:

House            | # stories | Color  | Construction material | Distance to my house | Built year | Sold at price
Alice's house    | 2         | pink   | wood                  | 5 miles              | 2016       | $1 million
Benjamin's house | 4         | green  | straw                 | next door            | 1776       | $100
Carly's house    | 1         | brown  | concrete              | 2.5 miles            | 1980       | $300,000
Dasha's house    | 3         | white  | wood                  | 100 miles            | 1999       | $150,000
Elliot's house   | 1         | purple | brick                 | 1800 miles           | 2010       | $500,000
(a) Discuss how you would use a KNN regression model to pick a good price for my house. That is, design
a reasonable distance function, compute the "distances" to the features of each friend's house, decide on a
reasonable value for K, and give a reasonable price. (One possible distance function is sketched below.)
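A hedged sketch of one possible distance function; the scale factors, the 0/1 mismatch costs, and the treatment of "next door" as 0 miles are all invented choices, not the intended answer.

import numpy as np

# (stories, color, material, distance to mine in miles, built year, sale price)
houses = {
    'Alice':    (2, 'pink',   'wood',     5,    2016, 1_000_000),
    'Benjamin': (4, 'green',  'straw',    0,    1776, 100),
    'Carly':    (1, 'brown',  'concrete', 2.5,  1980, 300_000),
    'Dasha':    (3, 'white',  'wood',     100,  1999, 150_000),
    'Elliot':   (1, 'purple', 'brick',    1800, 2010, 500_000),
}
mine = (2, 'blue', 'wood', 0, 1980)

def dist_to_mine(h):
    return (abs(h[0] - mine[0])              # gap in number of stories
            + (h[1] != mine[1])              # color mismatch: 0/1
            + 2 * (h[2] != mine[2])          # material mismatch, weighted higher
            + h[3] / 10                      # geographic distance, scaled down
            + abs(h[4] - mine[4]) / 10)      # age gap, scaled down

K = 2
ranked = sorted(houses.items(), key=lambda kv: dist_to_mine(kv[1]))
price = np.mean([h[5] for _, h in ranked[:K]])
print([name for name, _ in ranked[:K]], price)  # average the K nearest sale prices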
(b) Now assume that I have m → +∞ friends. In this regime, I have at least one friend whose house is in the
same location as mine and has basically the exact same features. That friend sold their house for $25 million.
Does this mean that I am guaranteed to also sell it for this price? Why or why not?
(c) Suppose that house prices were dependent only on color and construction material. Discuss how you could
form a Bayes and a naive Bayes regression model to figure out the maximum likelihood price that I should use.
How much training data would you need if there are 5 possible colors, 5 distinct types of construction material,
the true price is between 0 and $ million, and I want to be accurate up to the nearest $100?