CPSC 340 Assignment 5 
Instructions
Rubric: {mechanics:5}
IMPORTANT!!! Before proceeding, please carefully read the general homework instructions
at https://www.cs.ubc.ca/~fwood/CS340/homework/. The above 5 points are for following the submission
instructions perfectly.
We use blue to highlight the deliverables that you must answer/do/submit with the assignment.
1 Kernel Logistic Regression
If you run python main.py -q 1 it will load a synthetic 2D data set, split it into train/validation sets, and
then perform regular logistic regression and kernel logistic regression (both without an intercept term, for
simplicity). You’ll observe that the error values and plots generated look the same since the kernel being
used is the linear kernel (i.e., the kernel corresponding to no change of basis).
1.1 Implementing kernels
Rubric: {code:5}
Implement the polynomial kernel and the RBF kernel for logistic regression. Report your training/validation
errors and submit the plots generated for each case. You should use the hyperparameters p = 2 and σ = 0.5
respectively, and λ = 0.01 for the regularization strength.
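For reference, the two kernels can be vectorized in NumPy roughly as follows. This is a sketch with illustrative names, not the starter code's required interface, and it assumes the usual forms $(1 + x_i^T x_j)^p$ and $\exp(-\|x_i - x_j\|^2 / (2\sigma^2))$:

import numpy as np

def polynomial_kernel(X1, X2, p=2):
    # (1 + x_i^T x_j)^p for every pair of rows of X1 (n x d) and X2 (t x d)
    return (1 + X1 @ X2.T) ** p

def rbf_kernel(X1, X2, sigma=0.5):
    # exp(-||x_i - x_j||^2 / (2 sigma^2)), using the expansion
    # ||x_i - x_j||^2 = ||x_i||^2 - 2 x_i^T x_j + ||x_j||^2
    sq_dists = (
        np.sum(X1 ** 2, axis=1)[:, None]
        - 2 * X1 @ X2.T
        + np.sum(X2 ** 2, axis=1)[None, :]
    )
    return np.exp(-sq_dists / (2 * sigma ** 2))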
1.2 Hyperparameter search
Rubric: {code:3}
For the RBF kernel logistic regression, consider the hyperparameter values $\sigma = 10^m$ for $m = -2, -1, \ldots, 2$ and $\lambda = 10^m$ for $m = -4, -3, \ldots, 0$. In main.py, sweep over the possible combinations of these hyperparameter values. Report the hyperparameter values that yield the best training error and the hyperparameter values that yield the best validation error. Include the plot for each.
Note: on the job you might choose to use a tool like scikit-learn’s GridSearchCV to implement the grid
search, but here we are asking you to implement it yourself by looping over the hyperparameter values.
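One generic way to structure that loop is sketched below; fit_and_score is a stand-in for whatever code in main.py trains the RBF kernel logistic regression with a given (σ, λ) pair and returns the error you are tracking:

import numpy as np

def grid_search(fit_and_score):
    # Sweep sigma = 10^m for m = -2..2 and lambda = 10^m for m = -4..0,
    # returning the pair with the lowest score returned by fit_and_score.
    best_score, best_hp = np.inf, None
    for sigma in 10.0 ** np.arange(-2, 3):
        for lam in 10.0 ** np.arange(-4, 1):
            score = fit_and_score(sigma, lam)
            if score < best_score:
                best_score, best_hp = score, (sigma, lam)
    return best_hp, best_score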
1.3 Reflection
Rubric: {reasoning:1}
Briefly discuss the best hyperparameters you found in the previous part, and their associated plots. Was the
training error minimized by the values you expected, given the ways that σ and λ affect the fundamental
tradeoff?
2 MAP Estimation
Rubric: {reasoning:8}
In class, we considered MAP estimation in a regression model where we assumed that:
• The likelihood $p(y_i \mid x_i, w)$ is a normal distribution with a mean of $w^T x_i$ and a variance of 1.
• The prior for each variable $j$, $p(w_j)$, is a normal distribution with a mean of zero and a variance of $\lambda^{-1}$.
Under these assumptions, we showed that this leads to the standard L2-regularized least squares objective function,
$$f(w) = \frac{1}{2}\|Xw - y\|^2 + \frac{\lambda}{2}\|w\|^2,$$
which is the negative log-posterior (the negative log-likelihood plus the negative log-prior) under these assumptions, ignoring an irrelevant constant; a sketch of this baseline derivation is included at the end of this section for reference. For each of the alternate assumptions below, show how the loss function would change (simplifying as much as possible):
1. We use a Laplace likelihood with a mean of $w^T x_i$ and a scale of 1, and we use a zero-mean Gaussian prior with a variance of $\sigma^2$:
$$p(y_i \mid x_i, w) = \frac{1}{2}\exp\left(-|w^T x_i - y_i|\right), \qquad p(w_j) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{w_j^2}{2\sigma^2}\right).$$
2. We use a Gaussian likelihood where each datapoint has its own variance $\sigma_i^2$, and where we use a zero-mean Laplace prior with a variance of $\lambda^{-1}$:
$$p(y_i \mid x_i, w) = \frac{1}{\sqrt{2\sigma_i^2\pi}}\exp\left(-\frac{(w^T x_i - y_i)^2}{2\sigma_i^2}\right), \qquad p(w_j) = \frac{\lambda}{2}\exp\left(-\lambda|w_j|\right).$$
You can use $\Sigma$ as a diagonal matrix that has the values $\sigma_i^2$ along the diagonal.
3. We use a (very robust) student $t$ likelihood with a mean of $w^T x_i$ and $\nu$ degrees of freedom, and a zero-mean Gaussian prior with a variance of $\lambda^{-1}$:
$$p(y_i \mid x_i, w) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}\left(1 + \frac{(w^T x_i - y_i)^2}{\nu}\right)^{-\frac{\nu+1}{2}}, \qquad p(w_j) = \sqrt{\frac{\lambda}{2\pi}}\exp\left(-\frac{\lambda w_j^2}{2}\right),$$
where $\Gamma$ is the "gamma" function (which is always non-negative).
4. We use a Poisson-distributed likelihood (for the case where $y_i$ represents counts), and we use a uniform prior for some constant $\kappa$:
$$p(y_i \mid w^T x_i) = \frac{\exp(y_i w^T x_i)\exp(-\exp(w^T x_i))}{y_i!}, \qquad p(w_j) \propto \kappa.$$
(This prior is "improper", since $w \in \mathbb{R}^d$ and the prior doesn't integrate to 1 over this domain, but the posterior will nevertheless be a proper distribution.)
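For reference (and not as part of the deliverable), here is a sketch of how the baseline objective above follows from the baseline assumptions; each of the four questions above changes one or both of the two sums:
$$-\log p(w \mid X, y) = -\sum_{i=1}^{n}\log p(y_i \mid x_i, w) - \sum_{j=1}^{d}\log p(w_j) + \text{const.}$$
With $p(y_i \mid x_i, w) \propto \exp\!\left(-\tfrac{1}{2}(w^T x_i - y_i)^2\right)$ and $p(w_j) \propto \exp\!\left(-\tfrac{\lambda}{2}w_j^2\right)$, this becomes
$$\frac{1}{2}\sum_{i=1}^{n}(w^T x_i - y_i)^2 + \frac{\lambda}{2}\sum_{j=1}^{d}w_j^2 + \text{const.} = \frac{1}{2}\|Xw - y\|^2 + \frac{\lambda}{2}\|w\|^2 + \text{const.},$$
so dropping the constant gives $f(w)$ above.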
3 Principal Component Analysis
Rubric: {reasoning:3}
Consider the following dataset, containing 5 examples with 2 features each:
x1 x2
-4 3
0 1
-2 2
4 -1
2 0
Recall that with PCA we usually assume that the PCs are normalized ($\|w\| = 1$), that we need to center the data before we apply PCA, and that the direction of the first PC is the one that minimizes the orthogonal distance to all data points.
1. What is the first principal component?
2. What is the reconstruction loss (L2 norm squared) of the point (-3, 2.5)? (Show your work.)
3. What is the reconstruction loss (L2 norm squared) of the point (-3, 2)? (Show your work.)
Hint: it may help (a lot) to plot the data before you start this question.
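As a reminder of the mechanics only (a generic sketch, not a solution to the questions above), the reconstruction loss of a point under a one-component PCA model can be computed as:

import numpy as np

def pca_reconstruction_loss(x, w, mu):
    # Squared L2 reconstruction error of a point x under a 1-component PCA model
    # with direction w (normalized below) and training mean mu.
    w = w / np.linalg.norm(w)        # enforce ||w|| = 1
    x_centered = x - mu              # center with the training mean
    z = x_centered @ w               # coordinate of x along the PC
    x_hat = mu + z * w               # reconstruction in the original space
    return np.sum((x - x_hat) ** 2)  # squared L2 norm of the residual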
4 PCA Generalizations
4.1 Robust PCA
Rubric: {code:10}
If you run python main.py -q 4.1 the code will load a dataset X where each row contains the pixels from a
single frame of a video of a highway. The demo applies PCA to this dataset and then uses this to reconstruct
the original image. It then shows the following 3 images for each frame:
1. The original frame.
2. The reconstruction based on PCA.
3. A binary image showing locations where the reconstruction error is non-trivial.
Recently, latent-factor models have been proposed as a strategy for “background subtraction”: trying to
separate objects from their background. In this case, the background is the highway and the objects are the
cars on the highway. In this demo, we see that PCA does an OK job of identifying the cars on the highway
in that it does tend to identify the locations of cars. However, the results aren’t great as it identifies quite
a few irrelevant parts of the image as objects.
Robust PCA is a variation on PCA where we replace the L2-norm with the L1-norm,
$$f(Z, W) = \sum_{i=1}^{n}\sum_{j=1}^{d}\left|\langle w^j, z_i\rangle - x_{ij}\right|,$$
and it has recently been proposed as a more effective model for background subtraction. Complete the class pca.RobustPCA so that it uses a smooth approximation to the absolute value to implement robust PCA. Briefly comment on the results.
Note: in its current state, pca.RobustPCA is just a copy of pca.AlternativePCA, which is why the two rows
of images are identical.
Hint: most of the work has been done for you in the class pca.AlternativePCA. This work implements an
alternating minimization approach to minimizing the (L2) PCA objective (without enforcing orthogonality).
This gradient-based approach to PCA can be modified to use a smooth approximation of the L1-norm. Note that the log-sum-exp approximation to the absolute value may be hard to get working due to numerical issues, and a numerically-nicer approach is to use the "multi-quadric" approximation:
$$|\alpha| \approx \sqrt{\alpha^2 + \epsilon},$$
where $\epsilon$ controls the accuracy of the approximation (a typical value of $\epsilon$ is 0.0001).
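A minimal sketch of the multi-quadric approximation and its derivative, assuming the residual matrix is written as R = Z @ W - X to match the objective above (the names here are illustrative, not the starter code's):

import numpy as np

def smooth_abs(r, epsilon=0.0001):
    # Multi-quadric approximation to |r|, applied elementwise.
    return np.sqrt(r ** 2 + epsilon)

def smooth_abs_grad(r, epsilon=0.0001):
    # Derivative of sqrt(r^2 + epsilon) with respect to r, elementwise.
    return r / np.sqrt(r ** 2 + epsilon)

# With R = Z @ W - X, the robust objective is np.sum(smooth_abs(R)), and by the
# chain rule the gradients are:
#   dF/dW = Z.T @ smooth_abs_grad(R)
#   dF/dZ = smooth_abs_grad(R) @ W.T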
4.2 Reflection
Rubric: {reasoning:3}
1. Briefly explain why using the L1 loss might be more suitable for this task than L2.
2. How does the number of video frames and the size of each frame relate to n, d, and/or k?
3. What would the effect be of changing the threshold (see code) in terms of false positives (cars we
identify that aren’t really there) and false negatives (real cars that we fail to identify)?
5 Very-Short Answer Questions
Rubric: {reasoning:11}
1. Assuming we want to use the original features (no change of basis) in a linear model, what is an
advantage of the “other” normal equations over the original normal equations?
2. In class we argued that it’s possible to make a kernel version of k-means clustering. What would an
advantage of kernels be in this context?
3. In the language of loss functions and regularization, what is the difference between MLE and MAP?
4. What is the difference between a generative model and a discriminative model?
5. With PCA, is it possible for the loss to increase if k is increased? Briefly justify your answer.
6. What does “label switching” mean in the context of PCA?
7. Why doesn't it make sense to do PCA with $k > d$?
8. In terms of the matrices associated with PCA ($X$, $W$, $Z$, $\hat{X}$), where would an "eigenface" be stored?
9. What is an advantage and a disadvantage of using stochastic gradient over SVD when doing PCA?
10. Which of the following step-size sequences lead to convergence of stochastic gradient to a stationary point?
(a) $\alpha^t = 1/t^2$.
(b) $\alpha^t = 1/t$.
(c) $\alpha^t = 1/\sqrt{t}$.
(d) $\alpha^t = 1$.
11. We discussed "global" vs. "local" features for e-mail classification. What is an advantage of using global features, and what is an advantage of using local features?