
CPSC 340 Assignment 3
Instructions
Rubric: {mechanics:5}
IMPORTANT!!! Before proceeding, please carefully read the general homework instructions
at https://www.cs.ubc.ca/~fwood/CS340/homework/. The above 5 points are for following the submission
instructions perfectly.
We use blue to highlight the deliverables that you must answer/do/submit with the assignment.
1 Finding Similar Items
For this question we’ll use the Amazon product data set¹ from http://jmcauley.ucsd.edu/data/amazon/.
We will focus on the “Patio, Lawn, and Garden” section. You should start by downloading the ratings at
https://stanford.io/2Q7QTvu and place the file in your data directory with the original filename. Once
you do that, running python main.py -q 1 should do the following steps:
• Load the raw ratings data set into a Pandas dataframe.
• Construct the user-product matrix as a sparse matrix (to be precise, a scipy.sparse.csr_matrix).
• Create bi-directional mappings from the user ID (e.g. “A2VNYWOPJ13AFP”) to the integer index
into the rows of X.
• Create bi-directional mappings from the item ID (e.g. “0981850006”) to the integer index into the
columns of X.
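A minimal sketch of these steps, assuming the ratings file keeps its original name (ratings_Patio_Lawn_and_Garden.csv) and is a header-less CSV with columns user, item, rating, timestamp; the variable names are illustrative rather than the ones required by main.py:

import pandas as pd
from scipy.sparse import csr_matrix

# Load the raw ratings into a Pandas dataframe.
ratings = pd.read_csv("../data/ratings_Patio_Lawn_and_Garden.csv",
                      names=["user", "item", "rating", "timestamp"])

# Bi-directional mappings between IDs and integer indices.
users = ratings["user"].unique()
items = ratings["item"].unique()
user_mapper = {uid: i for i, uid in enumerate(users)}        # user ID -> row index
item_mapper = {iid: j for j, iid in enumerate(items)}        # item ID -> column index
user_inverse_mapper = {i: uid for uid, i in user_mapper.items()}
item_inverse_mapper = {j: iid for iid, j in item_mapper.items()}

# Sparse user-product matrix: X[i, j] is the star rating user i gave item j.
rows = ratings["user"].map(user_mapper).to_numpy()
cols = ratings["item"].map(item_mapper).to_numpy()
X = csr_matrix((ratings["rating"].to_numpy(), (rows, cols)),
               shape=(len(users), len(items)))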
1.1 Exploratory data analysis
1.1.1 Most popular item
Rubric: {code:1}
Find the item with the most total stars. Submit the product name and the number of stars.
Note: once you find the ID of the item, you can look up the name by going to the url
https://www.amazon.com/dp/ITEM_ID, where ITEM_ID is the ID of the item. For example, the URL for
item ID “B00CFM0P7Y” is https://www.amazon.com/dp/B00CFM0P7Y.
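One way to compute this from the sparse matrix (using X and item_inverse_mapper from the setup sketch above):

import numpy as np

# Total stars per item: sum down each column of the user-product matrix.
total_stars = np.asarray(X.sum(axis=0)).ravel()
best_item = int(np.argmax(total_stars))
print(item_inverse_mapper[best_item], total_stars[best_item])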
1.1.2 User with most reviews
Rubric: {code:1}
Find the user who has rated the most items, and the number of items they rated.
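Similarly, counting the nonzero entries in each row gives the number of items rated per user (again using the names from the setup sketch):

# Number of items rated per user: count nonzeros along each row.
ratings_per_user = X.getnnz(axis=1)
busiest_user = int(np.argmax(ratings_per_user))
print(user_inverse_mapper[busiest_user], ratings_per_user[busiest_user])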
¹ The author of the data set has asked for the following citations: (1) Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. R. He, J. McAuley. WWW, 2016, and (2) Image-based recommendations on styles and substitutes. J. McAuley, C. Targett, J. Shi, A. van den Hengel. SIGIR, 2015.
1.1.3 Histograms
Rubric: {code:2}
Make the following histograms:
1. The number of ratings per user
2. The number of ratings per item
3. The ratings themselves
Note: for the first two, use plt.yscale('log', nonposy='clip') to put the histograms on a log-scale.
Also, you can use X.getnnz to get the total number of nonzero elements along a specific axis.
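A sketch of the first histogram (the other two follow the same pattern); note that nonposy applies to older matplotlib versions, so adjust the keyword if your version complains:

import matplotlib.pyplot as plt

plt.hist(X.getnnz(axis=1), bins=50)        # ratings per user
plt.yscale('log', nonposy='clip')          # log-scale counts, as suggested above
plt.xlabel('number of ratings per user')
plt.ylabel('count (log scale)')
plt.savefig('../figs/ratings_per_user.png')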
1.2 Finding similar items with nearest neighbours
Rubric: {code:6}
We’ll use scikit-learn’s neighbors.NearestNeighbors object to find the items most similar to the example
item above, namely the Brass Grill Brush 18 Inch Heavy Duty and Extra Strong, Solid Oak Handle, at URL
https://www.amazon.com/dp/B00CFM0P7Y.
Find the 5 most similar items to the Grill Brush using the following metrics:
1. Euclidean distance (the NearestNeighbors default)
2. Normalized Euclidean distance (you’ll need to do the normalization)
3. Cosine similarity (by setting metric='cosine')
Some notes/hints...
• If you run python main.py -q 1.2, it will grab the row of X associated with the grill brush. The
mappers take care of going back and forth between the IDs (like “B00CFM0P7Y”) and the indices
of the sparse array (0, 1, 2, . . .).
• Keep in mind that scikit-learn’s NearestNeighbors is for taking neighbors across rows, but here we’re
working across columns.
• Keep in mind that scikit-learn’s NearestNeighbors will include the query item itself as one of the
nearest neighbours if the query item is in the “training set”.
• Normalizing the columns of a matrix would usually be reasonable to implement, but because X is stored
as a sparse matrix it’s a bit more of a mess. Therefore, use sklearn.preprocessing.normalize to
help you with the normalization in part 2.
Did normalized Euclidean distance and cosine similarity yield the same similar items, as expected?
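A minimal sketch of the cosine-similarity case, transposing so that items are rows (the other two metrics follow the same pattern, with sklearn.preprocessing.normalize handling part 2):

from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

grill_brush = "B00CFM0P7Y"
j = item_mapper[grill_brush]           # column index of the grill brush
X_items = X.T.tocsr()                  # transpose: items as rows, users as columns

# Cosine similarity; ask for 6 neighbours because the query item itself comes back.
nn = NearestNeighbors(n_neighbors=6, metric='cosine').fit(X_items)
_, inds = nn.kneighbors(X_items[j])
print([item_inverse_mapper[k] for k in inds.ravel() if k != j][:5])

# For normalized Euclidean distance (part 2), normalize the item rows first:
X_items_norm = normalize(X_items)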
1.3 Total popularity
Rubric: {reasoning:2}
For both Euclidean distance and cosine similarity, find the number of reviews for each of the 5 recommended
items and report it. Do the results make sense given what we discussed in class about Euclidean distance
vs. cosine similarity and popular items?
Note: in main.py you are welcome to combine this code with your code from the previous part, so that you
don’t have to copy/paste all that code in another section of main.py.
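One possibility for the review counts, reusing inds and j from the sketch in the previous part:

# Number of reviews for an item = number of nonzero entries in its column of X.
reviews_per_item = X.getnnz(axis=0)
print([reviews_per_item[k] for k in inds.ravel() if k != j])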
2 Matrix Notation and Minimizing Quadratics
2.1 Converting to Matrix/Vector/Norm Notation
Rubric: {reasoning:3}
Using our standard supervised learning notation (X, y, w) express the following functions in terms of vectors,
matrices, and norms (there should be no summations or maximums).
1. $\max_{i\in\{1,2,\dots,n\}} |w^T x_i - y_i|$.

2. $\sum_{i=1}^n v_i(w^T x_i - y_i)^2 + \frac{\lambda}{2}\sum_{j=1}^d w_j^2$.

3. $\left(\sum_{i=1}^n |w^T x_i - y_i|\right)^2 + \frac{1}{2}\sum_{j=1}^d \lambda_j |w_j|$.
Note that in part 2 we give a weight $v_i$ to each training example and the value $\lambda$ is a non-negative scalar, whereas in part 3 we are regularizing the parameters with different weights $\lambda_j$. You can use $V$ to denote a diagonal matrix that has the values $v_i$ along the diagonal, and $\Lambda$ as a diagonal matrix that has the $\lambda_j$ values along the diagonal. You can assume that all the $v_i$ and $\lambda_j$ values are non-negative.
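As a reminder of the kind of conversion being asked for, here is the unweighted, unregularized example from class (not one of the parts above):

$$\sum_{i=1}^n (w^T x_i - y_i)^2 = \|Xw - y\|^2, \qquad \sum_{j=1}^d w_j^2 = \|w\|^2 = w^T w.$$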
2.2 Minimizing Quadratic Functions as Linear Systems
Rubric: {reasoning:3}
Write finding a minimizer w of the functions below as a system of linear equations (using vector/matrix
notation and simplifying as much as possible). Note that all the functions below are convex so finding a w
with ∇f(w) = 0 is sufficient to minimize the functions (but show your work in getting to this point).
1. $f(w) = \frac{1}{2}\|w - v\|^2$ (projection of $v$ onto real space).

2. $f(w) = \frac{1}{2}\|Xw - y\|^2 + \frac{1}{2}w^T \Lambda w$ (least squares with weighted regularization).

3. $f(w) = \frac{1}{2}\sum_{i=1}^n v_i(w^T x_i - y_i)^2 + \frac{\lambda}{2}\|w - w^0\|^2$ (weighted least squares shrunk towards non-zero $w^0$).
Above we assume that $v$ and $w^0$ are $d$ by $1$ vectors (in part 3, $v$ is an $n$ by $1$ vector), and $\Lambda$ is a $d$ by $d$ diagonal matrix (with positive entries along the diagonal). You can use $V$ as a diagonal matrix containing the $v_i$ values along the diagonal.
Hint: Once you convert to vector/matrix notation, you can use the results from class to quickly compute these quantities term-wise. As a sanity check for your derivation, make sure that the dimensions match for all quantities/operations; in order to make the dimensions match for some parts you may need to introduce an identity matrix. For example, $X^T Xw + \lambda w$ can be re-written as $(X^T X + \lambda I)w$.
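For instance, for the plain least squares objective from lecture (again, not one of the parts above), setting the gradient to zero yields the familiar normal equations:

$$f(w) = \tfrac{1}{2}\|Xw - y\|^2, \qquad \nabla f(w) = X^T X w - X^T y = 0 \iff X^T X w = X^T y.$$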
3 Robust Regression and Gradient Descent
If you run python main.py -q 3, it will load a one-dimensional regression dataset that has a non-trivial
number of ‘outlier’ data points. These points do not fit the general trend of the rest of the data, and pull
the least squares model away from the main downward trend that most data points exhibit:

[Figure: the data and the least squares fit, titled “Least Squares”.]
Note: we are fitting the regression without an intercept here, just for simplicity of the homework question.
In reality one would rarely do this. But here it’s OK because the “true” line passes through the origin (by
design). In Q?? we’ll address this explicitly.
3.1 Weighted Least Squares in One Dimension
Rubric: {code:3}
One of the most common variations on least squares is weighted least squares. In this formulation, we have
a weight vi for every training example. To fit the model, we minimize the weighted squared error,
$$f(w) = \frac{1}{2}\sum_{i=1}^n v_i(w^T x_i - y_i)^2.$$
In this formulation, the model focuses on making the error small for examples $i$ where $v_i$ is high. Similarly, if $v_i$ is low then the model allows a larger error. Note: these weights $v_i$ (one per training example) are completely different from the model parameters $w_j$ (one per feature), which, confusingly, we sometimes also call “weights”.
Complete the model class, WeightedLeastSquares, that implements this model (note that Q2.2.3 asks you to show how a similar formulation can be solved as a linear system). Apply this model to the data containing outliers, setting $v_i = 1$ for the first 400 data points and $v_i = 0.1$ for the last 100 data points (which are the outliers). Hand in your code and the updated plot.
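A minimal sketch of one way to implement the fit step, assuming a LeastSquares-style class in linear_model.py and that the weights arrive as a length-n vector v (the exact signature expected by the starter code may differ):

import numpy as np
from numpy.linalg import solve

class WeightedLeastSquares:
    def fit(self, X, y, v):
        # Weighted normal equations X^T V X w = X^T V y, using row scaling
        # instead of forming the diagonal matrix V explicitly.
        Xv = X * v[:, None]            # row i of X scaled by v_i
        self.w = solve(Xv.T @ X, Xv.T @ y)

    def predict(self, X):
        return X @ self.w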
3.2 Smooth Approximation to the L1-Norm
Rubric: {reasoning:3}
Unfortunately, we typically do not know the identities of the outliers. In situations where we suspect that
there are outliers, but we do not know which examples are outliers, it makes sense to use a loss function that
is more robust to outliers. In class, we discussed using the sum of absolute values objective,
$$f(w) = \sum_{i=1}^n |w^T x_i - y_i|.$$
This is less sensitive to outliers than least squares, but it is non-differentiable and harder to optimize.
Nevertheless, there are various smooth approximations to the absolute value function that are easy to
optimize. One possible approximation is to use the log-sum-exp approximation of the max function²:
$$|r| = \max\{r, -r\} \approx \log(\exp(r) + \exp(-r)).$$
Using this approximation, we obtain an objective of the form
$$f(w) = \sum_{i=1}^n \log\left(\exp(w^T x_i - y_i) + \exp(y_i - w^T x_i)\right),$$
which is smooth but less sensitive to outliers than the squared error. Derive the gradient $\nabla f$ of this function with respect to $w$. You should show your work but you do not have to express the final result in matrix notation.
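One standard scalar derivative that may help along the way (writing $r$ for a single residual $w^T x_i - y_i$):

$$\frac{d}{dr}\log\bigl(\exp(r) + \exp(-r)\bigr) = \frac{\exp(r) - \exp(-r)}{\exp(r) + \exp(-r)}.$$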
3.3 Robust Regression
Rubric: {code:3}
The class LinearModelGradient is the same as LeastSquares, except that it fits the least squares model
using a gradient descent method. If you run python main.py -q 3.3 you’ll see it produces the same fit as
we obtained using the normal equations.
The typical input to a gradient method is a function that, given w, returns f(w) and ∇f(w). See funObj
in LinearModelGradient for an example. Note that the fit function of LinearModelGradient also has
a numerical check that the gradient code is approximately correct, since implementing gradients is often
error-prone.³
An advantage of gradient-based strategies is that they are able to solve problems that do not have closed-form solutions, such as the formulation from the previous section. The class LinearModelGradient has
most of the implementation of a gradient-based strategy for fitting the robust regression model under the
log-sum-exp approximation. The only part missing is the function and gradient calculation inside the funObj
code. Modify funObj to implement the objective function and gradient based on the smooth approximation
to the absolute value function (from the previous section). Hand in your code, as well as the plot obtained
using this robust regression approach.
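A minimal sketch of what funObj could look like inside LinearModelGradient under the log-sum-exp approximation (this is one way to write it; the starter code's conventions may differ, and very large residuals may call for a more numerically stable form):

import numpy as np

def funObj(self, w, X, y):
    r = X @ w - y                                 # residuals
    # Smooth (log-sum-exp) approximation to sum_i |r_i|.
    f = np.sum(np.log(np.exp(r) + np.exp(-r)))
    # Gradient: (exp(r) - exp(-r)) / (exp(r) + exp(-r)) = tanh(r), chained through X.
    g = X.T @ np.tanh(r)
    return f, g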
4 Linear Regression and Nonlinear Bases
In class we discussed fitting a linear regression model by minimizing the squared error. In this question,
you will start with a data set where least squares performs poorly. You will then explore how adding a
bias variable and using nonlinear (polynomial) bases can drastically improve the performance. You will also
explore how the complexity of a basis affects both the training error and the test error.
4.1 Adding a Bias Variable
Rubric: {code:3}
If you run python main.py -q 4, it will:
² Other possibilities are the Huber loss, or $|r| \approx \sqrt{r^2 + \epsilon}$ for some small $\epsilon$.
³ Sometimes the numerical gradient checker itself can be wrong. See CPSC 303 for a lot more on numerical differentiation.

1. Load a one-dimensional regression dataset.
2. Fit a least-squares linear regression model.
3. Report the training error.
4. Report the test error (on a dataset not used for training).
5. Draw a figure showing the training data and what the linear model looks like.
Unfortunately, this is an awful model of the data. The average squared training error on the data set is over
28000 (as is the test error), and the figure produced by the demo confirms that the predictions are usually
nowhere near the training data:
[Figure: the training data and the least squares predictions, titled “Least Squares, no bias”.]
The y-intercept of this data is clearly not zero (it looks like it’s closer to 200), so we should expect to improve
performance by adding a bias (a.k.a. intercept) variable, so that our model is
$$y_i = w^T x_i + w_0$$
instead of
$$y_i = w^T x_i.$$
In file linear_model.py, complete the class LeastSquaresBias, which has the same input/model/predict format as the LeastSquares class, but adds a bias variable (also called an intercept) $w_0$ (also called $\beta$ in lecture). Hand in your new class, the updated plot, and the updated training/test error.
Hint: recall that adding a bias w0 is equivalent to adding a column of ones to the matrix X. Don’t forget
that you need to do the same transformation in the predict function.
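A minimal sketch, assuming the same fit/predict interface as the provided LeastSquares class:

import numpy as np
from numpy.linalg import solve

class LeastSquaresBias:
    def fit(self, X, y):
        # Append a column of ones so that the first weight plays the role of w_0.
        Z = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
        self.w = solve(Z.T @ Z, Z.T @ y)

    def predict(self, X):
        Z = np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)
        return Z @ self.w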
4.2 Polynomial Basis
Rubric: {code:4}
Adding a bias variable improves the prediction substantially, but the model is still problematic because the target seems to be a non-linear function of the input. Complete the LeastSquaresPoly class, which takes a data vector $x$ (i.e., assuming we only have one feature) and the polynomial order $p$. The function should perform a least squares fit based on a matrix $Z$ where each of its rows contains the values $(x_i)^j$ for $j = 0$ up to $p$.
E.g., LeastSquaresPoly.fit(x,y) with $p = 3$ should form the matrix
$$Z = \begin{bmatrix}
1 & x_1 & (x_1)^2 & (x_1)^3 \\
1 & x_2 & (x_2)^2 & (x_2)^3 \\
\vdots & \vdots & \vdots & \vdots \\
1 & x_n & (x_n)^2 & (x_n)^3
\end{bmatrix},$$
and fit a least squares model based on it. Hand in the new class, and report the training and test error for
p = 0 through p = 10. Explain the effect of p on the training error and on the test error.
Note: you should write the code yourself; don’t use a library like sklearn’s PolynomialFeatures.
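A minimal sketch of one way to structure the class (the basis-construction helper is illustrative, not a required name):

import numpy as np
from numpy.linalg import solve

class LeastSquaresPoly:
    def __init__(self, p):
        self.p = p

    def _poly_basis(self, X):
        # Build Z with columns x^0, x^1, ..., x^p for the single feature x.
        n = X.shape[0]
        Z = np.ones((n, self.p + 1))
        for j in range(1, self.p + 1):
            Z[:, j] = X[:, 0] ** j
        return Z

    def fit(self, X, y):
        Z = self._poly_basis(X)
        self.w = solve(Z.T @ Z, Z.T @ y)

    def predict(self, X):
        return self._poly_basis(X) @ self.w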
5 Very-Short Answer Questions
Rubric: {reasoning:7}
1. Suppose that a training example is a global outlier, meaning it is really far from all other data points. How is the cluster assignment of this example set by k-means? And how is it set by density-based clustering?
2. Why do we need random restarts for k-means but not for density-based clustering?
3. Can hierarchical clustering find non-convex clusters?
4. For model-based outlier detection, list an example method and problem with identifying outliers using
this method.
5. For graphical-based outlier detection, list an example method and problem with identifying outliers
using this method.
6. For supervised outlier detection, list an example method and problem with identifying outliers using
this method.
7. If we want to do linear regression with 1 feature, explain why it would or would not make sense to use
gradient descent to compute the least squares solution.
8. Why do we typically add a column of 1 values to X when we do linear regression? Should we do this
if we’re using decision trees?
9. If a function is convex, what does that say about stationary points of the function? Does convexity imply that a stationary point exists?
10. Why do we need gradient descent for the robust regression problem, as opposed to just using the
normal equations? Hint: it is NOT because of the non-differentiability. Recall that we used gradient
descent even after smoothing away the non-differentiable part of the loss.
11. What is the problem with having too small of a learning rate in gradient descent?
12. What is the problem with having too large of a learning rate in gradient descent?
13. What is the purpose of the log-sum-exp function and how is this related to gradient descent?
14. What type of non-linear transform might be suitable if we had a periodic function?