Homework 2
Instructions
You will need the following files:
Logistic regression and validation code stub: http://jmcauley.ucsd.edu/code/homework2_starter.py
Executing the code requires a working install of Python 2.7 or Python 3 with the scipy package installed. Please include the code of (the important parts of) your solutions.
Tasks (Classifier evaluation)
A code stub has been provided (see link above) that runs a logistic regressor on the beer rating data, similar to the classifier we built in the last homework. The stub predicts whether a beer has an ABV ≥ 6.5 based on its five rating scores:

p(positive label) = σ(θ0 + θ1 × ‘review/taste’ + θ2 × ‘review/appearance’ + θ3 × ‘review/aroma’ + θ4 × ‘review/palate’ + θ5 × ‘review/overall’)

The stub runs logistic regression with a hyperparameter λ = 1.0. We will use this stub to further improve and evaluate our classifier.
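For reference, the model above corresponds to a feature vector with a constant term followed by the five rating scores. A minimal sketch of that setup, assuming the stub loads the reviews into a list of dicts named data with field names such as 'review/taste' and 'beer/ABV' (these names are assumptions, not necessarily the stub's exact code):

def feature(datum):
    # Constant (bias) term followed by the five rating scores.
    return [1,
            datum['review/taste'],
            datum['review/appearance'],
            datum['review/aroma'],
            datum['review/palate'],
            datum['review/overall']]

X = [feature(d) for d in data]             # one feature vector per review
y = [d['beer/ABV'] >= 6.5 for d in data]   # positive label: ABV >= 6.5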
1. The code currently does not perform any train/test splits. Split the data into training, validation, and test sets, via 1/3, 1/3, 1/3 splits. Use the first third, second third, and last third of the data (respectively). After training on the training set, report the accuracy of the classifier on the validation and test sets (1 mark).
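One way to implement the split (a sketch; X and y are assumed to hold the features and labels in the dataset's original order):

n = len(X)
X_train, y_train = X[:n//3], y[:n//3]
X_valid, y_valid = X[n//3:2*n//3], y[n//3:2*n//3]
X_test,  y_test  = X[2*n//3:], y[2*n//3:]

def accuracy(predictions, labels):
    # Fraction of examples whose prediction matches the true label.
    return sum(p == l for p, l in zip(predictions, labels)) / float(len(labels))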
2. Let’s come up with a more accurate classifier based on a few common words in the review. Build a feature vector to implement a classifier of the form
p(positive label) = σ(θ0 + θ1 × #‘lactic’ + θ2 × #‘tart’ + …), where each feature corresponds to the number of times a particular word appears. Base your feature vector on the following 10 words: “lactic,” “tart,” “sour,” “citric,” “sweet,” “acid,” “hop,” “fruit,” “salt,” “spicy.” Convert the reviews to lowercase before counting.
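A sketch of such a feature function, assuming the raw review text is stored under a key like 'review/text' (an assumption about the data format):

words = ['lactic', 'tart', 'sour', 'citric', 'sweet',
         'acid', 'hop', 'fruit', 'salt', 'spicy']

def word_feature(datum):
    # Lowercase the review text, then count each word. str.count matches
    # substrings (e.g. 'hoppy' also counts toward 'hop'); tokenizing the
    # text first would give stricter whole-word counts.
    text = datum['review/text'].lower()
    return [1] + [text.count(w) for w in words]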
3. Report the number of true positives, true negatives, false positives, false negatives, and the Balanced Error Rate of the classifier on the test set (1 mark).
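These quantities can be computed directly from boolean predictions and labels, for example:

def confusion_and_ber(predictions, labels):
    # Confusion counts on one set of predictions/labels.
    TP = sum(1 for p, l in zip(predictions, labels) if p and l)
    TN = sum(1 for p, l in zip(predictions, labels) if not p and not l)
    FP = sum(1 for p, l in zip(predictions, labels) if p and not l)
    FN = sum(1 for p, l in zip(predictions, labels) if not p and l)
    fnr = FN / float(TP + FN)   # false negative rate
    fpr = FP / float(TN + FP)   # false positive rate
    ber = 0.5 * (fnr + fpr)     # Balanced Error Rate
    return TP, TN, FP, FN, ber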
4. (Hard) Our classifier is possibly less effective than it could be due to the issue of class imbalance (i.e., very few of the datapoints have a positive label). Show how you would adjust the gradient ascent code provided such that the classifier would be approximately ‘balanced’ between the positive and negative classes. Report the Balanced Error Rate (on the train/validation/test sets) for the new classifier (1 mark).
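One standard adjustment (sketched below; the function names and structure are illustrative rather than the stub's exact code) is to weight each example inversely to its class frequency, so the positive and negative classes contribute equally to the objective and its gradient:

import numpy
from math import exp, log

def sigmoid(x):
    return 1.0 / (1 + exp(-x))

def f_balanced(theta, X, y, lam):
    # Class-weighted, L2-regularized log-likelihood, negated for a minimizer.
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    loglik = 0
    for x_i, y_i in zip(X, y):
        w = 1.0 / n_pos if y_i else 1.0 / n_neg   # class weight
        logit = numpy.dot(x_i, theta)
        loglik -= w * log(1 + exp(-logit))
        if not y_i:
            loglik -= w * logit
    loglik -= lam * numpy.dot(theta, theta)
    return -loglik

def fprime_balanced(theta, X, y, lam):
    # Gradient with the same per-example class weights, also negated.
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    dl = numpy.zeros(len(theta))
    for x_i, y_i in zip(X, y):
        w = 1.0 / n_pos if y_i else 1.0 / n_neg
        logit = numpy.dot(x_i, theta)
        dl += w * numpy.array(x_i) * (1 - sigmoid(logit))
        if not y_i:
            dl -= w * numpy.array(x_i)
    dl -= 2 * lam * numpy.array(theta)
    return -dl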
5. Implement a training/validation/test pipeline so that you can select the best model based on its performance on the validation set. Try models with λ ∈ {0, 0.01, 0.1, 1, 100}. Report the performance on the training/validation/test sets for the best value of λ (1 mark).
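A sketch of the pipeline, reusing the splits and accuracy helper from above; train() and predict() are hypothetical helpers wrapping the stub's optimizer call and the σ ≥ 0.5 decision rule:

best_lam, best_acc, best_theta = None, -1, None
for lam in [0, 0.01, 0.1, 1, 100]:
    theta = train(lam, X_train, y_train)
    valid_acc = accuracy(predict(theta, X_valid), y_valid)
    if valid_acc > best_acc:
        best_lam, best_acc, best_theta = lam, valid_acc, theta

print("best lambda:", best_lam)
print("train accuracy:", accuracy(predict(best_theta, X_train), y_train))
print("validation accuracy:", best_acc)
print("test accuracy:", accuracy(predict(best_theta, X_test), y_test))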
Tasks (Dimensionality reduction)
Next, we’ll run dimensionality reduction on the same data, using the word features from the previous question (you can drop the constant feature). Specifically, we’ll try to find the principal components of our 10 word features. For this question, use the training set constructed from the initial 1/3, 1/3, 1/3 splits of the data.
6. Find and report the PCA components (i.e., the transform matrix) using the week 3 code (1 mark).
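The week 3 code computes the components directly; an equivalent sketch using numpy's SVD is shown below (data_train is an assumed name for the first-third training split, and word_feature is the word-count feature from Question 2):

import numpy

# n x 10 matrix of word counts for the training set, without the constant column.
X_words = numpy.array([word_feature(d)[1:] for d in data_train], dtype=float)
X_centered = X_words - X_words.mean(axis=0)

# The rows of Vt are the principal components, i.e. the PCA transform matrix.
U, S, Vt = numpy.linalg.svd(X_centered, full_matrices=False)
print(Vt)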
7. Suppose we want to compress the data using just two PCA dimensions. How large is the reconstruction error when doing so (1 mark)?
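Continuing the sketch above: keep the top two components, project, reconstruct, and measure the error (reported here as the total squared reconstruction error; dividing by the number of points gives a mean-squared version):

components = Vt[:2]                           # 2 x 10 transform
projected = X_centered.dot(components.T)      # n x 2 compressed data
reconstructed = projected.dot(components)     # back to n x 10
error = ((X_centered - reconstructed) ** 2).sum()
print(error)                                  # equals (S[2:] ** 2).sum()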
8. Looking at the first two dimensions of our data in the PCA basis is an effective way to ‘summarize’ the data via a 2-d plot. Using a plotting program of your choice, make a 2-d scatterplot comparing ‘American IPA’ style beers against all other styles (e.g., plot American IPAs in red and other styles in blue) (1 mark).
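A sketch with matplotlib, continuing the PCA sketch above (projected holds the 2-d coordinates) and assuming the style of each beer is stored under a key such as 'beer/style' (an assumption about the data format):

import matplotlib.pyplot as plt

is_ipa = numpy.array([d['beer/style'] == 'American IPA' for d in data_train])

plt.scatter(projected[~is_ipa, 0], projected[~is_ipa, 1],
            c='blue', s=5, label='other styles')
plt.scatter(projected[is_ipa, 0], projected[is_ipa, 1],
            c='red', s=5, label='American IPA')
plt.xlabel('first PCA dimension')
plt.ylabel('second PCA dimension')
plt.legend()
plt.show()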
