Stat 437 HW5
Your Name (Your student ID)
General rule
Due by 11:59pm Pacific Standard Time, May 2, 2021. Please show your work and submit your
computer code in order to get points. Providing correct answers without supporting details will
not receive full credit. This HW covers
• support vector machines
• neural networks
• principal component analysis
You DO NOT have to submit your HW answers using typesetting software. However, your answers
must be legible for grading. Please upload your answers to the course space.
For exercises from the Text, solutions may be posted online. Please do not plagiarize those
solutions.
Conceptual exercises: I (support vector machines)
1.1) State the mathematical definition of a hyperplane. Describe the classification rule that is induced
by a hyperplane. How does the classification rule involve the normal vector of the hyperplane? (Hint:
you can use information on page 12 of Lecture Notes 6 to find the normal vector of a hyperplane
and then use information on pages 4 and 5 of Lecture Notes 6.)
1.2) Consider a two-class classification problem where observations $\{(x_i, y_i)\}_{i=1}^n$ can be completely
separated by a hyperplane. Consider a hyperplane $S = \{x \in \mathbb{R}^p : \langle x, \alpha \rangle + \beta_0 = 0\}$ with direction $\alpha$
and intercept $\beta_0$. Explain why the distance from $x_i$ to $S$ is
$$\mathrm{dist}(x_i, S) = y_i (\langle x_i, \alpha \rangle + \beta_0)$$
when $\|\alpha\| = 1$. (Hint: you can read through pages 11 and 12 of Lecture Notes 6 and watch the
corresponding lecture video clips.)
1.3) Consider a two-class classification problem where observations $\{(x_i, y_i)\}_{i=1}^n$ can be completely
separated by a hyperplane. Consider a hyperplane $S = \{x \in \mathbb{R}^p : \langle x, \alpha \rangle + \beta_0 = 0\}$. Why are there
infinitely many separating hyperplanes for these observations? What is the optimization problem
that the maximal margin classifier tries to solve? State the optimization problem mathematically
and explain the meaning of each term in the mathematical formulation. (Hint: you need to first
set up notation, and then you can use information on page 13 of Lecture Notes 6.) Why does
the optimization problem have the constraint $\|\alpha\| = 1$? (Hint: you can use the partial answer to 1.2)
above.) Explain why the optimal hyperplane of the maximal margin classifier is equidistant from
the two classes of observations. (Hint: you can use information on pages 9 and 10 of Lecture Notes 6.)
1.4) Consider a two-class classification problem where observations can be completely separated by
a hyperplane. What are the support vectors of the maximal margin classifier? Explain how moving
support vectors can change the maximal margin classifier, and how they can be moved without
changing it.
1.5) Consider a two-class classification problem where observations $\{(x_i, y_i)\}_{i=1}^n$ cannot be completely
separated by a hyperplane. What optimization problem does a support vector classifier (SVC)
try to solve? State it mathematically and explain the meaning of each term in the mathematical
formulation. Explain how the value of a slack variable reveals how its associated observation is
classified by the resulting SVC, and explain how the value of the tolerance affects the classification
of the $x_i$'s, the number of support vectors, and the margin of the resulting SVC. (Note: please do NOT
just copy contents from the lecture notes and paste them as your answers.)
1.6) Consider a two-class classification problem where observations $\{(x_i, y_i)\}_{i=1}^n$ cannot be completely
separated by a hyperplane. When constructing an SVC by solving the optimization problem via
Lagrange multipliers, there is a “cost” parameter $C$. Explain how the value of the cost $C$ affects the
classification of the $x_i$'s, the number of support vectors, and the margin of the resulting SVC. Is this $C$
the same as the tolerance mentioned in 1.5)?
1.7) Consider a two-class classification problem where training observations $\{(x_i, y_i)\}_{i=1}^n$ cannot
be completely separated by a hyperplane. When the decision boundary between the two classes
is nonlinear, what can you do to an SVC in order to deal with this situation, and what are some
disadvantages of what you propose to do? Is it true that an SVM is able to deal with this situation
and that it does so by implicitly enlarging the feature space using a kernel that can be different
from the Euclidean inner product? Provide a linear representation of an SVM, and comment on
how this representation is different from and similar to that of an SVC.
1.8) Describe how to conduct multi-class classification using SVMs.
Conceptual exercises: II (neural networks)
2.1) Describe how derived features are obtained by a vanilla, feedforward neural network that has 3
layers in total and 1 hidden layer.
2.2) Provide a criterion that is used to train a neural network for classification and for regression,
respectively.
2.3) What are some issues with training a neural network by optimizing a criterion you presented in
2.2), and how can they be dealt with?
Conceptual exercises: III (principal component analysis)
Assume there are $p$ feature variables $X_1, \ldots, X_p$ that are stored in the vector $X = (X_1, \ldots, X_p)^T$.
Let $\mathbf{X}$ be an $n \times p$ data matrix whose $i$th row is the $i$th observation on $X$. Assume the covariance
matrix of $X$ is $\Sigma$.
3.1) Describe in detail the population version of principal component analysis (PCA).
3.2) Provide the sample covariance matrix of $X$ that is obtained from $\mathbf{X}$. Describe in detail the
data version of PCA.
3.3) In the population version of PCA, the first principal component is a scalar random variable,
whereas in the data version of PCA, we have n scores for the first principal component. How are
the first principal component and its n scores related?
3.4) What does a biplot display? How can you discover patterns in data using a biplot?
3.5) When implementing the data version of PCA based on $\mathbf{X}$, is it recommended to center and
scale the observations in $\mathbf{X}$? If so, how and why?
3.6) What is a criterion to use to choose the number of principal components?
3.7) Consider the scalar random variable $w = a^T X$ for $a \in \mathbb{R}^p$. We want to find the unit vector
$a \in \mathbb{R}^p$ (i.e., with $\|a\| = 1$, without which the variance is unbounded) for which the variance of $w$
is maximized. Explain why $a$ should be an eigenvector associated with the largest eigenvalue $\lambda_1$ of $\Sigma$.
3.8) State the model and optimization problem PCA tries to solve when it is interpreted as the best
linear approximation to $\mathbf{X}$ under the Frobenius norm among all subspaces of dimension $q < p$. How is
this optimization problem related to regression modeling based on the least squares method?
Applied exercises
Consider the data set iris from the R library ggplot2. Here is the instructor’s ggplot2 version
packageVersion("ggplot2")
## [1] '3.1.0'
You can use help(iris) to obtain some help information on this data set, or you can do the
following:
library(ggplot2)
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
unique(iris$Species)
## [1] setosa versicolor virginica
## Levels: setosa versicolor virginica
From the iris data set, pick all observations for the species setosa or versicolor. This gives a
subset of 100 observations. From this subset, use set.seed(123) to randomly select 40 observations
for each of the 2 species, and put the 80 observations thus obtained into a training set. The
remaining 20 observations in the subset then form a test set.
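A minimal sketch of one way to construct this split (the object names sub, train_set, and test_set are illustrative; the exact rows selected depend on how the sampling is organized after set.seed(123), so follow the sampling scheme from the lecture if your results must match the instructor's):

library(ggplot2)
data(iris)
# keep only setosa and versicolor, and drop the unused factor level
sub <- droplevels(iris[iris$Species %in% c("setosa", "versicolor"), ])
set.seed(123)
# sample 40 row indices within each species for the training set
train_id <- c(sample(which(sub$Species == "setosa"), 40),
              sample(which(sub$Species == "versicolor"), 40))
train_set <- sub[train_id, ]
test_set  <- sub[-train_id, ]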
(4.1) Build an SVM using the training set with cost C = 0.1 and apply the obtained model to
the test set. Report classification results on the test set and provide needed visualizations. (Note:
the plot method for svm objects is not designed for more than 2 features; if an svm is built using
more than 2 features and you apply plot to the svm object, you will get an error.)
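A minimal sketch using the e1071 package and the split above (the linear kernel and the two-feature refit for plotting are assumptions, not requirements of the assignment):

library(e1071)
# fit on all four features with cost C = 0.1
svm_fit <- svm(Species ~ ., data = train_set, kernel = "linear", cost = 0.1)
svm_pred <- predict(svm_fit, newdata = test_set)
table(predicted = svm_pred, truth = test_set$Species)  # confusion matrix
# refit on two features so that plot.svm can draw the decision boundary
svm_2d <- svm(Species ~ Petal.Length + Petal.Width, data = train_set,
              kernel = "linear", cost = 0.1)
plot(svm_2d, train_set)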
(4.2) Build an SVM using the training set by 10-fold cross-validation and by setting set.seed(123),
in order to find the optimal value for the cost C from the range:
ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100))
Apply the model to the test set, and report classification results on the test set. Do you think an
SVM with a nonlinear decision boundary should be used for this classification task? If so, please
use an SVM with a radial kernel whose parameters are determined by 10-fold cross-validation on
the training set and by setting set.seed(123).
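A minimal sketch, again assuming e1071 (tune() performs 10-fold cross-validation by default; the gamma grid for the radial kernel is an illustrative assumption):

library(e1071)
set.seed(123)
# 10-fold CV over the given cost grid, linear kernel
tune_lin <- tune(svm, Species ~ ., data = train_set, kernel = "linear",
                 ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
best_lin <- tune_lin$best.model
table(predicted = predict(best_lin, test_set), truth = test_set$Species)
# radial-kernel analogue, tuning cost and gamma jointly
set.seed(123)
tune_rad <- tune(svm, Species ~ ., data = train_set, kernel = "radial",
                 ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100),
                               gamma = c(0.5, 1, 2)))
table(predicted = predict(tune_rad$best.model, test_set),
      truth = test_set$Species)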
(4.3) Use the training set, use set.seed(123), and apply 5-fold cross-validation to build an optimal
neural network model with 2 hidden layers of 5 and 7 hidden neurons, respectively. Apply the
optimal neural network model to the test set and report classification results. Note that you need
to make sure you know how R orders the class labels; this is explained in the lecture video
“Stat 437 Video 27b: neural network example 1”.
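A minimal sketch assuming the neuralnet package with a 0/1 encoding of the labels (the lecture video may use a different package or label encoding, so check the level order of Species before decoding predictions; the fold loop below estimates cross-validated error for the given architecture, and selecting tuning parameters would repeat it over candidate settings):

library(neuralnet)
set.seed(123)
train_nn <- train_set
# 0/1 response; levels(train_set$Species) is c("setosa", "versicolor") here
train_nn$y <- as.integer(train_nn$Species == "versicolor")
form <- y ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width
folds <- sample(rep(1:5, length.out = nrow(train_nn)))  # 5-fold CV indices
cv_err <- sapply(1:5, function(k) {
  fit <- neuralnet(form, data = train_nn[folds != k, ],
                   hidden = c(5, 7), linear.output = FALSE)
  prob <- compute(fit, train_nn[folds == k, 1:4])$net.result
  pred <- ifelse(prob[, 1] > 0.5, "versicolor", "setosa")
  mean(pred != as.character(train_nn$Species[folds == k]))
})
mean(cv_err)  # cross-validated misclassification rate
# final model on the full training set, applied to the test set
nn_fit <- neuralnet(form, data = train_nn, hidden = c(5, 7), linear.output = FALSE)
nn_prob <- compute(nn_fit, test_set[, 1:4])$net.result
nn_pred <- ifelse(nn_prob[, 1] > 0.5, "versicolor", "setosa")
table(predicted = nn_pred, truth = test_set$Species)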
(4.4) Apply PCA to all features of the full data set iris. Plot the first two principal components
against each other, coloring each point on the plot by its corresponding species. Do these
principal components reveal any systematic pattern in the features for any species? Plot the
cumulative percent of variation explained by all (successively ordered) principal components.
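A minimal sketch with prcomp and ggplot2 (centering and scaling here are assumptions; justify your own choice as in 3.5)):

# PCA on the four numeric features of the full iris data
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
scores <- data.frame(pca$x[, 1:2], Species = iris$Species)
library(ggplot2)
ggplot(scores, aes(PC1, PC2, color = Species)) + geom_point()
# cumulative percent of variation explained by successive PCs
cum_pve <- 100 * cumsum(pca$sdev^2) / sum(pca$sdev^2)
plot(cum_pve, type = "b", xlab = "Principal component",
     ylab = "Cumulative percent of variation explained")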