COMS 4771  HW2


You are allowed to write up solutions in groups of (at most) two students. These group members
need not be the same as in previous homeworks. Only one submission per group
is required by the due date on Gradescope. The name and UNI of all group members must be clearly
specified on the homework. No late homeworks are allowed. To receive credit, a typeset copy of
the homework pdf must be uploaded to Gradescope by the due date. You must show your work to
receive full credit. Discussing possible solutions for homework questions is encouraged on Piazza
and with peers outside your group, but every group must write their own individual solutions. You
must cite all external references you used (including the names of individuals you discussed the
solutions with) to complete the homework.
1 [Constrained optimization] Show that the distance from the hyperplane g(x) = w · x + w_0 = 0 to a point x_a is |g(x_a)|/‖w‖ by minimizing the squared distance ‖x − x_a‖² subject to the constraint g(x) = 0.
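(One standard way to set this up, sketched here only as a pointer and not as the full argument, is the Lagrange-multiplier formulation:

L(x, λ) = ‖x − x_a‖² + λ (w · x + w_0),

where one sets ∇_x L = 0 together with the constraint g(x) = 0, solves for the minimizer x*, and evaluates ‖x* − x_a‖.)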
2 [A better output Perceptron algorithm guarantee] In class, we saw that when the training
sample S is linearly separable with a maximum margin γ > 0, then the Perceptron algorithm
run cyclically over S is guaranteed to converge after T ≤ (R/γ)² updates, where R is the
radius of the sphere containing the sample points. This does not guarantee, however, that the
hyperplane solution returned by Perceptron, i.e. w_T, achieves a margin close to γ.
(i) Show an example training dataset S in R² that has margin γ, and an order of updates
made by the Perceptron algorithm such that the returned hyperplane solution has arbitrarily
bad margin on S.
(ii) Consider the following modification to the Perceptron algorithm:
Modified Perceptron Algorithm
Input: training dataset S = {(x_i, y_i)}_{i=1,...,n}
Output: learned vector w
- Initialize w_0 := 0, t := 0
- While there exists an example (x, y) ∈ S such that 2y(w_t · x) ≤ γ‖w_t‖:
  - set w_{t+1} := w_t + yx
  - set t := t + 1
- Return w_t.
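For concreteness, here is a minimal Python sketch of the update loop above (not part of what you must submit; it assumes the margin γ is known in advance and that S is given as NumPy arrays X, y with labels in {−1, +1}):

import numpy as np

def modified_perceptron(X, y, gamma, max_updates=100000):
    # Sketch of the Modified Perceptron Algorithm (MPA) above.
    # X: (n, d) array of examples, y: (n,) labels in {-1, +1}, gamma: assumed margin.
    w = np.zeros(X.shape[1])
    for _ in range(max_updates):
        margins = y * (X @ w)                                    # y_i (w . x_i) for every i
        violators = np.where(2 * margins <= gamma * np.linalg.norm(w))[0]
        if len(violators) == 0:                                  # no violating example: stop
            break
        i = violators[0]                                         # pick any violating example
        w = w + y[i] * X[i]                                      # Perceptron-style update
    return w

The max_updates cap is only a safeguard for this sketch; part (b) below is what bounds the number of updates analytically.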
(a) If the Modified Perceptron Algorithm (MPA) terminates after T rounds, what margin guarantee is achieved by the hyperplane w_T returned by MPA? Justify your answer.

(b) We will now prove step-by-step the mistake bound for the Modified Perceptron Algorithm (MPA).
i. Show that after T rounds, Tγ ≤ ‖w_T‖, and observe that if ‖w_T‖ < 4R²/γ, then T < 4R²/γ².
   In what follows, we will assume that ‖w_T‖ ≥ 4R²/γ.
ii. Show that for any iteration t at which a mistake was made, the following holds:
   ‖w_t‖² ≤ (‖w_{t−1}‖ + γ/2)² + R².
iii. Infer from this that for any iteration t, we have
   ‖w_t‖ ≤ ‖w_{t−1}‖ + γ/2 + R² / (‖w_{t−1}‖ + ‖w_t‖ + γ/2).
iv. Using the previous question, show that for any iteration t such that either ‖w_{t−1}‖ ≥ 4R²/γ or ‖w_t‖ ≥ 4R²/γ, we have
   ‖w_t‖ ≤ ‖w_{t−1}‖ + (3/4)γ.
v. Show that ‖w_0‖ ≤ R ≤ 4R²/γ. Since by assumption we have ‖w_T‖ ≥ 4R²/γ, conclude that there must exist some largest iteration t_0 such that ‖w_{t_0−1}‖ ≤ 4R²/γ and ‖w_{t_0}‖ ≥ 4R²/γ.
vi. Show that ‖w_T‖ ≤ ‖w_{t_0−1}‖ + (3/4)Tγ, and finally deduce the mistake bound.
3 [Making data linearly separable by feature space mapping] Consider the infinite-dimensional feature space mapping
Φ_σ : R → R^R
x ↦ ( 1[|α − x| < σ] · exp(−1/(1 − (|α − x|/σ)²)) )_{α∈R}.
(It may be helpful to sketch the function f(α) := 1[|α| < 1] · exp(−1/(1 − α²)) for
understanding the mapping and answering the questions below.)
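Since the hint above suggests sketching f, here is a small optional Python sketch for plotting it (assuming numpy and matplotlib are available):

import numpy as np
import matplotlib.pyplot as plt

def f(alpha):
    # f(α) = 1[|α| < 1] · exp(−1/(1 − α²)): a smooth "bump" supported on (−1, 1)
    alpha = np.asarray(alpha, dtype=float)
    out = np.zeros_like(alpha)
    inside = np.abs(alpha) < 1
    out[inside] = np.exp(-1.0 / (1.0 - alpha[inside] ** 2))
    return out

grid = np.linspace(-1.5, 1.5, 1001)
plt.plot(grid, f(grid))
plt.xlabel("α")
plt.ylabel("f(α)")
plt.show()

Viewed as a function of the index α, Φ_σ(x) is this bump recentered at x and rescaled to have width 2σ.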
(i) Show that for any n distinct points x_1, . . . , x_n, there exists a σ > 0 such that the mapping Φ_σ can linearly separate any binary labeling of the n points.
(ii) Show that one can efficiently compute dot products in this feature space, by giving an analytical formula for Φ_σ(x) · Φ_σ(x′) for arbitrary points x and x′.
4 [Designing socially aware classifiers] Traditional Machine Learning research focuses simply on improving accuracy. However, the model with the highest accuracy may be discriminatory and thus may have an undesirable social impact that unintentionally hurts minority groups¹. To overcome such undesirable impacts, researchers have in recent years put a lot of effort into the field called Computational Fairness.
¹ see e.g. Machine Bias by Angwin et al. for bias in recidivism prediction, and Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification by Buolamwini and Gebru for bias in face recognition
Two central problems of Computational Fairness are: (1) What is an appropriate definition of fairness that works under different settings of interest? (2) How can we achieve the proposed definitions without sacrificing prediction accuracy?
In this problem, we will focus on some of the ways we can address the first problem. There are two categories of fairness definitions: individual fairness² and group fairness³. Most works in the literature focus on group fairness. Here we will study some of the most popular group fairness definitions and explore them empirically on a real-world dataset.
Generally, group fairness is concerned with ensuring that group-level statistics are the same across all groups. A group is usually formed with respect to a feature called the sensitive attribute. The most common sensitive features include gender, race, age, religion, income level, etc. Thus, group fairness ensures that statistics across the sensitive attribute (such as across, say, different age groups) remain the same.
For simplicity, we only consider the setting of binary classification with a single sensitive
attribute. Unless stated otherwise, we also consider the sensitive attribute to be binary. (Note
that the binary assumption is only for convenience and results can be extended to non-binary
cases as well.)
Notations:
Denote by X ∈ R^d, A ∈ {0, 1} and Y ∈ {0, 1} three random variables: the non-sensitive features of an instance, the instance's sensitive feature, and the target label of the instance, respectively, such that (X, A, Y) ∼ D. Denote a classifier f : R^d → {0, 1} and denote Ŷ := f(X).
For simplicity, we also use the following abbreviations:
P := P_{(X,A,Y)∼D} and P_a := P_{(X,a,Y)∼D}.
We will explore the following three fairness definitions.
- Demographic Parity (DP)
P_0[Ŷ = ŷ] = P_1[Ŷ = ŷ]  ∀ŷ ∈ {0, 1}
(equal positive rate across the sensitive attribute)
- Equalized Odds (EO)
P_0[Ŷ = ŷ | Y = y] = P_1[Ŷ = ŷ | Y = y]  ∀ŷ, y ∈ {0, 1}
(equal true positive- and true negative-rates across the sensitive attribute)
- Predictive Parity (PP)
P_0[Y = y | Ŷ = ŷ] = P_1[Y = y | Ŷ = ŷ]  ∀ŷ, y ∈ {0, 1}
(equal positive predictive- and negative predictive-values across the sensitive attribute)
² see e.g. Fairness Through Awareness by Dwork et al.
³ see e.g. Equality of Opportunity in Supervised Learning by Hardt et al.
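Empirically, each definition can be checked by comparing the corresponding rates between the two sensitive groups. A rough Python sketch (for illustration only; the array names y_true, y_pred, a are placeholders, not names fixed by the assignment):

import numpy as np

def group_rates(y_true, y_pred, a, group):
    # Empirical rates within the sensitive group A == group.
    mask = (a == group)
    yt, yp = y_true[mask], y_pred[mask]
    return {
        "positive_rate": np.mean(yp == 1),        # compared across groups for DP
        "tpr": np.mean(yp[yt == 1] == 1),         # compared across groups for EO
        "tnr": np.mean(yp[yt == 0] == 0),         # compared across groups for EO
        "ppv": np.mean(yt[yp == 1] == 1),         # compared across groups for PP
        "npv": np.mean(yt[yp == 0] == 0),         # compared across groups for PP
    }

Comparing group_rates(..., group=0) with group_rates(..., group=1) shows how far each definition is from being exactly satisfied; this may be useful for part (v) below.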
Part 0: The basics.
(i) Why is it not enough to just remove the sensitive attribute A from the dataset to achieve
fairness as per the definitions above? Explain with a concrete example.
Part 1: Sometimes, people write the same fairness definition in different ways.
(ii) Show that the following two definitions for Demographic Parity are equivalent under our setting:
P_0[Ŷ = 1] = P_1[Ŷ = 1]  ⟺  P[Ŷ = 1] = P_a[Ŷ = 1]  ∀a ∈ {0, 1}
Part 2: In this part, we will explore the COMPAS dataset (available in hw2data.zip).
The task is to predict two year recidivism. Download the COMPAS dataset from the class’s
website. In this dataset, the target label Y is two year recid and the sensitive feature A
is race.
(iii) Develop the following classifiers for the given dataset: (1) an MLE-based classifier, (2) a nearest neighbor classifier, and (3) a decision tree classifier.
For the MLE classifier, you can model the class conditional densities by a multivariate Gaussian distribution. For the nearest neighbor classifier, you should consider different values of k and different distance metrics (e.g. L1, L2, L∞).
(You may use built-in functions to develop your classifiers; you do not need to write them from scratch. A starter sketch is given after part (vi) below.)
You do not need to submit any code on Courseworks.
(iv) Which classifier (discussed in the previous part) is better for this prediction task? You must justify your answer with appropriate performance graphs demonstrating the superiority of one classifier over the others. Example things to consider: how does the training sample size affect the classification performance?
(v) To what degree are the fairness definitions satisfied by each of the classifiers you developed? Show your results with appropriate performance graphs.
For each fairness measure, which classifier is the most fair? How would you summarize the differences between these algorithms?
(vi) Choose any one of the three fairness definitions. Describe a real-world scenario where this definition is most reasonable and applicable. What are the potential disadvantage(s) of this fairness definition?
(You are free to reference online and published materials to understand the strengths and weaknesses of each of the fairness definitions. Make sure to cite all your resources.)
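Starter sketch for part (iii), using scikit-learn. This is only an illustration under assumptions about hw2data.zip: the file name compas.csv and the column names two_year_recid and race are guesses, so replace them with whatever the provided data actually uses.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed file/column names; check hw2data.zip for the actual ones.
df = pd.read_csv("compas.csv")
y = df["two_year_recid"]                                   # target label Y
A = df["race"]                                             # sensitive attribute, needed for part (v)
X = pd.get_dummies(df.drop(columns=["two_year_recid"]))    # one-hot encode categorical features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    # MLE with multivariate Gaussian class-conditional densities (one covariance per class);
    # reg_param regularizes the covariance estimates.
    "mle_gaussian": QuadraticDiscriminantAnalysis(reg_param=0.1),
    # try several values of k and distance metrics (p=1: L1, p=2: L2, ...)
    "knn": KNeighborsClassifier(n_neighbors=5, p=2),
    "decision_tree": DecisionTreeClassifier(max_depth=10),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, model.score(X_te, y_te))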