ESE 402/542 Final Review
Core Concepts in the Second Half
I Regression: least-squares.
I Classification: Logistic Regression, Linear Discriminant
Analysis (LDA).
I Unsupervised Learning: k-means clustering, Principal
Components Analysis (PCA).
I A Unified Framework: Probably-Approximately-Correct (PAC)
Learning. VC-dimension.
Notes on the Final
I More emphasis on general understanding of methods discussed
in class: motivations, where to use them, limitations.
I There should be little computational math once the question is
understood.
Classification: Logistic Regression
I Motivation: given data points {xi}, want to assign labels
yi ∈ {0, 1} (can be extended beyond two-class) to each point.
I Simple approach: fit the probability P [yi = 0 | xi ] (the probability of
yi = 1 follows). Whichever probability is greater, we choose that
label.
I How to fit these probabilities as a function of x? First
attempt: linear!
P [y = 0|X = x ] = β0 + β1x .
What’s the problem? A linear fit can be negative or greater than 1,
and there is no way to enforce these constraints without ruining the
simplicity. We have to do something (slightly) smarter.
Classification: Logistic Regression
I What’s a way to parameterize a probability? The exponential
function is a natural choice to ensure non-negativity, and
normalizing keeps the value at most 1:
P [y = 0 | X = x ] = exp(β0 + β1x) / (1 + exp(β0 + β1x))
=⇒ P [y = 1 | X = x ] = 1 / (1 + exp(β0 + β1x))
I Logistic regression problem is now to fit parameters β0 and β1.
I Natural way to do this? Maximum likelihood estimation.
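A minimal numerical sketch of the MLE fit (not the course's reference code): synthetic 1-D data, the slide's convention that the sigmoid models P[y = 0 | x], and a plain gradient-ascent loop. The data-generating values, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: fit beta0, beta1 by gradient ascent on the log-likelihood,
# using the slide's convention that sigmoid(beta0 + beta1*x) models P[y = 0 | x].
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-3, 3, size=n)
true_b0, true_b1 = 0.5, 1.5
p0 = 1 / (1 + np.exp(-(true_b0 + true_b1 * x)))   # P[y = 0 | x]
y = (rng.random(n) > p0).astype(float)            # y = 1 with probability 1 - p0

b0, b1, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    p0_hat = 1 / (1 + np.exp(-(b0 + b1 * x)))
    # Gradient of (1/n) * sum_i [(1 - y_i) log p0_i + y_i log(1 - p0_i)] w.r.t. (b0, b1):
    grad_common = (1 - y) - p0_hat
    b0 += lr * np.mean(grad_common)
    b1 += lr * np.mean(grad_common * x)
print(b0, b1)   # should land near the true values (0.5, 1.5)
```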
Classification: LDA
I Broad idea: assume that the conditional distribution of the data
x given each label (y = 0 and, accordingly, y = 1) is Gaussian.
I Main simplifying assumption: the covariance matrices of the y = 0
and y = 1 conditional distributions are assumed to be identical (if
not, this leads to Quadratic Discriminant Analysis).
I Solve for LDA using Maximum Likelihood Estimation. Check
class notes for details.
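A minimal sketch of the LDA fit via MLE, assuming two classes with a shared covariance; the function names and the synthetic data below are our own illustrative choices, not from the class notes.

```python
import numpy as np

def fit_lda(x, y):
    """Hedged LDA sketch: MLE of the class prior, class means, and a pooled covariance."""
    pi1 = y.mean()                                       # P(y = 1)
    mu0, mu1 = x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)
    # Pooled covariance: deviations of each point from its own class mean.
    centered = np.where(y[:, None] == 1, x - mu1, x - mu0)
    sigma = centered.T @ centered / len(x)
    return pi1, mu0, mu1, sigma

def predict_lda(x, pi1, mu0, mu1, sigma):
    """Assign the label with the larger Gaussian discriminant (shared covariance)."""
    inv = np.linalg.inv(sigma)
    def score(mu, prior):
        return x @ inv @ mu - 0.5 * mu @ inv @ mu + np.log(prior)
    return (score(mu1, pi1) > score(mu0, 1 - pi1)).astype(int)

# Example usage on synthetic Gaussian classes sharing a covariance.
rng = np.random.default_rng(4)
x = np.vstack([rng.normal([0, 0], 1.0, (200, 2)), rng.normal([2, 2], 1.0, (200, 2))])
y = np.repeat([0, 1], 200)
params = fit_lda(x, y)
print((predict_lda(x, *params) == y).mean())   # training accuracy, should be high
```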
PCA
I Have data {xi ∈ R^d}, i = 1, . . . , n; assume zero mean (otherwise, center it)
I Want to project each xi to a lower-dimensional representation
zi ∈ R^d′ using a linear transformation
I Let Σ = (1/n) ∑_{i=1}^n xi xi^T be the (sample) covariance matrix of the data
I Eigendecompose Σ = UΛU^T where Λ = diag(σ1^2, . . . , σd^2) (since Σ is
symmetric PSD, this coincides with the SVD)
I Pick the first d′ eigenvectors (columns of U)
I This is the basis to project onto
I Total energy (variance of the centered data) is ∑_{i=1}^d σi^2
I Variance explained by the projection is ∑_{i=1}^d′ σi^2
I Error of the projection is ∑_{i=d′+1}^d σi^2
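A minimal sketch of the recipe above on synthetic data; the data-generating matrix and the choice d′ = 2 are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: PCA via the eigendecomposition of the sample covariance,
# plus the variance-explained bookkeeping from the slide.
rng = np.random.default_rng(5)
x = rng.standard_normal((500, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])  # data in R^5
x = x - x.mean(axis=0)                       # center (zero-mean assumption)

cov = x.T @ x / len(x)                       # Sigma = (1/n) sum_i x_i x_i^T
eigvals, U = np.linalg.eigh(cov)             # symmetric PSD => eigh
order = np.argsort(eigvals)[::-1]            # sort eigenvalues in descending order
eigvals, U = eigvals[order], U[:, order]

d_prime = 2
z = x @ U[:, :d_prime]                       # project onto the top d' eigenvectors
print("projected shape:", z.shape)

total_energy = eigvals.sum()
explained = eigvals[:d_prime].sum() / total_energy
print(f"fraction of variance explained by {d_prime} components: {explained:.3f}")
print("projection error (sum of remaining eigenvalues):", eigvals[d_prime:].sum())
```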
Vapnik-Chervonenkis (VC) Dimension
I General question: why don’t we just use very expressive
functions to perfectly classify everything?
I Intuition developed so far: we should worry about
generalization error of such classifiers (overfitting).
I What is a way to capture “expressivity” of a function class?
Enter the VC dimension.
Vapnik-Chervonenkis (VC) Dimension
I Given classifier function class H = {h(·)} (e.g. linear
predictors, logistic classification, neural nets etc), quantify its
expressivity by “shattering number”.
I Shattering: H shatters a set of n data points {xi}, i = 1, . . . , n, if
for every labelling {yi} of those points there exists a function
h ∈ H such that h(xi ) = yi for all i = 1, . . . , n.
The VC dimension of H is the size of the largest set of points
that H can shatter.
I Idea: if a function class has high VC dimension, it can perfectly
fit every labelling of some very large dataset. This might arouse
suspicion, as this includes randomly assigned labels.
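As a small illustration of shattering (not from the slides), the sketch below brute-forces the VC dimension of one-sided threshold classifiers h_t(x) = 1{x ≥ t} on the real line, which is 1.

```python
import itertools

# Hedged sketch: check whether threshold classifiers h_t(x) = 1{x >= t}
# can shatter a given set of points on the real line.

def realizable(points, labels, thresholds):
    """True if some threshold classifier produces exactly `labels` on `points`."""
    for t in thresholds:
        if all((x >= t) == bool(y) for x, y in zip(points, labels)):
            return True
    return False

def is_shattered(points):
    # Candidate thresholds: below, between, and above the points.
    pts = sorted(points)
    thresholds = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [pts[-1] + 1.0]
    return all(realizable(points, labels, thresholds)
               for labels in itertools.product([0, 1], repeat=len(points)))

print(is_shattered([0.0]))        # True: any single point can be shattered
print(is_shattered([0.0, 1.0]))   # False: the labelling (1, 0) is not realizable
# So the VC dimension of this class is 1.
```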
PAC
I “Probably-Approximately-Correct” Learning. To break it
down: fixing a failure probability δ ∈ [0, 1] and an accuracy ε > 0,
we want a lower bound on the dataset size n such that, taking the
probability over all datasets S of size n,
P_S [ |err(hS) − err(h∗)| < ε ] ≥ 1 − δ,
where the event |err(hS) − err(h∗)| < ε is the “Approximately
Correct” part and the 1 − δ is the “Probably” part.
I General idea of PAC: we want to ensure that the empirical solution,
i.e. any classifier/function hS that we learn from a finite amount of
data, gets performance comparable to the best choice of function h∗
in the class (at its richest, the Bayes optimal classifier), at least
most of the time. This is a theoretical quantification of
generalization performance.
2019SP P1 Parts 1-3
I Done in lecture/recitation (MLE under normally distributed
noise is equivalent to least squares).
2019SP P1 Part 4
Problem: Suppose yi = β0 + β1xi + εi , where the εi ∼ N (0, σi^2) are
independent. Find the MLE of β0, β1.
Solution: yi ∼ N (β0 + β1xi , σi^2), so

f (yi ) = (1 / (√(2π) σi)) exp( −(1/(2σi^2)) (yi − β0 − β1xi )^2 )

max_{β0,β1} ∑ log f (yi ) = max_{β0,β1} ∑ −(1/(2σi^2)) (yi − β0 − β1xi )^2
(the constant terms don’t depend on the betas)
= min_{β0,β1} ∑ (1/σi^2) (yi − β0 − β1xi )^2
(dropping the factor of 1/2 does not change the minimizer)
2019SP P1 Part 4 cont.
Take derivatives...
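Taking those derivatives yields a weighted least-squares solve. A minimal numerical sketch (not the exam's solution) on synthetic data with assumed-known noise levels σi:

```python
import numpy as np

# Hedged sketch: the MLE above is weighted least squares,
# min over (beta0, beta1) of sum_i (1/sigma_i^2) * residual_i^2.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3, 3, size=n)
sigma = rng.uniform(0.5, 2.0, size=n)                  # heteroscedastic noise levels
y = 1.0 + 2.0 * x + sigma * rng.standard_normal(n)     # true beta0 = 1, beta1 = 2

# Closed form: solve the normal equations of the weighted problem.
X = np.column_stack([np.ones(n), x])                   # design matrix [1, x_i]
w = 1.0 / sigma**2                                     # weights from the likelihood
beta_hat = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
print(beta_hat)   # should be close to [1.0, 2.0]
```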
2019SP P2 Part 1
Problem: Solve min_c ∑_{i=1}^n wi ‖xi − c‖2^2
Solution: Take the derivative w.r.t. c and set it to 0:

∑_i 2wi (xi − c) = 0
∑_i wi xi − (∑_i wi ) c = 0
c = (∑_i wi xi ) / (∑_i wi )
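A quick numerical sanity check of the closed form (illustrative, not part of the exam), comparing it against a generic optimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Hedged check: verify c = sum_i w_i x_i / sum_i w_i against numerical minimization.
rng = np.random.default_rng(1)
x = rng.standard_normal((50, 2))          # 50 points in R^2
w = rng.uniform(0.1, 5.0, size=50)        # positive weights

c_closed_form = (w[:, None] * x).sum(axis=0) / w.sum()

objective = lambda c: np.sum(w * np.sum((x - c) ** 2, axis=1))
c_numeric = minimize(objective, x0=np.zeros(2)).x

print(np.allclose(c_closed_form, c_numeric, atol=1e-4))   # True
```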
2019SP P2 Part 2
Problem: Extend k-means to the weighted setting:

min_{c1,...,cK} ∑_{i=1}^n wi min_{j∈[K]} ‖xi − cj‖2^2

Solution: Each data point has a weight wi . Alternate between
assigning points to clusters and recomputing centroids, exactly as in
ordinary k-means, except with weights: to recompute centroid cj ,
instead of setting it to the average of the points in cluster j, use
the weighted mean from the previous part. The assignment step is
unchanged. (A small sketch follows below.)
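A minimal sketch of the weighted Lloyd iteration described above; the function name, initialization, and synthetic data are our own illustrative assumptions.

```python
import numpy as np

def weighted_kmeans(x, w, k, n_iters=100, seed=0):
    """Hedged sketch of Lloyd-style weighted k-means (names and signature are ours)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)]   # init from data points
    for _ in range(n_iters):
        # Assignment step: unchanged from ordinary k-means.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: weighted mean of each cluster (the closed form from Part 1).
        for j in range(k):
            mask = labels == j
            if mask.any():
                centroids[j] = (w[mask, None] * x[mask]).sum(axis=0) / w[mask].sum()
    return centroids, labels

# Example usage on synthetic 2-D data.
rng = np.random.default_rng(2)
x = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
w = rng.uniform(0.5, 2.0, size=200)
centroids, labels = weighted_kmeans(x, w, k=2)
print(centroids)   # roughly one centroid near (0, 0) and one near (5, 5)
```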
2019SP P3
Given in HW7, and in previous slide.
2020FA P2 Part 1
Q: Given P(Y = +1) = 1 − P(Y = −1) = 3/4 and
P(X = x |Y = y) = (1/2) e^{−|x−2y|}, find P(X = x ,Y = y).
A: Y takes the value +1 with probability 3/4 and −1 with probability 1/4.
P(X = x ,Y = y) = P(Y = y) P(X = x |Y = y), so
P(X = x ,Y = +1) = (3/4) · (1/2) e^{−|x−2|} and
P(X = x ,Y = −1) = (1/4) · (1/2) e^{−|x+2|}.
2020FA P2 Part 2
Q: Plot P(X = x ,Y = +1), P(X = x ,Y = −1).
A: Two Laplace densities, one centered at 2, the other at −2.
The one centered at 2 should be taller (it carries weight 3/4 vs. 1/4).
2020FA P2 Part 3
Q: Derive the Bayes optimal classifier.
A: Find argmaxy P(Y = y |X = x). Multiplying by P(X = x) does not
change the argmax, so this equals argmaxy P(Y = y ,X = x) =
argmax{(3/8) e^{−|x−2|}, (1/8) e^{−|x+2|}}, where the first option
corresponds to y = +1 and the second to y = −1.
Decision rule: predict +1 if (3/8) e^{−|x−2|} > (1/8) e^{−|x+2|} and −1
otherwise. Taking logs: log 3 − |x − 2| ≷ −|x + 2|. By the plot, the
boundary can only lie in x ∈ [−2, 0]; there |x − 2| = 2 − x and
|x + 2| = x + 2, so log 3 − (2 − x) ≷ −x − 2 =⇒ x ≷ −(log 3)/2.
So predict +1 when x > −(log 3)/2 and −1 otherwise.
2020FA P2 Part 4
Q: What is the error rate of the Bayes optimal classifier?
A: Let c = −(log 3)/2. Then P(h∗(X) 6= Y ) = P(X > c |Y = −1) · 1/4 +
P(X < c |Y = +1) · 3/4 = (1/4)(1 − Φ(c + 2)) + (3/4) Φ(c − 2), where Φ
is the Laplace(0, 1) CDF.
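A Monte Carlo sanity check (illustrative, not part of the exam): simulate from the model, apply the rule x > −(log 3)/2, and compare with the analytic error.

```python
import numpy as np

# Hedged check: simulate (X, Y), apply the Bayes decision rule, and compare the
# empirical error rate with the analytic expression using the Laplace(0, 1) CDF.
rng = np.random.default_rng(3)
n = 1_000_000
y = np.where(rng.random(n) < 0.75, 1, -1)      # P(Y = +1) = 3/4
x = rng.laplace(loc=2.0 * y, scale=1.0)        # X | Y = y ~ Laplace(2y, 1)

c = -np.log(3) / 2
y_hat = np.where(x > c, 1, -1)                 # Bayes decision rule
print("empirical error:", np.mean(y_hat != y))

laplace_cdf = lambda t: np.where(t < 0, 0.5 * np.exp(t), 1 - 0.5 * np.exp(-t))
analytic = 0.25 * (1 - laplace_cdf(c + 2)) + 0.75 * laplace_cdf(c - 2)
print("analytic error: ", analytic)            # ~0.059; the two should agree
```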
2020FA P2 Part 5
QDA (will not be covered)
2020FA P2 Part 6
QDA
2020FA P2 Part 7
QDA
2020FA P2 Part 8
QDA
2020FA P3 Part 1
Q: Find the VC dimension of H1.
A: We claim the VC dimension is 1. To see this, first observe that
any single point can be classified correctly. Without loss of
generality, let x = c for some real number c ∈ R. If the
corresponding label is y = 1, we can simply set a = c for our
classifier. If the corresponding label is y = 0, we can set
a = c + 10^6 (no real significance to the number; it just ensures the
classifier outputs h(c) = 0).
On the other hand, if we have 2 data points both labelled y = 1, they
must necessarily satisfy |x1 − x2| ≤ 6, since the intervals on which
any classifier in H1 outputs h(x) = 1 span a distance of at most 6.
So all we need to do is select x1 = c and x2 = c + 10^6, so that no
single classifier can achieve h(x1) = h(x2) = 1 simultaneously.
2020FA P3 Part 2
Q: Find the VC dimension of H2.
A: 0. As defined, for any ha,b, the origin x = 0 is necessarily
classified as ha,b(x) = 1. Therefore, let us consider one point
placed at the origin with label y = 0. No function ha,b will be able
to classify this point correctly.