STAT 435: Introduction to Statistical Machine Learning Spring 2021
Homework 02
Due May 05, 2021 by 11:59 PM
Instructor: Ema Perković    TA: Apara Venkat
This homework is worth 35 points. There are 5 problems, and each begins on a new page. Question
1(c) and Question 5 are extra credit and are worth an additional 5 points. So it is possible to earn
40 points in this homework. Clearly list your collaborators, if any, at the top of your submission
(the final writeup should be your own). For example, “I worked with Hermione Granger on this
homework.” You can handwrite the math calculations, but submit a single PDF document to
Gradescope. Do not submit any R code unless explicitly directed.
1. (8 + 2 pt) In this exercise, we will simulate data with a qualitative response, y ∈ {1, 2}, and
perform k-nearest neighbors. Take p = 2 and n = 200. So, we have 200 data points and 2
different independent variables, {(x_{i1}, x_{i2})}.
(a) (1 pt) First, let us generate data with a non-linear Bayes decision boundary. Start by
sampling a vector x_1 of length 200 from N(2, 2^2). Next, sample a vector x_2 of length
200 from Uniform(−10, 10). Now, generate a vector ε of random noise from N(0, 1); this
ensures that our decisions are imperfect. Now compute the decision boundary as:

f(x_1, x_2) = x_2 − (1/4)(x_1 − 2)^2 + 1 + ε        (1)
Use the following classification criterion to label your data points:

y_i = 1 if f(x_{i1}, x_{i2}) > 0, and y_i = 2 otherwise.        (2)
Ensure that you have an (approximately) equal number of points for y = 1 and y = 2. (120,
80) is okay, but (190, 10) is a skewed dataset.
Suppose you wanted to generate data with a linear decision boundary. How would you
modify Equation 1? How would you modify Equation 2 to allow for more than 2 classes?
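For concreteness, a minimal R sketch of the simulation described in this part might look like the following (the object names and the seed are illustrative, not required):

set.seed(435)                              # arbitrary seed for reproducibility
n   <- 200
x1  <- rnorm(n, mean = 2, sd = 2)          # x1 ~ N(2, 2^2)
x2  <- runif(n, min = -10, max = 10)       # x2 ~ Uniform(-10, 10)
eps <- rnorm(n, mean = 0, sd = 1)          # noise
f   <- x2 - (1/4) * (x1 - 2)^2 + 1 + eps   # Equation (1)
y   <- ifelse(f > 0, 1, 2)                 # Equation (2)
table(y)                                   # check the class balance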
(b) (1 pt) Plot the data points {(x_{i1}, x_{i2})}. Color code each point based on the response y_i.
Also plot the decision boundary. Present a single plot for this question. See Figure 2.13
in ISLR for an example.
(c) Extra Credit (2 pt) Write a function knn(x_train, y_train, x_new, k) that takes
in training data, labels for the training data, a new data point, and the number of
neighbors k. This function will perform k-nearest neighbors and return the label for the
new data point. Submit the code for just this function. Do not submit extra code.
Hint: You can speed up your function by computing distances between all pairs of points,
sorting them, and storing the sorted neighbors in a matrix beforehand (this speed-up
technique is called memoization). Then you will simply need to query this matrix to get
the k nearest neighbors for any point.
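As one possible starting point, a brute-force sketch of such a function (without the speed-up from the hint, with Euclidean distance and an arbitrary tie-break) could be:

# x_train: n x p matrix, y_train: length-n label vector, x_new: length-p vector
knn <- function(x_train, y_train, x_new, k) {
  # Euclidean distance from x_new to every training point
  d <- sqrt(rowSums(sweep(x_train, 2, x_new)^2))
  # labels of the k closest training points
  nbrs <- y_train[order(d)[1:k]]
  # majority vote among the neighbors
  as.numeric(names(which.max(table(nbrs))))
}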
(d) (2 pt) Perform k-nearest neighbors on the 200 training data points for k = 1, 2, . . . , 50.
Compute the training error for each k. What is the k with the lowest training error?
Present a single plot where all the training data points are predicted using this k with
the true Bayes decision boundary. Also present a 2 × 2 confusion matrix.
You can use the knn function from the class package for this question.
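As a sketch, the sweep over k could look like the following, assuming the training data sit in a matrix x_train with labels y_train (names are illustrative):

library(class)
train_err <- sapply(1:50, function(k) {
  pred <- class::knn(train = x_train, test = x_train, cl = y_train, k = k)
  mean(pred != y_train)                    # training error for this k
})
best_k <- which.min(train_err)             # k with the lowest training error
table(predicted = class::knn(x_train, x_train, cl = y_train, k = best_k),
      truth = y_train)                     # 2 x 2 confusion matrix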
(e) (2 pt) Using the same data generating process in (a), generate 5000 new points. This
will constitute your test dataset. Perform k-nearest neighbors on the 5000 test data
points for k = 1, 2, . . . , 50. Compute the test error for each k. Report the smallest test
error and the corresponding k.
Hint: When performing k nearest neighbors on the test data, you will check for neighbors
among the training data, and not among the test data.
(f) (2 pt) Make a single plot with 1/k on the horizontal axis and the error on the vertical
axis. Plot the training error and the test error as two separate lines in the same plot.
Choose the scale of the axes so the trends can be seen clearly. See Figure 2.17 in ISLR
for an example. Comment on what you see. Does anything surprise you? What k will
you choose based on your results?
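A minimal sketch of this plot, assuming train_err and test_err are the length-50 error vectors from parts (d) and (e); the log scale on 1/k mirrors Figure 2.17 but is optional:

k <- 1:50
plot(1/k, train_err, type = "l", col = "blue", log = "x",
     xlab = "1/k", ylab = "error rate",
     ylim = range(c(train_err, test_err)))
lines(1/k, test_err, col = "red")
legend("topleft", legend = c("training error", "test error"),
       col = c("blue", "red"), lty = 1)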
2. (9 pt) In this question, we will generate binary data using a simple probabilistic model. We
will fit a logistic regression model and attempt to recover the true parameters.
(a) (2 pt) First, we will generate our responses. We will use the logit model (or the logistic
distribution) to help us. Start by sampling a vector x of length n = 100 from
Uniform(−10, 10). Recall that under the logistic model, the probability of a success is
given as

P(Y = 1 | X = x) = 1 / (1 + e^{−(β_0 + β_1 x)})        (3)
Take [β_0, β_1] = [1, 2]. Compute p, the vector of success probabilities, using Equation 3.
Note that y_i | p_i ∼ Bernoulli(p_i). So we can simulate the outcomes using a Bernoulli
(or binomial) trial. Generate y_i with the rbinom(1, size = 1, prob = p_i) function,
where p_i is P(Y = 1 | X = x_i). Generate one outcome for each of the 100 x_i values.
Call this vector of outcomes y.
Present two plots. On the first, plot x vs. p. On the second, plot x vs. y.
Hint: The following code snippet may be useful to avoid for loops:
# generate vector p of length n as defined above
# y is a vector of length n where y[i] is Bernoulli(p[i])
y <- rbinom(n = n, size = 1, prob = p)
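Putting part (a) together, a minimal sketch with β_0 = 1 and β_1 = 2 as given (object names and seed are illustrative) might be:

set.seed(435)                         # arbitrary seed
n <- 100
x <- runif(n, min = -10, max = 10)    # x ~ Uniform(-10, 10)
p <- 1 / (1 + exp(-(1 + 2 * x)))      # Equation (3) with (beta_0, beta_1) = (1, 2)
y <- rbinom(n = n, size = 1, prob = p)
plot(x, p)                            # first plot: x vs. p
plot(x, y)                            # second plot: x vs. y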
(b) (2 pt) Fit a logistic model in R using the glm function on this data. Report the estimated
coefficients and the standard errors. Interpret the estimated coefficient β̂_1. Are β̂_0
and β̂_1 close to the true parameters? If not, provide possible explanations as to why they
are different.
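For instance, the fit in part (b) might look like the following, assuming x and y are the objects from part (a):

fit_logit <- glm(y ~ x, family = binomial)   # logistic regression of y on x
summary(fit_logit)$coefficients              # estimates and standard errors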
(c) (2 pt) Compute the predicted probabilities p̂ using the predict function with argument
type = "response". Using the predicted probabilities, declare the predicted outcome
as ŷ_i = 1 if p̂_i ≥ 0.5 and ŷ_i = 0 otherwise. Generate and present a 2 × 2 confusion
matrix. Also compute the error in predictions using the confusion matrix. Comment on
your observations.
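A sketch of this step, assuming fit_logit is the model from part (b) and that both classes appear among the predictions:

p_hat <- predict(fit_logit, type = "response")   # predicted probabilities
y_hat <- ifelse(p_hat >= 0.5, 1, 0)              # predicted outcomes
conf  <- table(predicted = y_hat, truth = y)     # 2 x 2 confusion matrix
conf
1 - sum(diag(conf)) / sum(conf)                  # misclassification error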
(d) (1 pt) For this question alone, repeat parts (a) and (b) with n = 10^3 and n = 10^4 (no
need for interpretation). Compare the estimated coefficients for the different values of
n against the true parameters. Comment on what you see.
(e) (2 pt) In this question, we will go back to the n = 100 case and attempt to fit a linear model
instead. First fit a linear model y := a + bx. Obtain the predicted values using the predict
function. Call the predicted values ℓ. Declare ŷ_i = 1 if ℓ_i ≥ 0.5 and ŷ_i = 0 otherwise.
Generate and present a 2 × 2 confusion matrix. Compare this with your answer in part
(c). The numbers should look similar. Give a plausible explanation.
Hint: If you sampled x from Uniform(−a, a) for a > 0, would your answers in parts
(c) and (e) still look similar? What if you generated x from Uniform(a, b) where
a < 0, b > 0, and |a| ≠ |b|? What about Uniform(a, b) where a, b > 0?
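A sketch of the linear-model version, again with illustrative names and the same x and y as before:

fit_lm   <- lm(y ~ x)                          # linear model for the 0/1 outcome
ell      <- predict(fit_lm)                    # predicted values
y_hat_lm <- ifelse(ell >= 0.5, 1, 0)           # thresholded predictions
table(predicted = y_hat_lm, truth = y)         # 2 x 2 confusion matrix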
3. (9 pt) In this question, we will simulate new data and explore linear and quadratic discrim-
inant analysis (LDA and QDA) techniques. In particular, we will use p = 2 features, K = 3
classes, and n = 50 × K = 150 total data points (50 per class).
(a) (1 pt) Fix µ_1 = [3, 2], µ_2 = [1, 0], µ_3 = [5, 2]. And fix the covariance matrices

Σ_1 = Σ_3 = [2, 0; 0, 2],    Σ_2 = [4, 3; 3, 5]
Generate 50 data points for each class k = 1, 2, 3 that follow a N(µ_k, Σ_k) distribution
using the mvrnorm function from the MASS package. Make a single plot that shows a
scatter plot of all 150 data points color coded by class. Scale your axes appropriately.
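A minimal sketch of this simulation, using the parameter values as printed above and illustrative object names:

library(MASS)
set.seed(435)                                   # arbitrary seed
mu    <- list(c(3, 2), c(1, 0), c(5, 2))
Sigma <- list(matrix(c(2, 0, 0, 2), 2),         # Sigma_1
              matrix(c(4, 3, 3, 5), 2),         # Sigma_2
              matrix(c(2, 0, 0, 2), 2))         # Sigma_3
X   <- do.call(rbind, lapply(1:3, function(k) mvrnorm(50, mu[[k]], Sigma[[k]])))
cls <- rep(1:3, each = 50)                      # class labels
plot(X[, 1], X[, 2], col = cls, xlab = "x1", ylab = "x2")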
(b) (1 pt) Fit a linear discriminant analysis model (LDA). Report the training error and
3 × 3 confusion matrix.
(c) (1 pt) Fit a quadratic discriminant analysis model (QDA). Report the training error and
3 × 3 confusion matrix. Compare with the results in part (b).
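For parts (b) and (c), a sketch of the fits, assuming the simulated training data are in a matrix X with class labels cls as in the sketch above:

train_df <- data.frame(x1 = X[, 1], x2 = X[, 2], class = factor(cls))
fit_lda  <- lda(class ~ x1 + x2, data = train_df)    # from the MASS package
fit_qda  <- qda(class ~ x1 + x2, data = train_df)
pred_lda <- predict(fit_lda)$class
pred_qda <- predict(fit_qda)$class
table(predicted = pred_lda, truth = train_df$class)  # 3 x 3 confusion matrix
mean(pred_lda != train_df$class)                     # LDA training error
mean(pred_qda != train_df$class)                     # QDA training error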
(d) (2 pt) Generate n = 500 test observations for each class using the same procedure in part
(a). Report the test error using LDA. Also report the test error using QDA. Compare
these with the training errors. Which model do you prefer?
(e) (2 pt) What is the key difference between LDA and QDA? Which of the two methods has
more parameters to estimate? Based on the number of parameters each model estimates,
which model is simpler and which model is more generalizable?
(f) (2 pt) Based on the true data generation process in part (a) and your answer to part
(e), would you prefer LDA or QDA? Reconcile any discrepancies with your answer in
part (d).