STA 142A: Homework 3
• Homework due in Canvas: 02/26 (Friday) at 11:59PM.
• Please follow the instructions in Canvas regarding HWs.
• You are encouraged to discuss the problems with your classmates.
However, copying homework constitutes a violation of the UC Davis
Code of Academic Conduct, and appropriate action will be taken.
1. Smoothing Splines. Question 2 from page 298 and question 5 from page 299. (For any
plots involved, just a hand-drawn plot is sufficient)
2. Trees and Bagging. Question 1 from page 332 and question 5 from page 332. (For any
plots involved, just a hand-drawn plot is sufficient)
3. GAM + Splines. For this question, the pyGAM package will be useful.
In this question, we will perform binary classification with multivariate input data. To handle
the multivariate nature, we will use a generalized additive model. Let X ∈ Rp represent the
input random variable and Y represent the output random variable for binary classification
(note that we let Y ∈ {0, 1} instead of Y ∈ {−1, 1}, which we typically used in class, as the
pyGAM package follows that convention). Let the conditional distributions be as follows:
(a) For even j, the jth coordinate of X is distributed as
Xj | (Y = 1) is a t-distribution with 1 degree of freedom with mean 2,
Xj | (Y = 0) is a t-distribution with 1 degree of freedom with mean 0.
(b) For odd j, the jth coordinate of X is distributed as
Xj | (Y = 1) is an exponential distribution with λ = 1,
Xj | (Y = 0) is an exponential distribution with λ = 3,
and let P (Y = 1) = 0.5. Details about the t-distribution and the exponential distribution can
be found in the Wikipedia links here and here, respectively. You could use
np.random.standard_t, np.random.exponential, and np.random.binomial for this question.
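As a starting point, a minimal numpy sketch of the data-generating process above is given below. The helper name generate_data is hypothetical, and two readings of the statement are assumptions: coordinates are taken as 0-indexed, and "mean 2" for the t-distribution with 1 degree of freedom is taken as a location shift (a t with 1 df has no finite mean). Note also that numpy's exponential is parameterized by scale = 1/λ.

```python
import numpy as np

def generate_data(n, p, rng):
    """Sample (X, y) from the model in the problem statement.

    Assumptions: 0-indexed coordinates; "mean 2" read as a location
    shift of a t(1); numpy exponential takes scale = 1/lambda.
    """
    y = rng.binomial(1, 0.5, size=n)          # P(Y = 1) = 0.5
    X = np.empty((n, p))
    for j in range(p):
        if j % 2 == 0:                        # even coordinates: shifted t(1)
            X[:, j] = rng.standard_t(1, size=n) + np.where(y == 1, 2.0, 0.0)
        else:                                 # odd coordinates: exponential
            X[:, j] = rng.exponential(np.where(y == 1, 1.0, 1.0 / 3.0))
    return X, y

X, y = generate_data(n=100, p=10, rng=np.random.default_rng(0))
```

Each call to generate_data produces one data set of n samples; you would call it once for training and once for testing in each trial.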
(a) Let p = 10. Repeat the following procedure for 100 trials: Generate n = 100 training
data samples (x1, y1), . . . , (x100, y100) from the above model. Note that here each
xi ∈ Rp, and for all i, xi,j represents the jth coordinate of the ith training sample,
which follows the above generating process. Train a logistic generalized additive model
classifier on this training data (you could use LogisticGAM from the pyGAM package).
Generate n = 100 testing data samples from the same model. Note that you will know the
true labels in this testing data as you generated it. Plot a box-plot of the test errors.
What are the mean and variance of the test errors?
(Here, for each trial, the test error is defined as the number of misclassified samples on
the testing data. Also, when running the LogisticGAM command, there might be warnings
about non-convergence; please feel free to ignore such warnings. Finally, this experiment
might take some time to run (about 10 minutes on a reasonable laptop).)
(b) Repeat the above procedure with p = 30. Comment on the running time and test error
differences from the previous case.