Homework 2 Machine Learning II, Semester B 2021/2022
Notes: Please upload all your code with your assignment on Canvas before 3pm on April 1, 2022 (this is not a joke). Homework must be neatly written up or typed for submission. I reserve the right to refuse homework that is deemed (by me) to be excessively messy.
1. Probabilistic PCA and Factor Analysis. Suppose m < p and that W is some p × m matrix. Suppose further that Z ∼ N(0, I_m) and that X | Z ∼ N(µ + WZ, σ²I_p). This is called the probabilistic PCA model.
(a) Let Σ = Cov(X). Prove that the eigenvectors of Σ are the same as the eigenvectors of WWᵀ. Thus, estimating the eigenvectors of Σ is equivalent to estimating the eigenvectors of WWᵀ. (A numerical sketch of this relationship follows part (c).)
(b) Given training data x_1, . . . , x_N ∈ R^p, prove that the MLE of µ is µ̂ = (1/N) ∑_{i=1}^N x_i.
(c) Explain the relationship between probabilistic PCA and statistical factor analysis.
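For part (a), a useful first step is that Cov(X) = WWᵀ + σ²I_p, so any eigenvector v of WWᵀ with eigenvalue λ satisfies Σv = (λ + σ²)v. The following numpy sketch only illustrates this numerically, with arbitrary illustrative choices of p, m, and σ²; it is not a substitute for the proof.

import numpy as np

# Numerical illustration of part 1(a): Sigma = W W^T + sigma^2 I_p, so every
# eigenvector v of W W^T with eigenvalue lam satisfies Sigma v = (lam + sigma^2) v.
rng = np.random.default_rng(0)
p, m, sigma2 = 5, 2, 0.5                 # arbitrary illustrative values
W = rng.normal(size=(p, m))

WWt = W @ W.T
Sigma = WWt + sigma2 * np.eye(p)         # Cov(X) under the probabilistic PCA model

vals, V = np.linalg.eigh(WWt)            # eigenvalues (ascending) and eigenvectors of W W^T
for lam_i, v in zip(vals, V.T):
    assert np.allclose(Sigma @ v, (lam_i + sigma2) * v)

# The eigenvalues of Sigma are those of W W^T shifted up by sigma^2.
print(np.allclose(np.linalg.eigvalsh(Sigma), vals + sigma2))   # True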
2. PCA. Consider the monthly log stock returns, in percentages and including dividends, of Merck & Company, Johnson & Johnson, General Electric, General Motors, Ford Motor Company, and the value-weighted index from January 1960 to December 1999; see the file m-mrk2vw.txt, which has six columns in the order listed above.
(a) Perform a principal component analysis of the data using the sample covariance matrix. Try different numbers of principal components and report the variance explained in each case.
(b) Perform a principal component analysis of the data using the radial (RBF) kernel. Try different numbers of principal components and use cross-validation to tune the kernel parameter σ. A code sketch for both parts appears below.
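A minimal sketch of both parts using sklearn. It assumes m-mrk2vw.txt is whitespace-delimited with the six return series as columns and no header row (adjust the read call to the actual layout), and the CV criterion for the kernel parameter is one reasonable choice among several, since the problem does not fix one.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA, KernelPCA
from sklearn.model_selection import KFold

# Assumed file layout: six whitespace-separated return columns, no header.
returns = pd.read_csv("m-mrk2vw.txt", sep=r"\s+", header=None).values

# (a) Linear PCA on the centered data, i.e. on the sample covariance matrix.
for k in range(1, returns.shape[1] + 1):
    pca = PCA(n_components=k).fit(returns)
    print(k, "components explain", pca.explained_variance_ratio_.sum())

# (b) Kernel PCA with a radial (RBF) kernel. sklearn parameterizes it as
# exp(-gamma * ||x - x'||^2), so gamma corresponds to 1 / (2 * sigma^2).
# One simple CV criterion (an assumption, not mandated by the problem) is
# the pre-image reconstruction error on held-out folds.
def cv_reconstruction_error(X, gamma, k, n_splits=5):
    errs = []
    for tr, te in KFold(n_splits, shuffle=True, random_state=0).split(X):
        kpca = KernelPCA(n_components=k, kernel="rbf", gamma=gamma,
                         fit_inverse_transform=True).fit(X[tr])
        X_hat = kpca.inverse_transform(kpca.transform(X[te]))
        errs.append(((X[te] - X_hat) ** 2).mean())
    return np.mean(errs)

for gamma in [1e-3, 1e-2, 1e-1, 1.0]:
    print("gamma =", gamma, "CV error =", cv_reconstruction_error(returns, gamma, k=3))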
3. LARS, Lasso and Ridge. This problem uses the big8 dataset, which is available on
Canvas. The dataset contains information on 8 companies from the year 2004. The variable
RETX contains the daily simple returns. For this problem we take the return of the S&P500
index (labeled sprtrn in the data file) to be the output (i.e. y), and the returns of AIG,
C, COP, F, GE, GM, IBM, XOM (labeled RETX in the data file) on the same day to be
inputs (i.e. X). Divide the complete dataset into training and testing datasets, by date: The
training dataset should contain the data from Jan. 2, 2004 to June 30, 2004; the testing
dataset should contain the data from July 1, 2004 to Dec. 31, 2004. Run the following
regression methods on the training data:
(i) LARS
(ii) Lasso
(iii) Ridge regression
You may use the linear_model module of sklearn to fit LARS, Lasso, and ridge regression. Remember to include the intercept term in your regression models and to standardize appropriately!
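A minimal sketch of the data preparation and the three fits. The file name "big8.csv" and the column names "date" and "TICKER" are assumptions about how the big8 data are organized ("RETX" and "sprtrn" are the labels given above); adapt the reshaping to the actual file layout.

import pandas as pd
from sklearn.linear_model import Lars, Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Assumed long-format file with one row per (date, ticker); reshape to wide.
df = pd.read_csv("big8.csv", parse_dates=["date"])
wide = df.pivot(index="date", columns="TICKER", values="RETX")
wide["sprtrn"] = df.groupby("date")["sprtrn"].first()

train = wide.loc["2004-01-02":"2004-06-30"]
test = wide.loc["2004-07-01":"2004-12-31"]

tickers = ["AIG", "C", "COP", "F", "GE", "GM", "IBM", "XOM"]
scaler = StandardScaler().fit(train[tickers])          # standardize using training data only
X_train, y_train = scaler.transform(train[tickers]), train["sprtrn"].values
X_test, y_test = scaler.transform(test[tickers]), test["sprtrn"].values

# All three estimators include an intercept by default (fit_intercept=True).
lars = Lars().fit(X_train, y_train)                    # lars.active_ / lars.coef_path_ trace entry order for (a)
lasso = Lasso(alpha=1e-3).fit(X_train, y_train)        # tuning parameter to be chosen in part (c)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)         # tuning parameter to be chosen in part (c)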
(a) For LARS and Lasso, list the order in which the predictors enter the regression model.
(b) For ridge regression, fit the estimators for a fine grid of reasonably chosen tuning parameter values λ (use at least 100 values of λ).
(c) For each method use (i) AIC and (ii) 5-fold cross-validation to pick a “final model”, based only on the training data. Estimate the test error for your final models using the test data, i.e. find the average prediction error. How does the test error for the final models compare to the minimum test error for each method? (The minimum test error for a given method is found by computing the test error for all tuning parameter values and then finding the minimum.)
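A sketch of part (c), continuing from the variables defined in the earlier sketch (X_train, y_train, X_test, y_test). LassoLarsIC handles AIC selection along the LARS/Lasso path; an AIC for ridge would additionally need the effective degrees of freedom (trace of the hat matrix) and is omitted here, with 5-fold CV shown for ridge instead.

import numpy as np
from sklearn.linear_model import LassoLarsIC, Ridge
from sklearn.model_selection import GridSearchCV

# (i) AIC-based choice of the Lasso tuning parameter along the LARS path.
lasso_aic = LassoLarsIC(criterion="aic").fit(X_train, y_train)
print("AIC-selected Lasso alpha:", lasso_aic.alpha_)

# (ii) 5-fold cross-validation for ridge over a fine grid (at least 100 values).
alphas = np.logspace(-5, 2, 200)
ridge_cv = GridSearchCV(Ridge(), {"alpha": alphas},
                        scoring="neg_mean_squared_error", cv=5).fit(X_train, y_train)
print("CV-selected ridge alpha:", ridge_cv.best_params_["alpha"])

# Test error = average squared prediction error on the second-half test data.
def test_mse(model):
    return np.mean((y_test - model.predict(X_test)) ** 2)

print("Final ridge model test MSE:", test_mse(ridge_cv.best_estimator_))

# Minimum test error over the whole grid, for comparison with the final model.
grid_errors = [test_mse(Ridge(alpha=a).fit(X_train, y_train)) for a in alphas]
print("Minimum ridge test MSE over the grid:", min(grid_errors))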

