ST309 – Exercise 4
This counts for 10% of the final assessment of the course.
Please submit your solutions in a pdf file to Moodle by noon on Friday, 5 November. Late submission entails
penalties: 10 marks (out of a maximum of 100) will be deducted for each working day. Submissions are not accepted
after noon on Tuesday, 9 November.
You should submit only your own work, and you may not use materials from previous years and/or from your classmates.
Plagiarism is a very serious offense that is quite easy to detect. It will result in instant failure (mark 0).
1. For the data set “college.txt” created in Exercise 2-4, our goal now is to create some classification models
in order to predict if a college is elite or not using all variables in the data except Top10perc (as Elite
is defined by Top10perc).
(a) Fit a decision tree using all the available data. [5 marks]
Note. When you import the data from your saved file, you may need to re-define the class
> college$Elite=as.factor(college$Elite)
> college$Private=as.factor(college$Private)
as R may treat them as character strings instead of binary factors. Check summary(college).
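A minimal sketch for (a), assuming the tree package (as used in the ISLR labs) and that college.txt was saved as a comma-separated file; adjust read.csv to match however you saved the file in Exercise 2-4:

```r
library(tree)                                  # assumed package; rpart works similarly
college = read.csv("college.txt")              # adjust to your saved file format
college$Elite = as.factor(college$Elite)
college$Private = as.factor(college$Private)
summary(college)                               # check the classes are correct
tree.elite = tree(Elite ~ . - Top10perc, data = college)
summary(tree.elite)
plot(tree.elite)
text(tree.elite, pretty = 0)
```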
(b) Randomly divide the data into two sets: the training set with the sample size 500 and the testing
set. Re-fit a decision tree using training data only and check its performance using the testing data.
[5 marks]
(c) Fit a logistic model for the training data using a subset of variables as predictors. You may choose
the predictors by examining the two fitted tree models in (a) and (b). Refine the fitted logistic model
by removing insignificant variables. Also compare different models using the test data created in (b)
above. [10 marks]
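For (b) and (c), one possible outline, continuing from (a). The seed is arbitrary, and Outstate and Expend below are purely illustrative predictors; choose your own from the fitted trees:

```r
set.seed(1)                                    # arbitrary seed; results vary with the split
train = sample(nrow(college), 500)             # indices of the training set
tree.train = tree(Elite ~ . - Top10perc, data = college, subset = train)
pred = predict(tree.train, college[-train, ], type = "class")
table(pred, college$Elite[-train])             # confusion matrix on the testing set

# (c): logistic regression; the two predictors here are illustrative only
glm.fit = glm(Elite ~ Outstate + Expend, data = college,
              subset = train, family = binomial)
summary(glm.fit)
glm.prob = predict(glm.fit, college[-train, ], type = "response")
table(glm.prob > 0.5, college$Elite[-train])   # compare on the same test data
```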
2. This question involves the use of multiple linear regression on the data set Auto which is included in the
package ISLR:
> library(ISLR)
> View(Auto)
> ?Auto
> summary(Auto) # Always a good idea to look at the summary first!
(a) Produce a scatterplot matrix which includes all of the variables except name in the data set using the
function pairs(). [3 marks]
(b) Compute the matrix of correlations between the variables using the function cor(). You will need
to exclude the name variable. [3 marks]
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other
variables except name as the predictors. Use the summary() function to print the results. [4 marks]
i. Is there a relationship between the predictors and the response? [2 marks]
ii. Which predictors appear to have a statistically significant relationship to the response? You may
comment with reference to the plot produced in (a) above. [4 marks]
iii. What do the coefficients for year and origin suggest? [4 marks]
iv. State how you would deal with the insignificant predictors. (You are not required to do it.)
Again, you should make reference to the plot in (a). [4 marks]
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any
problems you see with the fit. Do the residual plots suggest any unusually large outliers? [4 marks]
(e) Plot pairs(data.frame(log(Auto$mpg),Auto[,-c(1,9)])). Compare it with the plot in (a). Does
it suggest any possible improvement in modelling mpg? [3 marks]
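A sketch of the main commands for (a)–(d), assuming name is the 9th column of Auto:

```r
library(ISLR)
pairs(Auto[, -9])                         # (a): scatterplot matrix, excluding 'name'
round(cor(Auto[, -9]), 2)                 # (b): correlation matrix of the numeric variables
fit = lm(mpg ~ . - name, data = Auto)     # (c): mpg on all predictors except name
summary(fit)
par(mfrow = c(2, 2))
plot(fit)                                 # (d): the four standard diagnostic plots
```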
3. We examine the impact of the correlations such as collinearity and endogeneity on regression.
(a) To simulate from the regression model
y = β0 + β1x1 + β2x2 + ε, (1)
we perform the following commands in R:
> set.seed(3)
> x1=runif(150) # 150 U(0,1) random numbers
> x2=0.5*runif(150)+rnorm(150)/5 # rnorm(150) returns 150 N(0,1) random numbers
> y=2+2*x1+x2+rnorm(150)
Now x1 and x2 are independent.
i. Write down β0, β1, β2 used in the simulation. What is the distribution of ε? [4 marks]
ii. Use the first 100 data points to estimate the regression model for y with both x1 and x2 as
regressors. Identify 95% confidence intervals for β0, β1, β2. [7 marks]
iii. Use the fitted model to predict the last 50 data points of y, and calculate the root mean squared
predictive error (rMSPE) = {(1/50) Σᵢ (yᵢ − ŷᵢ)²}^{1/2}. [7 marks]
Hint: For ii. and iii., you may run
> ytrain=y[1:100]; ytest=y[101:150]
> x=data.frame(x1, x2); x.train=x[1:100,]; x.test=x[101:150,]
> m1=lm(ytrain~x1+x2, data=x.train)
> summary(m1)
... ...
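Continuing the hint, ii. and iii. might be completed along these lines:

```r
confint(m1)                                # ii.: 95% confidence intervals for the coefficients
yhat = predict(m1, newdata = x.test)       # iii.: predict the last 50 data points
rMSPE = sqrt(mean((ytest - yhat)^2))       # root mean squared predictive error
rMSPE
```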
(b) Now we simulate model (1) as follows:
> set.seed(3)
> x1=runif(150)
> x2=0.5*x1+rnorm(150)/5
> y=2+2*x1+x2+rnorm(150)
i. The model is formally the same as in (a) except that x1 and x2 are now correlated, which is
referred to as collinearity. Note that the marginal distributions of x1 and x2 are unchanged,
but the two variables are correlated with each other. Their correlation is
ρ = Corr(x1, x2) = Cov(x1, x2)/√(Var(x1)Var(x2)) = 0.5 Var(x1)/√(Var(x1)Var(x2))
= 0.5 √(Var(x1)/Var(x2)) = 0.5 √((1/12)/(0.25/12 + 1/25)) ≈ 0.585.
Note the variance of U(0,1) is 1/12.
Estimate ρ using R-function cor from the data. [2 marks]
ii. Repeat ii. and iii. in (a) above. [6 marks]
iii. Compare and comment on the results obtained now with those in (a). [4 marks]
iv. Since x2 is not significant in the fitted model, fit the data with the model without x2. Compare
and comment on your findings. [6 marks]
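For (b), a possible sketch after re-running the simulation commands above, assuming the ytrain/ytest/x.train/x.test split from the hint in (a) has been re-created from the new data:

```r
cor(x1, x2)                                # i.: sample estimate of rho (close to 0.585)
m1 = lm(ytrain ~ x1 + x2, data = x.train)  # ii.: same model as in (a)
summary(m1)
confint(m1)
sqrt(mean((ytest - predict(m1, newdata = x.test))^2))   # iii.: rMSPE

# iv.: refit without x2 and compare
m2 = lm(ytrain ~ x1, data = x.train)
summary(m2)
sqrt(mean((ytest - predict(m2, newdata = x.test))^2))   # rMSPE without x2
```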
(c) The following commands also simulate a sample from model (1).
> set.seed(3)
> x1=runif(150)
> epsilon=rnorm(150)
> x2=0.5*runif(150)+epsilon/5
> y=2+2*x1+x2+epsilon
i. Now the model is formally still the same as in (a) except that x2 and ε are correlated, which is
referred to as endogeneity. Estimate the correlation between x2 and ε from the data. [2 marks]
ii. Repeat ii. and iii. in (a) above. [6 marks]
iii. Compare and comment on the results obtained now with those in (a), and in particular explain why the rMSPE
is now much smaller than those in (a) and (b). [5 marks]
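For (c), a sketch after re-simulating with the commands above, again assuming the same training/testing split as in the hint for (a):

```r
cor(x2, epsilon)                           # i.: estimated correlation between x2 and the error
m1 = lm(ytrain ~ x1 + x2, data = x.train)  # ii.: same regression as before
summary(m1)
confint(m1)
yhat = predict(m1, newdata = x.test)
sqrt(mean((ytest - yhat)^2))               # iii.: rMSPE under endogeneity
```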