UC San Diego
Halıcıoğlu Data Science Institute
DSC140A Probabilistic Modeling and Machine Learning
Problem Set 2
Due: Friday, April 22 2022 @ 11:59 PM PST
Additional References
⋆ Lectures 4, 5, 6
⋆ Recitations 2, 3
⋆ ISL 3.2, 3.3, 5.2, 6.2
⋆ Shalizi 12.1, 12.5, 13, 15, 20, 24.5
1. Simple Linear Regression
Consider a simple linear regression model

    \hat{y}_i = \hat{w}_0 + \hat{w}_1 x_i .

(a) Let RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 denote the residual sum of squares. Show that the following
parameters minimize the RSS:

    \hat{w}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}    (1)

    \hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}    (2)
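As a quick numerical sanity check (not part of the required derivation), the closed-form estimates in (1) and (2) can be compared against a generic least squares fit; the data below are made up for illustration:

```python
# Numerical check of the closed-form OLS estimates for simple linear
# regression, on synthetic data (all values here are illustrative).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)

x_bar, y_bar = x.mean(), y.mean()
w1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # eq. (1)
w0_hat = y_bar - w1_hat * x_bar                                        # eq. (2)

# np.polyfit minimizes the same RSS, so both estimates should agree.
slope, intercept = np.polyfit(x, y, deg=1)
```

Since both procedures minimize the same RSS, `(w1_hat, w0_hat)` and `(slope, intercept)` should agree to numerical precision.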
(b) Show that the simple linear regression line should always pass through the point (x̄, ȳ).
(c) We have R² = (TSS − RSS)/TSS, where TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 is the total sum of squares.
Show that the R² statistic is equal to the square of the correlation between X and Y. For
simplicity, you may assume that x̄ = ȳ = 0.
(d) Under the simple linear regression model y = ŵ_0 + ŵ_1 x, show that Var(ȳ) = σ²/n. Here ȳ is the
mean of the predicted variable y. You may treat the x_i as fixed in this calculation.
(e) We now consider an even simpler version of this model without an intercept. In this setting,
we have ŷ_i = x_i ŵ, where

    \hat{w} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{k=1}^{n} x_k^2} .

Show that we can write

    \hat{y}_i = \sum_{k=1}^{n} a_k y_k ,

where each a_k is expressed in terms of the x values. Provide an intuitive description of this
representation.
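One way to see this representation numerically — a sketch with made-up data, viewing the no-intercept fit as a "linear smoother":

```python
# Sketch of the linear-smoother view of the no-intercept fit: each
# fitted value is a weighted sum of ALL observed responses, with
# weights a_k = x_i * x_k / sum_j x_j^2 (synthetic data, illustrative).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 5, size=20)
y = 1.5 * x + rng.normal(0, 0.5, size=20)

w_hat = np.sum(x * y) / np.sum(x ** 2)   # no-intercept OLS estimate
yhat_direct = x * w_hat

A = np.outer(x, x) / np.sum(x ** 2)      # A[i, k] = a_k for fitted value i
yhat_weighted = A @ y                    # same fitted values, row by row
```

Both ways of computing the fitted values coincide, and the trace of the weight matrix equals 1 here (the "effective degrees of freedom" of this one-parameter fit).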
2. Linear Regression with Multiple Measurements
Consider the case where we have two different datasets with measurements from slightly different
conditions. The two datasets have n and m examples, respectively. In the first dataset, the n
examples are governed by

    y^{(i)} \sim N(\theta \cdot x^{(i)}, \sigma_1^2),   i = 1, \ldots, n,

for some θ we need to estimate. In the second dataset, we have m training examples (x^{(i)}, y^{(i)}),
i = n+1, \ldots, n+m, where

    y^{(i)} \sim N(\theta \cdot x^{(i)}, \sigma_2^2),   i = n+1, \ldots, n+m,

with the same parameter θ. Let θ̂_ML denote the maximum likelihood estimate of the parameter
based on both datasets.
(a) What should θ̂_ML reduce to if σ1 = σ2?
(b) What is θ̂_ML if σ1 ≠ σ2?
(c) Describe why θ̂_ML does the right thing even when σ1 ≫ σ2.
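For intuition on parts (a)–(c): assuming a scalar x and known variances, maximizing the joint Gaussian likelihood over both datasets reduces to inverse-variance-weighted least squares. The closed form below is a sketch with made-up parameter values, not the required derivation:

```python
# Sketch: ML estimation of theta from two datasets with different known
# noise variances. The weighted closed form follows from setting the
# derivative of the joint Gaussian log-likelihood to zero.
import numpy as np

rng = np.random.default_rng(2)
theta_true, s1, s2 = 2.5, 0.2, 2.0      # illustrative values

x1 = rng.uniform(1, 3, size=100)
y1 = theta_true * x1 + rng.normal(0, s1, size=100)
x2 = rng.uniform(1, 3, size=100)
y2 = theta_true * x2 + rng.normal(0, s2, size=100)

# Each dataset's terms are weighted by 1/sigma^2, so the cleaner
# (low-noise) dataset dominates the estimate when s1 << s2.
num = np.sum(x1 * y1) / s1**2 + np.sum(x2 * y2) / s2**2
den = np.sum(x1**2) / s1**2 + np.sum(x2**2) / s2**2
theta_ml = num / den
```

When σ1 = σ2 the weights cancel and this collapses to ordinary least squares on the pooled data.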
3. Log Transformations
Figure 1 shows the relationship between the price of rice in 2003 (X) and the price of rice in 2009
(Y ). The plot on the left shows X vs Y while the plot on the right shows log(X) vs log(Y ).
Figure 1: Price of rice in 2003 and 2009 in major cities. Each plot includes the best-fit curve using
simple linear regression.
(a) Denote the regression curve on the left plot as ŷ = ŵ_0 + ŵ_1 x. We can clearly observe that
ŵ_1 < 1. Does this suggest that the prices were lower in 2009 than in 2003?
(b) Denote the regression curve on the right plot as log(ŷ) = v̂_0 + v̂_1 log(x). Provide an intuitive
description of the slope parameter v̂_1.
(c) The Cobb-Douglas production function is a model of the form

    E(y \mid x) = \gamma x^w ,

where x is an input and y is an output.
i. Convert this form to a simple linear regression setting.
ii. Describe the influence of the γ parameter on the model. What happens when γ > 1?
What about when γ < 1?
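As a sketch of part i., taking logs turns the Cobb-Douglas form into a simple linear regression of log(y) on log(x). The parameter values and multiplicative noise model below are made up for illustration:

```python
# Log-linearizing the Cobb-Douglas form: log y = log(gamma) + w log(x),
# so a straight-line fit on log-log axes recovers both parameters
# (synthetic data with multiplicative noise; values are illustrative).
import numpy as np

rng = np.random.default_rng(3)
gamma_true, w_true = 1.8, 0.6
x = rng.uniform(1, 10, size=200)
y = gamma_true * x ** w_true * np.exp(rng.normal(0, 0.05, size=200))

# Slope estimates w; intercept estimates log(gamma).
w_hat, log_gamma_hat = np.polyfit(np.log(x), np.log(y), deg=1)
gamma_hat = np.exp(log_gamma_hat)
```

This is the same log-log transformation used in the rice-price plot on the right of Figure 1.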
4. Omitted Variable Bias
Consider simple linear regression where our true model is
y = w0 + w1x1 + w2x2 + ϵ
Note that there is a key assumption required for computing unbiased estimates of the w
parameters. If that assumption is false, then it can induce bias in our estimates. Now suppose
we fit a model of the form
y = α0 + α1x1 + ϵ.
That is, a model that is fit without x2. In this setup, the failure to include x2 may lead to estimates
of α0 and α1 that suffer from omitted variable bias.
(a) Which assumption is violated by omitting x2 in our setting?
(b) Derive the magnitude of the bias in the estimate of α1.
(c) Design a real-world example of a regression dataset where y depends on two relevant variables
(x1, x2) and one irrelevant variable (x3). Show that dropping x2 from a simple linear regression
model leads to a bias while dropping the irrelevant variable x3 does not. Feel free to choose
the covariance matrix of the variables yourself.
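A simulation along the lines of part (c) — all coefficients and the dependence of x2 on x1 below are made up for illustration:

```python
# Omitted-variable bias sketch: x2 is correlated with x1, x3 is
# irrelevant. Dropping x2 shifts the x1 coefficient by roughly
# w2 * Cov(x1, x2) / Var(x1); dropping x3 changes essentially nothing.
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
x1 = rng.normal(0, 1, n)
x2 = 0.8 * x1 + rng.normal(0, 1, n)   # relevant, correlated with x1
x3 = rng.normal(0, 1, n)              # irrelevant (true coefficient 0)
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(0, 1, n)

def ols(*cols):
    """Least squares with an intercept on the given columns."""
    X = np.column_stack([np.ones(n), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

full = ols(x1, x2, x3)   # x1 coefficient near its true value 2.0
no_x2 = ols(x1)          # x1 coefficient near 2.0 + 1.5 * 0.8 = 3.2
no_x3 = ols(x1, x2)      # x1 coefficient still near 2.0
```

Here Cov(x1, x2) = 0.8 and Var(x1) = 1 by construction, so the predicted bias is 1.5 × 0.8 = 1.2.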
5. Design a Multiple Choice Question
For this problem, we ask that you work on your own. We will grade responses for completeness and
correctness. We may include the best question(s) in the midterm or the final.
Design a multiple choice question (using this Google Forms link) to test one of the concepts covered
in lectures 3 – 5. Your question should be multiple choice (i.e., "choose one answer" or "choose
all that apply").
• Please use this Google Forms link to add your question.
• Your question should include what you think the right answer(s) should be.
• Good questions should be relevant to a specific lecture, straightforward to answer for someone
who understands the material, but hard to get full credit on otherwise.
• Good questions will come from the "Comprehension" or "Application" levels of Bloom's Taxonomy.
Higher levels are better.
• See, for example, Designing Great Multiple Choice Questions for additional guidance.
6. Computational Problem I: Multicollinearity
Create an array of 100 values that are sampled from a Uniform
(
0, 12
)
and call it x1. Then, define
another array called x2, where x2 = 0.5 × x1 + X/10, X ∼ Uniform
(
0, 12
)
. Finally define y =
2 + 2x1 + 0.3x2 +X,X ∼ Normal
(
0, 12
)
. Here, y is a function of x1 and x2.
(a) Write out the form of the linear model. What are the regression coefficients?
(b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship
between the variables.
(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe your
results. What are ŵ_0, ŵ_1, and ŵ_2?
(d) How do ŵ_0, ŵ_1, and ŵ_2 relate to their true values? Can you reject the null hypothesis
H0 : w1 = 0? How about the null hypothesis H0 : w2 = 0?
(e) Find the 95% CI on the coefficients via the bootstrap (N = 1000). Plot the distribution,
give the 95% CI values (in terms of left and right percentiles), and state whether w1 = 0
(the null hypothesis from part (d)) seems reasonable.
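A percentile-bootstrap sketch for part (e): resample rows with replacement, refit, and take percentile intervals. The data is generated as described at the start of this problem; distribution parameters are one reading of the statement:

```python
# Percentile-bootstrap sketch for the regression coefficients
# (N = 1000 resamples). Data generation mirrors the problem setup.
import numpy as np

rng = np.random.default_rng(6)
x1 = rng.uniform(0, 0.5, 100)
x2 = 0.5 * x1 + rng.uniform(0, 0.5, 100) / 10
y = 2 + 2 * x1 + 0.3 * x2 + rng.normal(0, 0.5, 100)
X = np.column_stack([np.ones(100), x1, x2])

boots = np.empty((1000, 3))
for b in range(1000):
    idx = rng.integers(0, 100, size=100)              # resample rows
    boots[b] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]

lo, hi = np.percentile(boots, [2.5, 97.5], axis=0)    # 95% CI per coefficient
# Because x1 and x2 are highly collinear, expect the individual
# coefficient intervals to be wide.
```

Plotting a histogram of each column of `boots` gives the bootstrap distributions the problem asks for.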
7. Computational Problem: Interaction Effects
In this question, we will explore the association of total cholesterol, age, and diabetes using a
sample of n = 500 adults from the NHANES study. Our dataset includes demographic and health
variables for each adult, including their cholesterol level in mmol/L (TotChol), age (Age), and
diabetes status (Diabetes), coded as either 1 (yes) or 0 (no).
(a) Load the dataset. Fit a linear regression model for predicting total cholesterol level from age
and diabetes status.
i. Print a summary of the fitted linear model. Explain the values you get for (1) R-squared;
(2) std-error; and (3) P > |t|.
ii. Write an expression for the linear model.
iii. Write expressions of the linear model for diabetic individuals and one for non-diabetic
individuals.
iv. Create a scatterplot of age (x-axis) vs total cholesterol (y-axis). Plot the lines from part
(iii). Describe the relationship between these models.
v. Add confidence intervals to the lines in your plot.
vi. Add prediction intervals to the lines in each plot. How do they relate to confidence
intervals?
vii. Plot the residuals against Age. Can you support the probabilistic assumptions in your
case?
Linear regression models can allow the relationship between the outcome and a specific ex-
planatory variable to change based on other variables through interaction terms. Consider the
model:
E(TotChol) = w0 + w1(Age) + w2(Diabetes) + w3(Diabetes×Age).
Here, (Diabetes×Age) is an interaction term between diabetes and age, and w3 is its coefficient.
(b) Repeat the analysis in the previous steps. Use your results to evaluate whether you should
include an interaction term or not.
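A minimal numpy-only sketch of fitting the interaction model. The actual NHANES file isn't specified here, so the data below is a synthetic stand-in with the same variable names, and all coefficient values are made up:

```python
# Interaction-model fit by least squares on a synthetic stand-in for
# the NHANES sample (column names match the problem; values invented).
import numpy as np

rng = np.random.default_rng(7)
n = 500
age = rng.uniform(20, 80, n)
diabetes = rng.integers(0, 2, n).astype(float)        # 1 = yes, 0 = no
totchol = (4.0 + 0.010 * age + 1.2 * diabetes
           - 0.012 * diabetes * age + rng.normal(0, 0.5, n))

# Design matrix: intercept, Age, Diabetes, and the interaction column.
X = np.column_stack([np.ones(n), age, diabetes, diabetes * age])
w = np.linalg.lstsq(X, totchol, rcond=None)[0]        # [w0, w1, w2, w3]

# Lines implied by the fit:
#   non-diabetic: w0 + w1 * Age
#   diabetic:     (w0 + w2) + (w1 + w3) * Age
```

A library such as statsmodels would additionally report the standard error and p-value of w3, which is the usual basis for deciding whether the interaction term belongs in the model.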