REGRESSION MODELLING
WEEK 1
Abhinav Mehta
REVISION OF BASIC STATISTICS
1 / 33
POPULATION AND SAMPLE
Population (True world)
• The entire collection of individuals or items you are interested in.
• Parameters: true values describing the population, for example
$\mu$, $\sigma^2$. Usually they are unknown.
Sample (Your observed world)
• A set of individuals randomly drawn from a population.
• Statistics: quantities calculated from the sample that serve as
estimators of the parameters, for example $\bar{X}$, $S^2$.
2 / 33
PROPERTIES OF ESTIMATORS
• An estimator is a random variable, e.g., $\bar{X}$.
• It has a probability distribution, often called the sampling distribution.
• $E(\bar{X}) = \mu$, $\operatorname{Var}(\bar{X}) = \sigma^2/n$.
• Central Limit Theorem (CLT): $\bar{X}$ is asymptotically normally
distributed.
• These properties let us make statistical inferences: confidence
intervals and hypothesis testing (see the simulation sketch below).
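As an aside (not from the original slides), here is a minimal Python sketch of these properties, assuming a skewed Exponential(1) population so the CLT's normalising effect is visible:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Population assumed Exponential(1): mu = 1, sigma^2 = 1, heavily skewed.
n, n_reps = 50, 10_000

# Draw n_reps independent samples of size n; compute each sample mean.
xbars = rng.exponential(scale=1.0, size=(n_reps, n)).mean(axis=1)

print(xbars.mean())  # ~1.0,  illustrating E(X-bar) = mu
print(xbars.var())   # ~0.02, illustrating Var(X-bar) = sigma^2 / n
# A histogram of xbars looks approximately normal (CLT), despite the
# skewness of the underlying population.
```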
3 / 33
REGRESSION
4 / 33
WHAT IS REGRESSION?
• Statistical methodology that describes the relation between two
or more variables so that a response or outcome variable can be
estimated from the other explanatory variables.
• This methodology is widely used in business, the social and
behavioural sciences, the biological sciences, and many other
disciplines.
5 / 33
WHAT IS REGRESSION?
Examples
• Predict sales of a product using the relationship between sales and
the amount spent on advertising (SLR).
• Predict an employee's performance using the relationship between
performance and aptitude test scores (SLR).
• Predict the size of a child's vocabulary using the relationship
between vocabulary size, the age of the child, and the amount of
education of the parents (MLR).
• Does the price of a house increase with living area? (MLR)
6 / 33
RELATION BETWEEN VARIABLES
We should distinguish between functional relation and a statistical
relation between variables.
• A functional relation between two variables is expressed as a
mathematical formula,
$Y = f(X).$
A functional relation is a “perfect” mapping from $X$ to $Y$.
• A statistical relation is not perfect. The observations do not fall
directly on the curve of relationship and they are typically
scattered around this curve.
7 / 33
RELATIONSHIP BETWEEN VARIABLES
8 / 33
REGRESSION MODELS
Historical Origins
• The term regression was first used by Francis Galton in the late
19th century to explain a biological phenomenon he observed:
“regression towards the mean”.
• The height of children of both tall and short parents appeared to
“revert” or “regress” to the mean of the group.
9 / 33
GALTON’S DATASET
This data set lists the individual observations for 934 children in
205 families on which Galton (1886) based his cross-tabulation.
How to formally describe the relationship?
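As an illustration not in the original slides, a minimal Python sketch that loads a version of Galton's data; it assumes network access and that the HistData/GaltonFamilies table is reachable through the Rdatasets mirror queried by statsmodels:

```python
import statsmodels.api as sm

# Fetch Galton's height data (934 children in 205 families) from the
# Rdatasets mirror; requires an internet connection.
galton = sm.datasets.get_rdataset("GaltonFamilies", "HistData").data

print(galton.head())  # columns include midparentHeight and childHeight
print(galton[["midparentHeight", "childHeight"]].corr())
```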
10 / 33
CONSTRUCTION OF REGRESSION MODELS
Selection of Variables
• $X$: independent variable, predictor, regressor, covariate.
• $Y$: dependent variable, response, outcome, output.
• Only a limited number of useful covariates should be included in
the regression model.
• How do you choose? Through exploratory studies, theory, etc.
11 / 33
CONSTRUCTION OF REGRESSION MODELS
Functional Form of Regression Relation
• Choice of $f$ in the functional form $Y = f(X)$ is tied to the choice
of covariate(s).
• Sometimes the relevant theory may indicate the appropriate
form for $f$.
• Typically needs to be determined empirically from the data.
Scatter plot may help.
• Linear or quadratic regression functions are often a first good
approximation.
12 / 33
CONSTRUCTION OF REGRESSION MODELS
Scope of Model
• We usually need to restrict the coverage of the model to some
interval or region of values.
• The scope is determined either by the design of the investigation
or by the range of data at hand.
• The model may perform badly outside this range, i.e., on
previously unobserved values of the covariate.
13 / 33
USE OF REGRESSION
Regression serves three major purposes:
• Description (How one variable influences the other).
• Control (Set standards, monitor operations, etc).
• Prediction (based on new observations).
14 / 33
REGRESSION AND CAUSALITY
• Existence of a statistical relation between response Y and
covariate X does not imply in any way that Y depends causally
on X .
• Amusing examples of misused causality:
• Do high ice-cream sales lead to more drownings?
• Reverse causality: does $X$ lead to $Y$, or does $Y$ lead to $X$?
• To avoid making such mistakes, we may need to control for
some confounders (e.g. temperature), which are related to both
$Y$ and $X$.
15 / 33
SIMPLE LINEAR REGRESSION (SLR)
16 / 33
MODEL STRUCTURE OF SLR
• Only one covariate in the model.
• Linear:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
where
• $Y_i$: the value of the response variable for the i-th
observation;
• $\beta_0$ and $\beta_1$: unknown parameters to be estimated;
• $X_i$: a known constant, the covariate value for the i-th
observation;
• $\varepsilon_i$: a random error term with mean 0 and variance $\sigma^2$ for all $i$;
• $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated for all $i \neq j$ (see the
simulation sketch below).
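A minimal simulation sketch of this data-generating process; the parameter values $\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1$ are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

beta0, beta1, sigma = 2.0, 0.5, 1.0   # hypothetical true parameters
n = 100

x = np.linspace(0, 10, n)                        # X_i: known constants
eps = rng.normal(loc=0.0, scale=sigma, size=n)   # eps_i: mean 0, variance sigma^2
y = beta0 + beta1 * x + eps                      # Y_i = beta0 + beta1 X_i + eps_i

# Each Y_i has mean beta0 + beta1 * x[i] and variance sigma^2, and the
# independent draws of eps make the Y_i uncorrelated.
```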
17 / 33
MODEL FEATURES OF SLR
The response $Y_i$ is a random variable, as it is the sum of two components:
• the constant term $\beta_0 + \beta_1 X_i$,
• the random term $\varepsilon_i$.
Since $E(\varepsilon_i) = 0$, we have
$E(Y_i) = E(\beta_0 + \beta_1 X_i + \varepsilon_i) = \beta_0 + \beta_1 X_i + E(\varepsilon_i) = \beta_0 + \beta_1 X_i,$
$\operatorname{Var}(Y_i) = \operatorname{Var}(\beta_0 + \beta_1 X_i + \varepsilon_i) = \operatorname{Var}(\varepsilon_i) = \sigma^2.$
So the probability distribution of $Y_i$ has mean $\beta_0 + \beta_1 X_i$ and
variance $\sigma^2$.
18 / 33
MODEL FEATURES OF SLR
• We actually assume a linear relationship between the mean
response and the covariate:
$E(Y_i) = \beta_0 + \beta_1 X_i.$
• Our model assumes that all the responses $Y_i$ come from a
probability distribution with mean $\beta_0 + \beta_1 X_i$ and variance $\sigma^2$.
• Since the error terms $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated, so are
$Y_i$ and $Y_j$ for $i \neq j$.
19 / 33
DISTRIBUTION OF $Y_i$
20 / 33
REGRESSION PARAMETERS
• The parameters are called regression coefficients:
• The intercept: $\beta_0$,
• The slope: $\beta_1$.
• The slope gives the change in the mean of the probability
distribution of $Y$ per unit increase in $X$.
• The intercept, when the scope of the model includes $X = 0$,
gives the mean of the probability distribution of $Y$ at $X = 0$.
21 / 33
BEFORE FITTING A MODEL
What is your question of interest?
• Statistical formulation of the question.
Source of the data
• Sample size, data cleaning (e.g., combining data from different
sources, checking for missing data), etc.
Exploratory Data Analysis
• Summary statistics, boxplots, histograms, scatterplots, etc.
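A sketch of these EDA steps in Python, using simulated data in place of a real dataset (the variables x and y are placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)   # simulated stand-in data

df = pd.DataFrame({"x": x, "y": y})
print(df.describe())                        # summary statistics

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot(df["y"])                    # boxplot of the response
axes[0].set_title("Boxplot of y")
axes[1].hist(df["y"], bins=20)              # histogram of the response
axes[1].set_title("Histogram of y")
axes[2].scatter(df["x"], df["y"], s=10)     # scatterplot of y against x
axes[2].set_title("y vs x")
plt.show()
```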
22 / 33
SCATTERPLOT
23 / 33
MODEL FITTING
Data are generated from a true (unknown) model:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i.$
What we observe:
• Only $n$ pairs of values $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$.
Find the best estimated model:
$\hat{Y}_i = b_0 + b_1 X_i,$
i.e., find the straight line that is “closest” to all the observed
data points.
24 / 33
SCATTERPLOT
25 / 33
MODEL FITTING - LEAST SQUARES ESTIMATION
For each observation pair $(X_i, Y_i)$ we consider the deviation/distance
of $Y_i$ from its expected value, $Y_i - E(Y_i)$, given by
$Y_i - (\beta_0 + \beta_1 X_i).$
The method of “least squares” considers the sum of the $n$ squared
deviations
$Q = \sum_{i=1}^{n} \left( Y_i - (\beta_0 + \beta_1 X_i) \right)^2.$
The estimators of $\beta_0$ and $\beta_1$ are the values $b_0$ and $b_1$ that minimise $Q$,
given the observation pairs $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$.
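Before deriving the closed form (next slide), the criterion can be minimised numerically. A sketch using scipy on simulated data; the true values $\beta_0 = 2$, $\beta_1 = 0.5$ are assumptions of the toy example:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=4)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)   # toy data: true beta0 = 2, beta1 = 0.5

def Q(beta):
    """Sum of squared deviations of Y_i from beta0 + beta1 * X_i."""
    b0, b1 = beta
    return np.sum((y - (b0 + b1 * x)) ** 2)

res = minimize(Q, x0=[0.0, 0.0])   # numerical search for the minimiser of Q
print(res.x)                       # close to (2, 0.5)
```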
26 / 33
MODEL FITTING - LEAST SQUARES (LS) ESTIMATION
Differentiating $Q$ with respect to $\beta_0$ and $\beta_1$, we obtain:
$\frac{\partial Q}{\partial \beta_0} = -2 \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i),$
$\frac{\partial Q}{\partial \beta_1} = -2 \sum_{i=1}^{n} X_i (Y_i - \beta_0 - \beta_1 X_i).$
We then set these partial derivatives equal to zero, using $b_0$ and $b_1$
to denote the particular values of $\beta_0$ and $\beta_1$ that minimise $Q$, i.e.,
$\sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i) = 0,$
$\sum_{i=1}^{n} X_i (Y_i - b_0 - b_1 X_i) = 0.$
These are the so-called “normal equations”. Solving them, we get
$b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{S_{xy}}{S_{xx}}, \qquad b_0 = \bar{Y} - b_1 \bar{X}.$
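A direct translation of these formulas into Python, with numpy's built-in least-squares fit as a cross-check (the data are simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=5)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(size=100)   # simulated data for illustration

xbar, ybar = x.mean(), y.mean()
Sxy = np.sum((x - xbar) * (y - ybar))      # sum of cross-products about the means
Sxx = np.sum((x - xbar) ** 2)              # sum of squares of X about its mean

b1 = Sxy / Sxx            # slope estimate
b0 = ybar - b1 * xbar     # intercept estimate
print(b0, b1)

# Cross-check: np.polyfit solves the same least-squares problem
# (coefficients are returned highest degree first).
b1_np, b0_np = np.polyfit(x, y, deg=1)
print(b0_np, b1_np)       # agrees with (b0, b1) up to floating point
```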
27 / 33
MODEL FITTING - LEAST SQUARES (LS) ESTIMATION
28 / 33
MODEL FITTING - LEAST SQUARES (LS) ESTIMATION
29 / 33
PROPERTIES OF LS ESTIMATORS
Unbiased
• $E(b_0) = \beta_0$, $E(b_1) = \beta_1$.
Minimum variance
• More precise/efficient than any other linear unbiased estimator
(the Gauss-Markov theorem).
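A small simulation sketch of unbiasedness: over repeated samples from the same model, the averages of $b_0$ and $b_1$ settle near the true parameters (the parameter values are assumed for the toy example):

```python
import numpy as np

rng = np.random.default_rng(seed=6)
beta0, beta1, n, n_reps = 2.0, 0.5, 50, 5_000
x = np.linspace(0, 10, n)   # fixed design, matching the model assumptions

b0s, b1s = np.empty(n_reps), np.empty(n_reps)
for r in range(n_reps):
    y = beta0 + beta1 * x + rng.normal(size=n)   # fresh errors each replication
    b1s[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0s[r] = y.mean() - b1s[r] * x.mean()

print(b0s.mean(), b1s.mean())   # close to beta0 = 2 and beta1 = 0.5,
                                # illustrating E(b0) = beta0, E(b1) = beta1
```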
30 / 33
MORE THAN FITTING A MODEL
• Fitting a model is the easy part.
• Consider appropriateness of the model.
• Ensuring the assumptions are met.
• Diagnostics for a model to check for validity and significance.
• Remedies for violations of assumptions.
• Finally, make inferences and predictions.
31 / 33
PITFALLS IN LINEAR REGRESSION
• Is a linear model the right model, based on theory?
• Correlation does not mean causation.
• Omitted variable bias
• Study finds “Golfers more prone to heart disease, cancer
and arthritis”.
• Modelling mistake: the effect of age was omitted.
• Multicollinearity:
• Child’s educational performance predicted by ‘mother’s
education’ and ‘father’s education’.
• Extrapolating beyond the data.
• Data mining (too many variables).
32 / 33
NEXT STEPS FOR YOU...
Read Ch 1.1-1.6 of the textbook.