ECON3203 Econometric Theory and Methods
Linear Regression 1
A/Prof. Minh-Ngoc Tran
Table of contents
1. Simple Linear Regression
Recommended reading
• Chapter 3, An Introduction to Statistical Learning with
Applications in R by James et al.: easy to read, occasionally
sloppy in its discussions, comes with R/Python code for
practice.
• Chapter 3, The Elements of Statistical Learning by Hastie et
al.: well written, deep in theory, suitable for students with a
sound maths background.
Spending Data
The Spending Data contains information on 1000 customers in a
customer database for the company Direct Marketing (DM). The
data consist of
• AmountSpent ($): the amount spent by each customer in one
year on DM products.
• Salary ($)
• Catalogs: number of shopping catalogs sent to each
customer per year
• Children: number of children each customer has
• and many others
The company would like to build a model to explain/predict the
amount each customer spends on its products, based on the
other variables.
Spending Data
Spending Data
Questions that might be asked:
• Is there a relationship between the spending amount and the
number of catalogs, for example?
• How is salary associated with the spending amount?
• How do salary and the number of children interact in their
association with the spending amount?
• What is the best subset of the explanatory variables in terms
of predicting the spending amount?
• Given a particular customer, how much would the company
expect him/her to spend?
• ...
Linear Regression
Y = β0 + β1X1 + ...+ βpXp + ϵ
• Y : the response variable or dependent variable to be predicted
• X1, ..., Xp: potential predictors or covariates or independent
variables
• the intercept β0 and slope coefficients β1, ..., βp are unknown
population parameters to be estimated
• ϵ: error term or something that can’t be explained by the
model. Assume E(ϵ) = 0 and V(ϵ) = σ2, σ2 is also unknown
Caution: the model imposes many assumptions that need to be
checked/tested! Y should be continuous; the model is not suitable
for categorical/binary response variables. The Xj can be
continuous, discrete or categorical.
Linear Regression
Y = β0 + β1X1 + ...+ βpXp + ϵ
• Linear regression might sound too simplistic: it is hardly
ever true that Y depends exactly linearly on the Xj ’s
• But linear regression is extremely useful and important. It
forms a basic framework for non-linear regression and many
advanced regression models
• The linearity is in terms of the coefficients βj ’s, not the Xj ’s.
E.g.,
Y = β0 + β1X1 + β2e^(X2) + β3X1³ + β4X1X2 + ϵ
is still a linear regression model!
• Precisely, X1, X2 are called covariates. In this model, X1, X1³,
e^(X2) and X1X2 are called predictors or features
Simple Linear Regression
Simple Linear Regression
Y = β0 + β1X + ϵ, E(ϵ) = 0, V(ϵ) = σ2
So the conditional mean of Y given X = x is a linear function of x
µY |X=x = E(Y |X = x) = β0 + β1x
Our working example
AmountSpent = β0 + β1 Catalogs + ϵ
Simple Linear Regression
Our working example
AmountSpent = β0 + β1 Catalogs + ϵ
Let β̂0 and β̂1 be estimates of β0 and β1 respectively. Then
µ̂Y |X=x = β̂0 + β̂1x
is a point estimate of E(Y |X = x). This is an estimate of the
average spending amount among all customers who are sent x
catalogs per year.
We also use ŷ = β̂0 + β̂1x to predict the spending amount of an
individual customer who is sent x catalogs.
Estimating the coefficients by the least squares method
• ŷi = β̂0 + β̂1xi is the prediction of the observation yi when
X = xi. So ei = yi − ŷi represents the ith residual
• We define the residual sum of squares (RSS) as
RSS(β̂0, β̂1) = ∑_{i=1}^n ei² = ∑_{i=1}^n ( yi − (β̂0 + β̂1xi) )².
Note that {yi, xi, i = 1, ..., n} is the training dataset.
• The best β̂0 and β̂1 will be the ones that minimise
RSS(β̂0, β̂1).
• This is known as the Least Squares Method. No probability
distribution of ϵ is needed.
Estimating the coefficients by the least squares method
• β̂0 = ...
• β̂1 = ...
Estimating the coefficients by the least squares method
β̂0 = ȳ − β̂1x̄
β̂1 = ∑(xi − x̄)(yi − ȳ) / ∑(xi − x̄)² = ( (1/n)∑xiyi − x̄ȳ ) / ( (1/n)∑xi² − (x̄)² )
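As a quick sanity check, the closed-form formulas above can be computed on simulated data and compared with numpy's built-in least squares fit. This is a minimal sketch: the variable names and numbers are made up for illustration, not taken from the Spending Data.

```python
import numpy as np

# Simulate a dataset roughly in the spirit of the Spending Data example
# (all numbers here are made up for illustration).
rng = np.random.default_rng(0)
n = 1000
x = rng.integers(5, 25, size=n).astype(float)   # e.g. catalogs sent
y = 200.0 + 40.0 * x + rng.normal(0.0, 50.0, size=n)

# Closed-form least squares estimates
xbar, ybar = x.mean(), y.mean()
beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta0_hat = ybar - beta1_hat * xbar

# Cross-check against numpy's polynomial least squares fit
b1_np, b0_np = np.polyfit(x, y, deg=1)
```

Both routes give the same estimates, and with n = 1000 observations the estimates sit close to the true coefficients used in the simulation.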
Estimation by the maximum likelihood method
• Assume that ϵi ∼ N (0, σ2), i = 1, ..., n. Note that
yi = β0 + β1xi + ϵi
Therefore yi ∼ N (β0 + β1xi, σ2).
• E.g., the amounts spent by customers who are sent x catalogs
are normally distributed with mean β0 + β1x and variance σ2.
Seems a reasonable assumption!
• The likelihood function is
p(y|β0, β1, σ²) = ∏_{i=1}^n (1/√(2πσ²)) exp( −(yi − β0 − β1xi)² / (2σ²) )
• Maximising this likelihood with respect to β0 and β1 leads
exactly to the same least squares estimates β̂0 and β̂1
• Estimate of σ²: σ̂² = (1/n) ∑_{i=1}^n ( yi − (β̂0 + β̂1xi) )².
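The claim that maximising this likelihood reproduces the least squares estimates can be verified numerically. The sketch below uses simulated data and scipy's general-purpose optimiser; everything in it is assumed for illustration only.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.5, size=n)

def neg_loglik(theta):
    # Negative normal log-likelihood; sigma^2 is parameterised on the
    # log scale so the optimiser can search over the whole real line.
    b0, b1, log_s2 = theta
    s2 = np.exp(log_s2)
    resid = y - b0 - b1 * x
    return 0.5 * n * np.log(2.0 * np.pi * s2) + np.sum(resid ** 2) / (2.0 * s2)

mle = minimize(neg_loglik, x0=np.zeros(3)).x

# Closed-form least squares estimates for comparison
b1_ls = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_ls = y.mean() - b1_ls * x.mean()
```

The numerically maximised likelihood returns (β̂0, β̂1) matching the least squares formulas, and its σ² estimate matches the 1/n average of squared residuals.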
Brief introduction to Maximum Likelihood Estimation
Given data y = {y1, ..., yn}. In almost all areas of application, we
assume the data come from a statistical model depending on
a vector of unknown parameters θ. This model allows us to write
down the density p(yi|θ) of yi.
Example. yi iid∼ N (µ, 1), i = 1, 2, . . . , n. The joint density of y is
p(y|µ) = ∏_{i=1}^n p(yi|µ) = ∏_{i=1}^n (1/√(2π)) e^{−(yi−µ)²/2} = (1/√(2π))^n exp[ −(1/2) ∑_{i=1}^n (yi − µ)² ]
This function, considered as a function of µ, measures how likely a
value of µ is as the underlying parameter that generated the data
{y1, . . . , yn}
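This can be made concrete by evaluating the log-likelihood over a grid of µ values: the maximising value lands next to the sample mean. A sketch with simulated data (the sample below is assumed purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(3.0, 1.0, size=50)   # simulated sample, true mu = 3

def loglik(mu):
    # Log of the joint density above, viewed as a function of mu
    return -0.5 * len(y) * np.log(2.0 * np.pi) - 0.5 * np.sum((y - mu) ** 2)

grid = np.linspace(0.0, 6.0, 601)   # grid over mu with step 0.01
mu_best = grid[np.argmax([loglik(m) for m in grid])]
# mu_best sits at the grid point nearest the sample mean y.mean()
```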
Brief introduction to Maximum Likelihood Estimation
Let y = {y1, . . . , yn} be a random sample from a distribution with
pdf p(y|θ). The likelihood function, as a function of θ, is defined as
p(y|θ) =
n∏
i=1
p(yi|θ).
• This likelihood function reflects the probability of observing
the data y if θ is the true parameter
• We wish to estimate the (unknown) true value of θ that
generated the data y by those values that maximise p(y|θ).
• The maximum likelihood estimator (MLE) of θ is a value that
maximises p(y|θ).
• MLE is one of the most popular estimation methods.
Maximum Likelihood Estimator
Exercise. Suppose that {x1 = 5, 0, 1, 1, 0, 3, 2, 3, 4, x10 = 1} are
n = 10 observations from the Poisson distribution with pdf
f(y|θ) = e^{−θ} θ^y / y!
• Write down the log-likelihood function for the sample.
• Find the MLE of θ
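One way to check your derivation numerically is to evaluate the log-likelihood on a fine grid of θ values (a sketch only; this brute-force search stands in for solving the first-order condition by hand):

```python
import numpy as np
from math import lgamma

y = np.array([5, 0, 1, 1, 0, 3, 2, 3, 4, 1])
log_fact = np.array([lgamma(v + 1) for v in y])   # log(y!) for each observation

def loglik(theta):
    # Poisson log-likelihood: sum over i of -theta + y_i*log(theta) - log(y_i!)
    return np.sum(-theta + y * np.log(theta) - log_fact)

grid = np.linspace(0.5, 5.0, 4501)   # grid over theta with step 0.001
theta_hat = grid[np.argmax([loglik(t) for t in grid])]
```

The grid maximiser agrees with the sample mean of the ten observations, which is what the analytical derivation should give.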
Spending data example
Properties of the estimators
Under the assumption that ϵi ∼ N(0, σ²), the distribution of
β̂ = (β̂0, β̂1)′ is multivariate normal with mean (β0, β1)′ and
covariance matrix
(σ²/(n s²x)) × [ (1/n)∑xi²   −x̄
                 −x̄            1 ]
where
x̄ = (1/n)∑xi,  s²x = (1/n)∑(xi − x̄)² = (1/n)∑xi² − (x̄)²
That is
β̂0 ∼ N( β0, σ²·(1/n)∑xi² / (n s²x) ),  β̂1 ∼ N( β1, σ²/(n s²x) )
The standard errors of the estimators are
se(β̂0) = (σ/(√n sx))·√((1/n)∑xi²),  se(β̂1) = σ/(√n sx)
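The variance formula for β̂1 can be checked by simulation: across many datasets generated from the model with the design held fixed, the spread of the β̂1 estimates should match σ/(√n sx). A sketch, with all settings below made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, beta0, beta1 = 100, 2.0, 1.0, 0.5
x = rng.uniform(0.0, 10.0, size=n)   # design held fixed across replications
sx = x.std()                         # s_x with the 1/n convention used above

estimates = []
for _ in range(5000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    estimates.append(b1)

mc_se = np.std(estimates)                # Monte Carlo standard deviation
theory_se = sigma / (np.sqrt(n) * sx)    # se(beta1_hat) from the formula
```

With 5000 replications the Monte Carlo standard deviation agrees with the theoretical standard error to within a few percent, and the average of the estimates sits on the true β1 (the estimator is unbiased).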
Uncertainty of the estimates
σ is estimated by
σ̂ = √( (1/n)∑(yi − ŷi)² )
So the standard errors of β̂0 and β̂1 are estimated by
ŝe(β̂0) = (σ̂/(√n sx))·√((1/n)∑xi²),  ŝe(β̂1) = σ̂/(√n sx)
The 100(1 − α)% confidence interval for βj is
β̂j ± t_{n−2,α/2} ŝe(β̂j), j = 0, 1.
Here t_{ν,p} denotes the number such that
P(tν > t_{ν,p}) = p, p ∈ [0, 1],
where tν is a random variable with Student’s t distribution with ν
degrees of freedom.
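A sketch of the interval on simulated data, using scipy for the t quantile (the data and settings are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 50
x = rng.uniform(0.0, 10.0, size=n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)

# Least squares fit and the MLE of sigma (divides by n, as above)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sigma_hat = np.sqrt(np.mean((y - b0 - b1 * x) ** 2))

se_b1 = sigma_hat / (np.sqrt(n) * x.std())
alpha = 0.05
t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=n - 2)   # t_{n-2, alpha/2}
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)      # 95% CI for beta1
```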
Testing about the population coefficients
Question of interest: is there a relationship between Salary and
AmountSpent?
H0 : β1 = 0 vs. H1 : β1 ≠ 0
• Test statistic
tstat = (β̂1 − 0) / ŝe(β̂1)
• Under the null hypothesis H0, tstat ∼ tn−2, i.e. it follows the
Student’s t distribution with n− 2 degrees of freedom.
• Statistical software often reports the p-value; for this
two-sided test,
p-value = 2P(tn−2 > |tstat|)
• The smaller this p-value, the stronger the evidence against
H0. Typical p-value cutoffs: 0.05 or 0.01.
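The test can be sketched end to end on simulated data (scipy assumed available; the true slope below is nonzero, so the p-value comes out tiny and H0 is rejected):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 40
x = rng.uniform(0.0, 10.0, size=n)
y = 2.0 + 0.8 * x + rng.normal(0.0, 1.0, size=n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
sigma_hat = np.sqrt(np.mean((y - b0 - b1 * x) ** 2))
se_b1 = sigma_hat / (np.sqrt(n) * x.std())

t_stat = b1 / se_b1                                 # H0: beta1 = 0
p_value = 2.0 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
```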
Spending data example
Similarly, we can test H0 : β0 = 0 vs. H1 : β0 ≠ 0. But this test is
usually not statistically interesting!