Date: 2025-03-13

STAT7038
Regression Modelling
Semester 1 2025
Week 1
Course Information
Teaching staff
• Lecturer: Yuan Gao (yuan.gao@anu.edu.au)
• Consultation: Wednesday 2-3 pm (Zoom link) or after each lecture
• Tutors:
• Houren Hong (head tutor): houren.hong@anu.edu.au
• Ruby Turner: ruby.turner@anu.edu.au
• Leo (Tiancheng) Huang: tiancheng.huang@anu.edu.au
• Chuang Xu: chuang.xu@anu.edu.au
• Rui Shen: rui.shen@anu.edu.au
• Tutors’ consultation hours to be updated on Wattle.
Zoom
• Log in to Zoom with your university account.
• The Zoom client must be installed.
• See information on:
• https://services.anu.edu.au/information-technology/software-systems/anu-zoom-client
Communication
• Use consultation time.
• Use the discussion forum on Wattle.
• To be fair to all students, the lecturer will read but may not be able to
reply to emails about course materials (except for private questions).
Instead, please post such questions in the discussion forum. The lecturer
and tutors will reply there.
• Please get in touch with the lecturer for issues and concerns including
grades, illness, falling behind, and academic accessibility issues.
Tutorials
• Begin in Week 2.
• You should read through the tutorial sheet and attempt the questions
before going to the tutorials.
• Best opportunity to learn skills and techniques that will be required in
the assignments.
• Your tutors are your main source for help.
Textbook
• The required textbook for this course is: Applied Linear Regression
Models, 4th ed., by Michael H. Kutner et al.
• Free ebook from ANU library: link on Wattle.
• Linear Models with R by Julian J. Faraway is another good resource.
Wattle site
• Access to all enrolled students
• Course announcements
• Lecture resources
• Echo360 lecture recordings
• Data sets
• Tutorial questions, selected solutions
• Assessments
• Please check this site frequently!
Assessment
• Must complete independently!
Assessment Task      Value   Due Date
Online Quiz          5%      Week 5
Assignment 1         15%     Week 6
Assignment 2         15%     Week 11
Final Examination    65%     Central Exam Period
Introduction to R and RStudio
R and RStudio
• R is a programming language and free software environment
for statistical computing and graphics supported by the R Foundation for
Statistical Computing.
• Please see the course website for installation instructions for R and
RStudio (suggest choosing English in language options).
• You may attempt Tutorial Week 2 - Intro to R before your first tutorial.
• Learn R cheatsheet, p1-4.
• This course ≠ R: understanding the statistical concepts matters more
than the software.

Your R project
• Set your working directory.
• Write your code in an R script file.
• Import data from an external file.
• “read” functions, e.g. read.csv()
• How to get help in R?
• ? or ??
• Google!
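The workflow above might look like the following minimal sketch. The file name sales.csv and its columns are hypothetical, and the file is created in a temporary folder only so that the example runs anywhere; in your own project you would point setwd() at your project folder.

```r
# Create a small hypothetical data file so the example is self-contained.
data_file <- file.path(tempdir(), "sales.csv")
write.csv(data.frame(advertising = c(10, 20, 30), sales = c(25, 41, 62)),
          data_file, row.names = FALSE)

# Set the working directory (setwd() returns the previous one).
old_wd <- setwd(tempdir())

# Import data from an external file with one of the "read" functions.
sales <- read.csv("sales.csv")

# Restore the previous working directory.
setwd(old_wd)

# Inspect what was imported.
str(sales)

# Getting help (uncomment to try in an interactive session):
# ?read.csv          # help page for a function you know by name
# ??"linear model"   # keyword search across installed help pages
```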
Data types in R
• Three basic types:
• Numeric (numbers)
• Character (names)
• Logical (TRUE / FALSE)
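A quick way to see the three basic types is to ask R for the class of a value. The variable names below are only illustrative.

```r
x <- 3.14          # numeric
name <- "Galton"   # character
flag <- x > 3      # logical, the result of a comparison

class(x)
class(name)
class(flag)
```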
Data structures
• R operates on named data structures.
• A vector is a single entity consisting of an ordered collection of numbers.
• Matrices or more generally arrays are multi-dimensional generalisations of
vectors.
• Lists are a general form of vector in which the various elements need not be
of the same type and are often themselves vectors or lists.
• Data frames are matrix-like structures, in which the columns can be of
different types. Think of data frames as “data matrices” with one row per
observation but with (possibly) both numerical and categorical variables.
• Give meaningful names to your data.
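As a small sketch, each of these structures can be built with one base R call. All values and names here are made up for illustration.

```r
# Vector: an ordered collection of numbers
heights <- c(170.2, 165.5, 180.1)

# Matrix: a two-dimensional generalisation of a vector
m <- matrix(1:6, nrow = 2, ncol = 3)

# List: elements may have different types and lengths
child <- list(name = "A", heights = heights)

# Data frame: one row per observation, columns of different types
families <- data.frame(
  family = c("F1", "F2", "F3"),
  child_height = heights,
  stringsAsFactors = FALSE
)

dim(m)
str(child)
str(families)
```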
R packages
• In R, a package is a structured collection of R functions, data, and
compiled code that enhances the capabilities of the base R environment.
• R comes with a standard set of packages, and many more are available
for download and installation. Once installed, a package's contents can
be made available in the current R session by loading the package.
• install.packages("x")
• library("x")
• The Comprehensive R Archive Network (CRAN) is a central repository
that hosts a vast array of R packages contributed by users worldwide.
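A hedged sketch of the install-then-load pattern follows. It uses the stats package, which ships with R, purely so that the example runs without downloading anything; for a CRAN package you would uncomment the install.packages() line once.

```r
pkg <- "stats"   # stand-in package name; ships with every R installation

# install.packages(pkg)   # needed once for packages that are not yet installed

# Load the package only if it is available on this machine.
if (requireNamespace(pkg, quietly = TRUE)) {
  library(pkg, character.only = TRUE)
}

# Check that the package's namespace is now loaded.
loaded <- pkg %in% loadedNamespaces()
loaded
```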
Revision on Basic Statistics
Population & sample
• Population (true world)
• A collection of the whole of something
• Parameters: true values describing the population
• E.g. $\mu$, $\sigma^2$
• Unknown
• Sample (your subjective world)
• A set of individuals drawn from a population
• Statistics: calculated from the sample and used as estimates of the
parameters
• E.g. $\bar{X}$, $S^2$
• Known
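The distinction can be illustrated with a simulation in which, unlike in practice, the population is available. The population below (100,000 heights with mean 170 and standard deviation 10) is an assumption made only for this sketch.

```r
set.seed(1)

# Hypothetical population; in real studies it is never fully observed.
population <- rnorm(100000, mean = 170, sd = 10)
mu <- mean(population)          # parameter (unknown in practice)

# A sample of 50 individuals drawn from the population.
smp <- sample(population, 50)
x_bar <- mean(smp)              # statistic estimating the population mean
s2 <- var(smp)                  # statistic estimating the population variance

c(mu = mu, x_bar = x_bar, s2 = s2)
```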
Properties of estimators
• Estimators are random variables (e.g. $\bar{X}$)
• Probability distribution
• $E(\bar{X}) = \mu$, $\mathrm{Var}(\bar{X}) = \sigma^2 / n$
• Central Limit Theorem (CLT)
• $\bar{X}$ is asymptotically normally distributed
• Make inferences
• Confidence interval
• Hypothesis testing
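A simulation can make these properties concrete. The sketch below assumes an Exponential(1) population (mean 1, variance 1), chosen only because it is clearly non-normal; it checks that sample means centre on the population mean with variance close to 1/n, and then builds an approximate 95% confidence interval from a single sample.

```r
set.seed(2)
n <- 40

# 5000 sample means, each from a fresh sample of size n.
x_bars <- replicate(5000, mean(rexp(n, rate = 1)))

mean(x_bars)   # should be close to the population mean, 1
var(x_bars)    # should be close to 1 / n
# hist(x_bars) # roughly bell-shaped despite the skewed population

# Approximate 95% confidence interval for the mean from one sample,
# using the normal approximation justified by the CLT.
x <- rexp(n, rate = 1)
ci <- mean(x) + c(-1, 1) * qnorm(0.975) * sd(x) / sqrt(n)
ci
```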
Linear Regression
Regression analysis
• Statistical methodology that utilises the relation between two or more
quantitative variables so that a response or outcome variable can be
predicted from the other (or others).
• This methodology is widely used in business, the social and behavioural
sciences, the biological sciences, and many other disciplines.
Regression analysis
• Examples
• Predict sales of a product using the relationship between sales and the
amount spent on advertising. (SLR)
• Predict performance of employee using relationship between performance
and aptitude test. (SLR)
• Predict the size of the vocabulary of a child using the relationship between
the size of vocabulary and the age of the child and the amount of education
of the parents. (MLR)
Relation between Variables
• We should distinguish between functional relation and a statistical
relation between variables.
• A functional relation between two variables is expressed by a mathematical
formula, $Y = f(X)$
• A functional relation is a “perfect” mapping from X to Y.
• A statistical relation is not perfect. The observations do not fall directly
on the curve of relationship and they are typically scattered around this
curve.
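The difference between the two kinds of relation can be sketched as follows. The line 2 + 3x and the noise level are arbitrary choices for illustration.

```r
set.seed(3)
x <- seq(1, 10, length.out = 50)

# Functional relation: every point lies exactly on the line.
y_functional <- 2 + 3 * x

# Statistical relation: points are scattered around the same line.
y_statistical <- 2 + 3 * x + rnorm(50, sd = 2)

max(abs(y_functional - (2 + 3 * x)))   # no deviation from the line
sd(y_statistical - (2 + 3 * x))        # typical size of the scatter

# plot(x, y_statistical); lines(x, y_functional)   # uncomment to view
```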
Regression models
• Historical Origins
• The term regression was first used by Francis Galton in the late 19th century
to explain a biological phenomenon he observed: “regression towards the
mean”.
• The height of children of both tall and short parents appeared to “revert” or
“regress” to the mean of the group.
Galton Families Dataset
• This data set lists the individual observations for 934 children in 205
families on which Galton (1886) based his cross-tabulation.
• How to formally describe the relationship?
Construction of regression models
• Selection of variables
• X: Independent variable, predictor, regressor, covariate
• Y: Dependent variable, response, outcome, output
• Only a limited number of useful covariates should be included in the
regression model
• How do you choose? Through exploratory studies, theory, etc.
Construction of regression models
• Functional form of regression relation
• Choice of $f$ in the functional form $Y = f(X)$ is tied to the choice of
covariate(s).
• Sometimes the relevant theory may indicate the appropriate form for $f$.
• Typically $f$ needs to be determined empirically from the data. A scatter
plot may help.
• Linear or quadratic regression functions are often a good first approximation.
Construction of regression models
• Scope of model
• We usually need to restrict the coverage of the model to some interval or
region of values.
• The scope is determined either by the design of the investigation or by the
range of data at hand.
• The model may perform badly given previously unobserved data.
Use of regression
• Regression serves three major purposes:
• Description (how one variable influences the other)
• Control (Set standards, monitor operations, etc.)
• Prediction (Given new observations)
Regression and causality
• Existence of a statistical relation between response $Y$ and covariate $X$
does not imply in any way that $Y$ depends causally on $X$
• (correlation ≠ causation)
• Funny examples?
• High ice-cream sales lead to high drowning cases?
• Reverse causality: $X$ leads to $Y$ or $Y$ leads to $X$?
• To reach causality conclusions, experimental studies should be
conducted.
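The ice-cream example can be mimicked with simulated data. The sketch below assumes a common cause, daily temperature, that drives both variables; this confounder and all numbers are assumptions introduced for illustration, and neither variable affects the other in the simulation.

```r
set.seed(7)
n <- 365

# Hypothetical common cause.
temperature <- runif(n, 5, 35)

# Both variables depend on temperature but not on each other.
ice_cream <- 20 + 3 * temperature + rnorm(n, sd = 10)
drownings <- 1 + 0.2 * temperature + rnorm(n, sd = 1.5)

# Strong positive correlation despite no causal link between the two.
raw_cor <- cor(ice_cream, drownings)

# After removing the linear effect of temperature from each variable,
# the remaining association is close to zero.
adj_cor <- cor(resid(lm(ice_cream ~ temperature)),
               resid(lm(drownings ~ temperature)))

c(raw = raw_cor, adjusted = adj_cor)
```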
Simple linear regression model (SLR)
Formal statement of the SLR
• One predictor variable
• Linear
• $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, $i = 1, \dots, n$
• Where
• $Y_i$: the value of the response variable in the $i$th trial.
• $\beta_0$ and $\beta_1$ are parameters.
• $X_i$: a known constant, the covariate value in the $i$th trial.
• $\varepsilon_i$: a random error term with mean 0 and variance $\sigma^2$ for all $i$.
• $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated for all $i \neq j$.
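To see what data from such a model look like, one can simulate it. The parameter values below are hypothetical, and the errors are drawn from a normal distribution only for convenience; the model statement itself requires just mean 0, constant variance and uncorrelated errors.

```r
set.seed(4)
n <- 100

# Hypothetical parameter values.
beta0 <- 1
beta1 <- 0.5
sigma <- 2

X <- seq(0, 20, length.out = n)          # known constants
eps <- rnorm(n, mean = 0, sd = sigma)    # independent errors, mean 0, variance sigma^2
Y <- beta0 + beta1 * X + eps             # responses

head(data.frame(X, Y))
# plot(X, Y); abline(beta0, beta1)       # uncomment to view
```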
Important features of SLR
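• The response $Y_i$ is a random variable since it is the sum of two
components:
• The constant term $\beta_0 + \beta_1 X_i$
• The random error $\varepsilon_i$.
• Since $E(\varepsilon_i) = 0$, it follows that $E(Y_i) = E(\beta_0 + \beta_1 X_i + \varepsilon_i) = \beta_0 + \beta_1 X_i + E(\varepsilon_i) = \beta_0 + \beta_1 X_i$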
Important features of SLR
• $Y_i$ is a random variable whose probability distribution has mean $E(Y_i) = \beta_0 + \beta_1 X_i$
• It is more reasonable to describe the linear regression model as $E(Y) = \beta_0 + \beta_1 X$
• $Y_i$ is a random variable whose probability distribution has variance $\mathrm{Var}(Y_i) = \mathrm{Var}(\beta_0 + \beta_1 X_i + \varepsilon_i) = \mathrm{Var}(\varepsilon_i) = \sigma^2$
• Our model assumes that the $Y_i$'s come from probability distributions with
mean $\beta_0 + \beta_1 X_i$ and variance $\sigma^2$.
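These two results can be checked numerically by drawing many responses at one fixed covariate value. All parameter values are hypothetical, and normal errors are an assumption made only for the simulation.

```r
set.seed(5)

# Hypothetical parameters and a single fixed covariate value.
beta0 <- 1
beta1 <- 0.5
sigma <- 2
x_i <- 4

# Many independent draws of the response at X = x_i.
y_i <- beta0 + beta1 * x_i + rnorm(20000, mean = 0, sd = sigma)

mean(y_i)   # close to beta0 + beta1 * x_i = 3
var(y_i)    # close to sigma^2 = 4
```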
The distribution of $Y_i$
Regression parameters
• The parameters are called regression coefficients
• The intercept: $\beta_0$
• The slope: $\beta_1$
• The slope gives the change in the mean of $Y$ per unit increase in $X$
• The intercept (when the scope of the model includes $X = 0$) gives the
mean of the probability distribution of $Y$ at $X = 0$
Fitting the model
• Data are generated from a true (unknown) model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
• What we observe:
• Only pairs of values $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$
• Find the best estimated model: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$,
• meaning finding a straight line that is “closest” to all the observed data
points.
Fitting the model—Method of least squares
• What’s the “closest” straight line to all observed data points?
• For the observations $(X_i, Y_i)$ for each case, we consider the deviation of $Y_i$
from its expected value: $Y_i - (\beta_0 + \beta_1 X_i)$
• The method of least squares considers the sum of the $n$ squared deviations:
$Q = \sum_{i=1}^{n} \left[ Y_i - (\beta_0 + \beta_1 X_i) \right]^2$
• The estimators of $\beta_0$ and $\beta_1$ are the values $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimise $Q$ given
the observation pairs $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$.
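For reference, the minimisation can be carried out with calculus. The sketch below writes $Q$ for the least squares criterion, $\hat{\beta}_0$ and $\hat{\beta}_1$ for the minimising values, and $\bar{X}$, $\bar{Y}$ for the sample means; setting both partial derivatives to zero gives the normal equations, whose solution is the standard closed form.

```latex
\begin{aligned}
Q(\beta_0, \beta_1) &= \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_i \right)^2 \\
\frac{\partial Q}{\partial \beta_0} &= -2 \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_i \right) = 0 \\
\frac{\partial Q}{\partial \beta_1} &= -2 \sum_{i=1}^{n} X_i \left( Y_i - \beta_0 - \beta_1 X_i \right) = 0 \\
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
\end{aligned}
```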
Properties of LS estimators
• Unbiased: $E[\hat{\beta}_0] = \beta_0$, $E[\hat{\beta}_1] = \beta_1$
• Minimum variance
• More precise (efficient) than any other unbiased linear estimator.
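Unbiasedness can be illustrated by simulation. The sketch below uses hypothetical parameter values and a helper function ls_fit (a name introduced here, not part of R) that computes the standard closed-form least squares estimates; it first checks the helper against R's lm(), then averages the estimates over many simulated data sets.

```r
set.seed(6)

# Hypothetical true parameters and a fixed design.
beta0 <- 1
beta1 <- 0.5
sigma <- 2
X <- seq(0, 10, length.out = 30)

# Closed-form least squares estimates for simple linear regression.
ls_fit <- function(X, Y) {
  b1 <- sum((X - mean(X)) * (Y - mean(Y))) / sum((X - mean(X))^2)
  b0 <- mean(Y) - b1 * mean(X)
  c(b0 = b0, b1 = b1)
}

# One data set: the closed form agrees with lm().
Y <- beta0 + beta1 * X + rnorm(length(X), sd = sigma)
closed_form <- ls_fit(X, Y)
lm_coef <- coef(lm(Y ~ X))

# Many data sets from the same model: estimates average out near the truth.
estimates <- replicate(
  2000,
  ls_fit(X, beta0 + beta1 * X + rnorm(length(X), sd = sigma))
)
avg_estimates <- rowMeans(estimates)
avg_estimates
```

Any single fit can miss the true values by a fair margin; it is the average over repeated samples that sits close to them.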
More than fitting a model – what needs to be
considered in real practice?
• What is your question of interest?
• Statistical formulation of the question.
• Source of the data
• Sample size, data cleaning such as combining data from different sources,
checking for missing data, data mining (too many variables)
• Exploratory Data Analysis
• Summary statistics, boxplots, histograms, scatterplots, etc.
• What model should be used?
• Linear/non-linear, simple regression/multiple regression
• Fitting a model is the easy part.
• Consider appropriateness of the model.
• Ensuring the assumptions are met.
• Diagnostics for a model to check for validity and significance.
• Remedies for violations of assumptions.
• Finally, make inferences and predictions
• Read Ch 1.1-1.6 of the textbook.
