STAT7038 Regression Modelling
Semester 1 2025
Week 1

Course Information

Teaching staff
• Lecturer: Yuan Gao (yuan.gao@anu.edu.au)
• Consultation: Wednesday 2-3 pm (Zoom link) or after each lecture
• Tutors:
  • Houren Hong (head tutor): houren.hong@anu.edu.au
  • Ruby Turner: ruby.turner@anu.edu.au
  • Leo (Tiancheng) Huang: tiancheng.huang@anu.edu.au
  • Chuang Xu: chuang.xu@anu.edu.au
  • Rui Shen: rui.shen@anu.edu.au
• Tutors' consultation hours will be updated on Wattle.

Zoom
• You need to use your school account to log in to Zoom.
• Installation is required.
• See information at:
  • https://services.anu.edu.au/information-technology/software-systems/anu-zoom-client

Communication
• Use consultation time.
• Use the discussion forum on Wattle.
• To be fair to all students, the lecturer will read but may not be able to reply to emails about course materials (except for private questions). Instead, please post such questions in the discussion forum. The lecturer and tutors will reply there.
• Please get in touch with the lecturer about issues and concerns including grades, illness, falling behind, and academic accessibility.

Tutorials
• Begin in Week 2.
• You should read through the tutorial sheet and attempt the questions before going to the tutorials.
• Best opportunity to learn skills and techniques that will be required in the assignments.
• Your tutors are your main source of help.

Textbook
• The required textbook for this course is: Applied Linear Regression Models, 4th ed., by Michael H. Kutner
• Free ebook from the ANU library: link on Wattle.
• Linear Models with R by Julian J. Faraway is another good resource.

Wattle site
• Accessible to all enrolled students
• Course announcements
• Lecture resources
• Echo360 lecture recordings
• Data sets
• Tutorial questions, selected solutions
• Assessments
• Please check this site frequently!

Assessment
• Must be completed independently!
Assessment Task      Value   Due Date
Online Quiz          5%      Week 5
Assignment 1         15%     Week 6
Assignment 2         15%     Week 11
Final Examination    65%     Central Exam Period

Introduction to R and RStudio

R and RStudio
• R is a programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing.
• Please see the course website for installation instructions for R and RStudio (we suggest choosing English in the language options).
• You may attempt Tutorial Week 2 - Intro to R before your first tutorial.
• Learn the R cheatsheet, pp. 1-4.
• This course ≠ R: the more important thing in this course is to understand the statistical concepts.

Your R project
• Set your working directory.
• Write your code in an R script file.
• Import data from an external file.
  • "read" functions
• How to get help in R?
  • ? or ??
  • Google!

Data types in R
• Three basic types:
  • Numeric (numbers)
  • Character (names)
  • Logical (TRUE / FALSE)

Data structures
• R operates on named data structures.
• A vector is a single entity consisting of an ordered collection of numbers.
• Matrices, or more generally arrays, are multi-dimensional generalisations of vectors.
• Lists are a general form of vector in which the various elements need not be of the same type and are often themselves vectors or lists.
• Data frames are matrix-like structures in which the columns can be of different types. Think of data frames as "data matrices" with one row per observation but with (possibly) both numerical and categorical variables.
• Give meaningful names to your data.

R packages
• In R, a package is a structured collection of R functions, data, and compiled code that enhances the capabilities of the base R environment.
• R comes with a standard set of packages, and many more are available for download and installation. Once installed, a package's contents can be made available in the current R session by loading the package.
  • install.packages("x")
  • library("x")
• The Comprehensive R Archive Network (CRAN) is a central repository that hosts a vast array of R packages contributed by users worldwide.

Revision on Basic Statistics

Population & sample
• Population (True world)
  • A collection of the whole of something
  • Parameters: true values describing the population
  • E.g. $\mu$, $\sigma^2$
  • Unknown
• Sample (Your subjective world)
  • A set of individuals drawn from a population
  • Statistics: values calculated from the sample that serve as estimates of the parameters
  • E.g. $\bar{x}$, $s^2$
  • Known

Properties of estimators
• Random variables (e.g. $\bar{X}$)
• Probability distribution
  • $E(\bar{X}) = \mu$, $\mathrm{Var}(\bar{X}) = \sigma^2 / n$
• Central Limit Theorem (CLT)
  • $\bar{X}$ is asymptotically normally distributed
• Make inferences
  • Confidence interval
  • Hypothesis testing

Linear Regression

Regression analysis
• Statistical methodology that utilises the relation between two or more quantitative variables so that a response or outcome variable can be predicted from the other (or others).
• This methodology is widely used in business, the social and behavioural sciences, the biological sciences, and many other disciplines.

Regression analysis
• Examples
  • Predict sales of a product using the relationship between sales and the amount spent on advertising. (SLR)
  • Predict the performance of an employee using the relationship between performance and an aptitude test. (SLR)
  • Predict the size of a child's vocabulary using the relationship between vocabulary size, the age of the child, and the amount of education of the parents. (MLR)

Relation between Variables
• We should distinguish between a functional relation and a statistical relation between variables.
• A functional relation between two variables is expressed as a mathematical formula, $Y = f(X)$.
• A functional relation is a "perfect" mapping from $X$ to $Y$.
• A statistical relation is not perfect. The observations do not fall directly on the curve of relationship; they are typically scattered around this curve.
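
The difference between a functional and a statistical relation can be sketched in R with simulated data. This is only an illustration: the function f, the sample size, the noise level and the seed are all assumptions chosen for the example and do not come from the course materials.

```r
# Illustrative sketch: functional vs statistical relation (simulated data).
# f(x) = 2 + 3x, 50 points and noise sd = 2 are assumed for illustration only.
set.seed(1)

f <- function(x) 2 + 3 * x                          # assumed curve of relationship
x <- seq(0, 10, length.out = 50)

y_functional  <- f(x)                               # falls exactly on the curve
y_statistical <- f(x) + rnorm(length(x), sd = 2)    # scattered around the curve

# Deviations from the curve: all zero for the functional relation,
# non-zero for the statistical relation.
dev_functional  <- y_functional - f(x)
dev_statistical <- y_statistical - f(x)

# Points show the statistical relation; the line shows Y = f(X).
plot(x, y_statistical, pch = 16, xlab = "X", ylab = "Y",
     main = "Statistical relation scattered around Y = f(X)")
lines(x, y_functional, lwd = 2)
```

In the plot, the line is the functional relation and the points are the statistical relation: same underlying curve, but the observations no longer fall exactly on it.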
Regression models
• Historical Origins
  • The term regression was first used by Francis Galton in the late 19th century to explain a biological phenomenon he observed: "regression towards the mean".
  • The heights of children of both tall and short parents appeared to "revert" or "regress" to the mean of the group.

Galton Families Dataset
• This data set lists the individual observations for 934 children in 205 families on which Galton (1886) based his cross-tabulation.
• How to formally describe the relationship?

Construction of regression models
• Selection of variables
  • $X$: Independent variable, predictor, regressor, covariate
  • $Y$: Dependent variable, response, outcome, output
  • Only a limited number of useful covariates should be included in the regression model.
  • How do you choose? Through exploratory studies, theory, etc.

Construction of regression models
• Functional form of regression relation
  • The choice of $f$ in the functional form $Y = f(X)$ is tied to the choice of covariate(s).
  • Sometimes the relevant theory may indicate the appropriate form for $f$.
  • Typically $f$ needs to be determined empirically from the data. A scatter plot may help.
  • Linear or quadratic regression functions are often a good first approximation.

Construction of regression models
• Scope of model
  • We usually need to restrict the coverage of the model to some interval or region of values.
  • The scope is determined either by the design of the investigation or by the range of data at hand.
  • The model may perform badly on data outside the previously observed range.

Use of regression
• Regression serves three major purposes:
  • Description (how one variable influences the other)
  • Control (set standards, monitor operations, etc.)
  • Prediction (given new observations)

Regression and causality
• The existence of a statistical relation between the response $Y$ and the covariate $X$ does not imply in any way that $Y$ depends causally on $X$.
  • (correlation ≠ causation)
• Funny examples?
  • Do high ice-cream sales lead to more drowning cases?
• Reverse causality: does $X$ lead to $Y$, or does $Y$ lead to $X$?
• To reach causal conclusions, experimental studies should be conducted.

Simple linear regression model (SLR)

Formal statement of the SLR
• One predictor variable
• Linear
• $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, $i = 1, \dots, n$
• Where
  • $Y_i$: the value of the response variable in the $i$th trial.
  • $\beta_0$ and $\beta_1$ are parameters.
  • $X_i$: a known constant, the covariate value in the $i$th trial.
  • $\varepsilon_i$: a random error term with mean 0 and variance $\sigma^2$ for all $i$.
  • $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated for all $i \neq j$.

Important features of SLR
• The response $Y_i$ is a random variable since it is the sum of two components:
  • The constant term $\beta_0 + \beta_1 X_i$
  • The random error $\varepsilon_i$.
• Since $E(\varepsilon_i) = 0$, it follows that
  $E(Y_i) = E(\beta_0 + \beta_1 X_i + \varepsilon_i) = \beta_0 + \beta_1 X_i + E(\varepsilon_i) = \beta_0 + \beta_1 X_i$

Important features of SLR
• $Y_i$ is a random variable whose probability distribution has mean $E(Y_i) = \beta_0 + \beta_1 X_i$.
• It is more reasonable to describe the linear regression model as $E(Y) = \beta_0 + \beta_1 X$.
• $Y_i$ is a random variable whose probability distribution has variance
  $\mathrm{Var}(Y_i) = \mathrm{Var}(\beta_0 + \beta_1 X_i + \varepsilon_i) = \mathrm{Var}(\varepsilon_i) = \sigma^2$
• Our model assumes that the $Y_i$'s come from probability distributions with mean $\beta_0 + \beta_1 X_i$ and variance $\sigma^2$.

The distribution of $Y_i$

Regression parameters
• The parameters are called regression coefficients.
  • The intercept: $\beta_0$
  • The slope: $\beta_1$
• The slope gives the change in the mean of $Y$ per unit increase in $X$.
• The intercept (when the scope of the model includes $X = 0$) gives the mean of the probability distribution of $Y$ at $X = 0$.

Fitting the model
• Data generated from a true model (unknown): $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
• What we observe:
  • Only pairs of values $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$
• Find the best estimated model: $\hat{Y}_i = b_0 + b_1 X_i$,
  • meaning find the straight line that is "closest" to all the observed data points.

Fitting the model: Method of least squares
• What is the "closest" straight line to all observed data points?
• For the observations $(X_i, Y_i)$ for each case, we consider the deviation of $Y_i$ from its expected value: $Y_i - (\beta_0 + \beta_1 X_i)$
• The method of least squares considers the sum of the $n$ squared deviations:
  $Q = \sum_{i=1}^{n} \left( Y_i - (\beta_0 + \beta_1 X_i) \right)^2$
• The estimators of $\beta_0$ and $\beta_1$ are the values $b_0$ and $b_1$ that minimise $Q$ given the observation pairs $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$.

Properties of LS estimators
• Unbiased: $E[b_0] = \beta_0$, $E[b_1] = \beta_1$
• Minimum variance
  • More precise/efficient than any other unbiased linear estimator (Gauss-Markov theorem).

More than fitting a model: what needs to be considered in real practice?
• What is your question of interest?
  • Statistical formulation of the question.
• Source of the data
  • Sample size, data cleaning such as combining data from different sources, checking for missing data, data mining (too many variables)
• Exploratory Data Analysis
  • Summary statistics, boxplots, histograms, scatterplots, etc.
• What model should be used?
  • Linear/non-linear, simple regression/multiple regression
• Fitting a model is the easy part.
• Consider the appropriateness of the model.
  • Ensure the assumptions are met.
  • Run model diagnostics to check validity and significance.
  • Apply remedies for violations of assumptions.
• Finally, make inferences and predictions.

More than fitting a model: what needs to be considered in real practice?
• Read Ch 1.1-1.6 of the textbook.
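
The least squares idea described above can be sketched in R on simulated data. This is a minimal illustration, not course code: the true parameter values, the noise level, the sample size and the seed are all assumptions. The closed-form expressions used for b0 and b1 are the standard SLR least squares solution, and lm() is R's built-in model-fitting function.

```r
# Illustrative sketch: least squares for SLR on simulated data.
# beta0 = 1, beta1 = 0.5, sigma = 1 and n = 100 are assumed values.
set.seed(2025)
n     <- 100
beta0 <- 1
beta1 <- 0.5
x <- runif(n, min = 0, max = 10)                      # covariate values X_i
y <- beta0 + beta1 * x + rnorm(n, mean = 0, sd = 1)   # Y_i = beta0 + beta1 * X_i + error

# Sum of squared deviations for a candidate intercept and slope
Q <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Closed-form least squares estimates (the values that minimise Q)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# The same estimates from R's built-in linear model function
fit <- lm(y ~ x)
coef(fit)

# Q at the least squares estimates is no larger than Q at the true parameters
Q(b0, b1) <= Q(beta0, beta1)
```

The estimates b0 and b1 will generally differ from the assumed true values because of the random error, but they agree with the coefficients reported by lm().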