Date: 2021-10-03

MATH263: Statistical Theory and Methods 1
Lecture Notes

Contents

1 Distributions, Means and Variances
   1.1 Revision: random variables, distribution functions, normal distribution
   1.2 Mean and variance
   1.3 Chi-squared distribution
   1.4 Student t-distribution
   1.5 F-distribution
   1.6 Notation of critical values and tables
2 Estimators and Sampling Distributions
3 Confidence intervals for a population mean
   3.1 Confidence interval for µ when variance is known
   3.2 Confidence interval for µ when variance is unknown
   3.3 Confidence intervals for large samples, robustness
4 The Concept of Hypothesis Testing
   4.1 Hypothesis testing – terminology and key steps
   4.2 Using critical values for hypothesis testing
   4.3 Type I and Type II errors
5 Tests for a population mean
   5.1 Z-test
   5.2 t-test
6 More on Hypothesis Testing
   6.1 Power and Sample Size Calculation
   6.2 Practical Significance vs Statistical Significance
   6.3 Confidence Intervals and Hypothesis Testing
7 Two Sample Tests for Means
   7.1 Paired t-test
   7.2 Independent two sample tests
      7.2.1 Known common standard deviation – independent two-sample Z-test
      7.2.2 Unknown common standard deviation – independent two-sample t-test
   7.3 Paired versus independent tests
8 Inference for Variances
   8.1 Inference for variance of a single population
   8.2 Comparison of two population variances
9 Normal probability plot and normality test
10 Comparison of k population means: ANOVA
   10.1 Introducing ANOVA Model
   10.2 ANOVA Hypothesis Test
   10.3 ANOVA Examples and Terminology
      10.3.1 ANOVA Example 1
      10.3.2 Terminology and further notes
      10.3.3 ANOVA Example 2
   10.4 ANOVA Assumptions and How to Check Them
11 Post Hoc Tests
   11.1 Fisher's Least Significant Differences
   11.2 Multiple tests
   11.3 Bonferroni correction
   11.4 Tukey's HSD
12 Simple Linear Regression
   12.1 Simple Linear Regression Model and Estimating Parameters
   12.2 Distributions of β̂0, β̂1
   12.3 Confidence Intervals for β0, β1
   12.4 Confidence Intervals for a Mean Response and Prediction Intervals
   12.5 Hypothesis Testing in Linear Regression
   12.6 Model assessment
   12.7 Multiple Linear Regression and Model Transformations
13 Non-parametric Tests
   13.1 The Sign test
   13.2 Wilcoxon signed rank test
   13.3 Wilcoxon rank sum test (Mann-Whitney U-test)
   13.4 Parametric versus non-parametric tests

About

These notes are a printable version of the online text content in the MATH263 online learning on Möbius. Please note that these notes do not contain videos or quizzes; however, you can find references 'VIDEO' and 'QUIZ' at the points where they appear in the MATH263 online learning on Möbius. To watch the videos and to try the quizzes, please go to the MATH263 Canvas page, and then MATH263 Online Learning on Möbius.

Lecturer: Dr Alena Haddley – email: a.haddley@liverpool.ac.uk

Recommended books: See reading list on Canvas.

Recommendation: Complete the Open Statistics module on Canvas or revise all MATH163 material.

Please go to the MATH263 Canvas page for important information about the module, assessment, online learning for this module, etc.
1 Distributions, Means and Variances

1.1 Revision: random variables, distribution functions, normal distribution

In this section, we summarise concepts that should already be well known to you from previous years of your study. If you wish to revise these concepts in more detail, please go to the Open Statistics module, lesson Statistical Distributions.

Random variables

A random variable X is a variable whose value is a numerical outcome of a random event. There are two types of random variables:

• Discrete: The possible values of X are separated and individually distinct, e.g. X ∈ {0, 1, 2, . . .}.
• Continuous: The possible values of X come from some continuous set, e.g.
   – X ∈ [0, ∞),
   – X ∈ (−∞, ∞),
   – X ∈ (0, 1).

Example:
• Discrete random variables: the outcome of a roll of a 6-sided die, that is X ∈ {1, 2, 3, 4, 5, 6}; the number of heads in 3 flips of a coin, that is X ∈ {0, 1, 2, 3}; . . .
• Continuous random variables: the lifetime X of an electrical component with X ∈ [0, ∞); height of individuals; time spent studying; . . .

In this module, we will concentrate on continuous data. To display the distribution of continuous data, we draw a histogram. (Revise how to construct a histogram, for example in the Open Statistics module in lesson Describing and Collecting Data.)

Cumulative distribution function

Definition: For any random variable (r.v.) X, the cumulative distribution function F(x) is

   F(x) = Pr(X ≤ x).

Properties:
• 0 ≤ F(x) ≤ 1 for all x
• F(x) is non-decreasing
• lim_{x→−∞} F(x) = 0
• lim_{x→+∞} F(x) = 1

Probability density function

Definition: For a continuous random variable X, the probability density function f(x) is

   f(x) = dF/dx.

By the Fundamental Theorem of Calculus,

   F(x) = ∫_{−∞}^{x} f(u) du.

Properties of the pdf:
• f(x) ≥ 0
• ∫_{−∞}^{∞} f(x) dx = 1

Probabilities

For any a ≤ b,

   Pr(a ≤ X ≤ b) = Pr(X ≤ b) − Pr(X ≤ a) = F(b) − F(a) = ∫_{a}^{b} f(x) dx.

Probability = area under the pdf between a and b.

Note that for a continuous random variable X, Pr(a ≤ X ≤ b) = Pr(a < X < b).
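The identity Pr(a ≤ X ≤ b) = F(b) − F(a) = ∫_a^b f(x) dx can be illustrated numerically. The sketch below is an editorial addition, not part of the original notes: it uses the Exponential(1) distribution, whose pdf e^{−x} and cdf 1 − e^{−x} (for x ≥ 0) are standard facts, and approximates the integral of the pdf with a simple midpoint rule.

```python
import math

# Exponential(1): pdf f(x) = exp(-x), cdf F(x) = 1 - exp(-x), for x >= 0.
def pdf(x):
    return math.exp(-x)

def cdf(x):
    return 1.0 - math.exp(-x)

def prob_between(a, b, steps=100_000):
    """Midpoint-rule approximation of the area under the pdf between a and b."""
    h = (b - a) / steps
    return sum(pdf(a + (i + 0.5) * h) for i in range(steps)) * h

a, b = 1.0, 2.0
exact = cdf(b) - cdf(a)        # F(b) - F(a)
approx = prob_between(a, b)    # area under the pdf
print(round(exact, 4), round(approx, 4))   # 0.2325 0.2325
```

Both routes give the same probability, which is exactly the point of the identity: the cdf difference and the integral of the pdf are two descriptions of the same area.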
Normal distribution

The standard normal (Gaussian) pdf is

   φ(z) = (1/√(2π)) exp(−z²/2),   −∞ < z < ∞.

• Notation: Z ∼ N(0, 1)
• The standard normal pdf is
   – bell shaped,
   – symmetric about z = 0.

The standard normal cumulative distribution function (cdf) is

   Φ(z) = Pr(Z ≤ z) = ∫_{−∞}^{z} φ(u) du.

• Values of Φ(z) may be found in tables.
• For negative z values, we use Φ(−z) = 1 − Φ(z), e.g.
   Φ(−0.3) = 1 − Φ(0.3) = 1 − 0.6179 = 0.3821.

VIDEO (how to use the tables 'The Standard Normal Cumulative Distribution Function')
QUIZ

The general normal pdf is

   f(x) = (1/(σ√(2π))) exp(−(1/2)((x − µ)/σ)²),   −∞ < x < ∞,

where µ ∈ (−∞, ∞) and σ ∈ (0, ∞) are the parameters of the distribution.

• With µ = 0, σ = 1, we obtain the standard normal pdf.
• Notation: X ∼ N(µ, σ²)
• The normal pdf is
   – bell shaped,
   – symmetric about x = µ.

Linear transformations of the normal distribution

Theorem: If X ∼ N(µ, σ²) and Y = aX + b for constants a, b, then Y is also normally distributed,

   Y ∼ N(aµ + b, a²σ²).

Corollary: If X ∼ N(µ, σ²) and Z = (X − µ)/σ, then Z ∼ N(0, 1).

Proof: Left as an exercise. Hint: Use the theorem above with a = 1/σ and b = −µ/σ.

Example: For X ∼ N(2, 9), find Pr(1 < X < 4). We have µ = 2, σ² = 9, so setting Z = (X − 2)/3 gives Z ∼ N(0, 1), and

   Pr(1 < X < 4) = Pr((1 − 2)/3 < (X − 2)/3 < (4 − 2)/3)
      = Pr(−0.33 < Z < 0.67)
      = Φ(0.67) − Φ(−0.33)
      = Φ(0.67) − (1 − Φ(0.33))
      = 0.7486 − 1 + 0.6293
      = 0.3779.

QUIZ

1.2 Mean and variance

Definitions:

1. The population mean or expectation of a continuous random variable X is

   E[X] = ∫_{−∞}^{∞} x f(x) dx,

   where f(x) is the pdf of X.

2. The variance of a random variable X with mean µ = E[X] is

   Var[X] = E[(X − µ)²] = ∫_{−∞}^{∞} (x − µ)² f(x) dx.

3. The standard deviation (s.d.) is √Var[X].

Some properties of mean and variance: Let a, b, c be constants and X, Y be random variables. Then:

• E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx, where f(x) is the pdf of X.
• E[aX + bY + c] = aE[X] + bE[Y] + c
• Var[aX + b] = a²Var[X]
• If X, Y are independent, then E[XY] = E[X]E[Y].
• If X, Y are independent, then Var[X + Y] = Var[X] + Var[Y].
• Var[X] = E[X²] − (E[X])²

Mean of the normal distribution

We will show that for X ∼ N(µ, σ²), we have E[X] = µ and Var[X] = σ².

To show that E[X] = µ, we first show that E[Z] = 0 where Z ∼ N(0, 1). Using the definition of expectation, we have

   E[Z] = ∫_{−∞}^{∞} z φ(z) dz = ∫_{−∞}^{0} z φ(z) dz + ∫_{0}^{∞} z φ(z) dz.

Since the standard normal pdf is symmetric, we have φ(−z) = φ(z). So

   ∫_{−∞}^{0} z φ(z) dz = ∫_{−∞}^{0} z φ(−z) dz.

We substitute u = −z:

   ∫_{−∞}^{0} z φ(−z) dz = ∫_{∞}^{0} (−u) φ(u) (−du) = −∫_{0}^{∞} u φ(u) du,

and so

   E[Z] = ∫_{−∞}^{0} z φ(z) dz + ∫_{0}^{∞} z φ(z) dz = −∫_{0}^{∞} z φ(z) dz + ∫_{0}^{∞} z φ(z) dz = 0.

For X ∼ N(µ, σ²), we then have

   E[(X − µ)/σ] = 0 ⇒ E[X] = 0 × σ + µ ⇒ E[X] = µ.

Note: Since the pdf is symmetric about µ, the median of N(µ, σ²) is also at µ. The mode is µ too.

For the proof that Var[X] = σ², see Tutorial 1.

Properties of the normal distribution

For the normal distribution, approximately

• 68% of observed values lie within 1 standard deviation of the mean, that is Pr(µ − σ ≤ X ≤ µ + σ) ≈ 0.68,
• 95% lie within 2 standard deviations of the mean, that is Pr(µ − 2σ ≤ X ≤ µ + 2σ) ≈ 0.95,
• 99.7% lie within 3 standard deviations of the mean, that is Pr(µ − 3σ ≤ X ≤ µ + 3σ) ≈ 0.997.

1.3 Chi-squared distribution

Definition: If Z1, . . . , Zn are independent, identically distributed (i.i.d.) standard normal random variables, that is Zi ∼ N(0, 1) for all i, then the distribution of Y = Z1² + Z2² + · · · + Zn² is called the χ²-distribution with n degrees of freedom.

• Notation: Y ∼ χ²n
• The probability density function is

   f(x) = (1 / (2^{n/2} Γ(n/2))) x^{n/2 − 1} e^{−x/2},   x ≥ 0,

   where Γ denotes the gamma function, defined for x > 0 by Γ(x) = ∫_{0}^{∞} t^{x−1} e^{−t} dt.
• For Y ∼ χ²n, we have
   – E[Y] = n,
   – Var[Y] = 2n.

Proof:

   E[Y] = E[Σ_{i=1}^{n} Zi²] = Σ_{i=1}^{n} E[Zi²] = Σ_{i=1}^{n} Var[Zi] = Σ_{i=1}^{n} 1 = n

   Var[Y] = Var[Σ_{i=1}^{n} Zi²] = Σ_{i=1}^{n} Var[Zi²] = Σ_{i=1}^{n} (E[Zi⁴] − (E[Zi²])²) = . . .
(∗)

Note that in the line above we found that E[Zi²] = 1, and so we just need to find E[Zi⁴]. After using properties of expected values and some calculations (left as an exercise), we can find that E[Zi⁴] = 3. And so, we get

   (∗) . . . = Σ_{i=1}^{n} (3 − 1²) = Σ_{i=1}^{n} 2 = 2n.

• Degrees of freedom – in general, this is the number of independent pieces of information that are free to vary when estimating parameters. In this module, the definition is used in the following context: the degrees of freedom for an estimate of a parameter is equal to the number of independent observations minus the number of estimated parameters needed for the estimation of the parameter itself.

Example: An estimate of the variance σ² is the sample variance

   s² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)².

In this formula, there are n independent observations (X1, X2, . . . , Xn) and one estimate of a parameter, namely X̄, which estimates the population mean µ. So, the degrees of freedom of an estimate of variance is n − 1, where n is the number of observations.

1.4 Student t-distribution

Definition: If Z ∼ N(0, 1) and Y ∼ χ²n with Z and Y independent, then the distribution of

   T = Z / √(Y/n)

is called the t-distribution with n degrees of freedom.

• Notation: T ∼ tn
• The probability density function is

   f(x) = (Γ((n+1)/2) / (Γ(n/2) √(nπ))) (1 + x²/n)^{−(n+1)/2},   −∞ < x < ∞.

   – f(x) is symmetric about x = 0.
   – As n → ∞, tn → N(0, 1).

1.5 F-distribution

Definition: If X ∼ χ²m and Y ∼ χ²n with X and Y independent, then the distribution of

   F = (X/m) / (Y/n)

is called the F-distribution with (m, n) degrees of freedom.

• Notation: F ∼ Fm,n
• The probability density function is

   f(x) = (Γ((m+n)/2) / (Γ(m/2) Γ(n/2))) (m/n)^{m/2} x^{(m/2)−1} (1 + (m/n)x)^{−(m+n)/2},   x ≥ 0.

1.6 Notation of critical values and tables

For all of the above distributions, including the standard normal distribution, there are tables of critical values. The critical values are obtained from a specific distribution and are mainly used in hypothesis testing.
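For the standard normal distribution, right-tail critical values can be reproduced without tables using Python's standard library (an editorial sketch, not part of the notes; for the t, χ² and F-distributions one would need an external library such as scipy, which these notes do not assume):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)

def z_crit(p):
    """z(p): the value with probability p to the right under N(0, 1)."""
    return Z.inv_cdf(1 - p)

print(round(z_crit(0.025), 4))   # 1.96   (2.5% in the upper tail)
print(round(z_crit(0.026), 4))   # 1.9431 (the z(0.026) value quoted in the notes)
```

Inverting the cdf at 1 − p is exactly the "amount of probability to the right" convention used in this module's notation.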
In the following, we will discuss the notation used in this module and show how to use the tables for each of the distributions. The tables can be found on the MATH263 Canvas page.

For the χ², t, and F-distributions, we will use subscripts for degrees of freedom, and a number in brackets to represent the amount of probability to the right (in decimals). For example:

• t8(0.10) represents the value with 10% probability to the right, for the t-distribution with 8 degrees of freedom. From tables, we can find t8(0.10) = 1.397.
• χ²11(0.01) represents the value with 1% probability to the right, for the χ²-distribution with 11 degrees of freedom. From tables, we can find χ²11(0.01) = 24.72.
• F12,4(0.05) represents the value with 5% probability to the right, for the F-distribution with (12, 4) degrees of freedom. From tables, we can find F12,4(0.05) = 5.912.

For the standard normal distribution, the notation is just z with a number in brackets representing the amount of probability to the right. For example, z(0.026) represents the value with 2.6% probability to the right, for the standard normal distribution. From tables, we can find z(0.026) = 1.9431.

Note that tables often use various different notations, e.g. what we write as t8(0.10) may appear as t8(10) or t0.9(8).

4 VIDEOs (how to use the tables 'Percentage points of the standard normal distribution', 'Percentage points of the t-distribution', 'Percentage points of the χ²-distribution', 'Percentage points of the F-distribution')
QUIZ

2 Estimators and Sampling Distributions

Introduction to Statistical Inference

When we collect a set of data (a sample) from a population, we may want to infer conclusions about the population based on this sample, and assess in terms of probability how confident we can be about our conclusions. This is known as statistical inference. The two most common types of statistical inference are:

1. Confidence intervals
2.
Hypothesis tests
   • Parametric (based on the normal distribution)
   • Non-parametric (not based on the normal distribution)

When drawing these conclusions about a population, we look at parameters that describe it. Parameters are values that describe properties of the whole population. Parameters are usually not known, since most of the time it is impossible to gather information about the whole population. We can estimate population parameters by taking a random sample from the population and finding an estimator.

Estimators and Sampling Distributions

Definition: A function of the random sample {X1, . . . , Xn} which is used to estimate the value of an unknown parameter is called an estimator.

Example: The population mean µ is an unknown parameter. We may use the sample mean X̄ = (1/n) Σ_{i=1}^{n} Xi as an estimator of µ.

Notice that an estimator is itself a random variable, with a distribution of its own, called the sampling distribution. The sampling distribution shows how an estimator would change if we repeatedly produced random samples. It is important to keep in mind that both types of inference (confidence intervals and hypothesis tests) are based on the sampling distributions of estimators, and so they report probabilities that state what would happen if we used the inference method many times.

VIDEO (about random sampling)

Sampling distribution of the sample mean

Let us select a random sample {X1, . . . , Xn} from a population with mean µ and variance σ². That is, X1, . . . , Xn are independent, identically distributed random variables with E[Xi] = µ and Var[Xi] = σ². Then the expected value of X̄ is

   E[X̄] = E[(X1 + · · · + Xn)/n] = (E[X1] + · · · + E[Xn])/n = (µ + · · · + µ)/n = nµ/n = µ.

So, the expectation of the sample mean is the population mean. Now let us compute the variance:

   Var[X̄] = Var[(X1 + · · · + Xn)/n] = (Var[X1] + · · · + Var[Xn])/n² = (σ² + · · · + σ²)/n² = nσ²/n² = σ²/n.

So, the standard deviation of X̄ is s.d.
(X̄) = σ/√n, which is called the standard error, s.e.(X̄).

Notes about the sampling distribution of X̄:

• The larger n is, the smaller the standard error is, and the better X̄ is as an estimator of µ.
• The shape of the distribution of X̄ depends on the distribution of the Xi.
• Important special case: If the Xi are normally distributed, then so is X̄. That is, if Xi ∼ N(µ, σ²) (independently), then X̄ ∼ N(µ, σ²/n).
• However, even if the Xi have some other distribution, for large n the distribution of X̄ is approximately normal. This is known as the Central Limit Theorem, and it is one of the most famous facts of probability theory.

Central Limit Theorem

Let us draw a random sample of size n from any population distribution with mean µ and finite variance σ². When the sample size n is large, the sampling distribution of X̄ is approximately normal:

   X̄ ∼ N(µ, σ²/n) approximately.

The bigger n is, the better the approximation. How big n has to be for a good approximation depends on how close the shape of the population distribution is to the normal distribution; a larger n is required if the shape of the population distribution is far from normal. Usually, in practice and in this module, sample sizes n ≥ 30 are considered large.

VIDEO (Central Limit Theorem in action)

Unbiased estimators

Definition: Suppose θ̂ is an estimator of a parameter θ. Then θ̂ is called unbiased if E[θ̂] = θ. Otherwise, we say θ̂ is biased. The value of E[θ̂] − θ is called the bias of θ̂.

Example: For an (independent) random sample {X1, . . . , Xn} with E[Xi] = µ and Var[Xi] = σ², we have shown that E[X̄] = µ.

• Since E[X̄] = µ, X̄ is an unbiased estimator of µ.
• In Tutorial 1, you will show that E[s²] = σ², where

   s² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²

   is the sample variance. That is, s² is an unbiased estimator of σ².
• Note that the sample standard deviation s is not unbiased for σ, since

   Var[s] = E[s²] − (E[s])² > 0 ⇔ (E[s])² < E[s²] ⇔ E[s] < √(E[s²]) = σ,

   and so E[s] ≠ σ.
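The two facts above — s.e.(X̄) = σ/√n and the approximate normality of X̄ for large n — can be illustrated by simulation. This sketch is an editorial addition using only the Python standard library: it draws many samples from a decidedly non-normal population, Exponential(1) (which has µ = 1 and σ = 1), and compares the empirical standard deviation of the sample means with σ/√n.

```python
import random
import statistics

random.seed(1)

n = 40           # sample size
reps = 20_000    # number of repeated samples

# Exponential(1) population: mean mu = 1, standard deviation sigma = 1.
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

emp_se = statistics.stdev(means)   # empirical s.d. of the sample mean
theory_se = 1.0 / n ** 0.5         # sigma / sqrt(n)
print(round(emp_se, 3), round(theory_se, 3))   # both ≈ 0.158
```

A histogram of `means` would also look close to bell shaped, even though the population itself is heavily skewed — the Central Limit Theorem in action.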
3 Confidence intervals for a population mean

Suppose {X1, . . . , Xn} is a random sample from a population following N(µ, σ²). That is, X1, . . . , Xn are independent, identically distributed random variables with Xi ∼ N(µ, σ²) for each i. We know that

• X̄ is an unbiased estimator of µ,
• X̄ ∼ N(µ, σ²/n).

We want to improve on the point estimate X̄ by including its variability and hence constructing a confidence interval for µ. The confidence interval for µ, and many other confidence intervals, are of the form

   estimate ± margin of error.

The margin of error reflects how accurate we believe our estimate is, based on the variability of the estimate. The confidence interval tells us how confident we are that the procedure will capture the true population mean µ.

There are two types of confidence intervals for a population mean, based on whether the population variance σ² is known or unknown.

3.1 Confidence interval for µ when variance is known

When σ² is known, we have that X̄ ∼ N(µ, σ²/n), and so

   (X̄ − µ)/(σ/√n) ∼ N(0, 1).

We will now derive a 95% confidence interval for µ when the variance is known. For the 95% confidence interval, what we want is an interval [X̄ − ε, X̄ + ε] containing µ, where the probability of producing this interval is 95% and where ε is the margin of error. Mathematically written, we have

   Pr(X̄ − ε < µ < X̄ + ε) = 0.95.

That is,

   Pr(µ − ε < X̄ < µ + ε) = 0.95
   ⇒ Pr(−ε/(σ/√n) < (X̄ − µ)/(σ/√n) < ε/(σ/√n)) = 0.95
   ⇒ Pr(−k < Z < k) = 0.95,

where Z ∼ N(0, 1) and k = ε/(σ/√n). From normal tables, we can find that k = 1.96, and so

   ε/(σ/√n) = 1.96 ⇒ ε = 1.96 σ/√n.

The 95% confidence interval (CI) is

   [X̄ − 1.96 σ/√n, X̄ + 1.96 σ/√n].

VIDEO (explaining the details of the calculations above)

Because we have assumed that σ is known, everything in the above expression can be computed from the sample: X̄, n, σ are all known.
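The table value k = 1.96 used above can be recovered from the standard normal cdf without tables (an editorial check using Python's standard library):

```python
from statistics import NormalDist

Z = NormalDist()                    # N(0, 1)
k = Z.inv_cdf(0.975)                # Pr(Z < k) = 0.975 leaves 2.5% in each tail
print(round(k, 2))                  # 1.96
print(round(Z.cdf(1.96) - Z.cdf(-1.96), 4))   # 0.95
```

The second line confirms the defining property Pr(−k < Z < k) = 0.95.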
Notes about confidence intervals:

• The CI is a random interval, since a different sample (from the same population) will give a different interval.
• The general meaning of a 95% CI is: if one repeatedly performs the experiment, 95% of all samples give a CI which contains µ.
• 95% is called the confidence level.

Example: Suppose cholesterol levels in a population are known to be normally distributed with standard deviation σ = 30 and unknown mean µ (units of mg/dl). We randomly choose n = 20 people, measure their cholesterol levels and find the sample mean to be x̄ = 185. Find the 95% confidence interval for µ.

Note that the point estimate of µ is µ̂ = 185, which is the value of x̄. (In general, estimates of parameters are denoted with a hat ˆ.) Using the above formula for the 95% CI for µ, we have that in our case the 95% CI for µ is

   185 ± 1.96 × 30/√20 = 185 ± 13.148 = [171.852, 198.148].

Interpretation of a specific confidence interval

We have found that the 95% confidence interval for our specific sample is [171.852, 198.148]. It is important to understand that once we draw a particular sample and find the confidence interval for this sample, there is no randomness involved. So, there are only two possibilities for this interval:

• The interval [171.852, 198.148] contains the true µ.
• The interval [171.852, 198.148] does not contain the true µ.

We cannot know whether our sample is one of the 95% of samples whose interval contains µ, or one of the unlucky 5% whose interval does not contain µ.

When interpreting a specific CI, there is a conventional statement: 'we are 95% confident'. This is shorthand for saying that 'we produced this interval by a method that gives correct results for 95% of all possible samples'. So, to interpret the CI calculated above, we say: We are 95% confident that the population mean cholesterol level is between 171.852 mg/dl and 198.148 mg/dl.
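The interval above is straightforward to reproduce in code (an editorial sketch, standard library only; the function name `z_ci` is ours, not from the notes):

```python
import math

def z_ci(xbar, sigma, n, z=1.96):
    """95% CI for mu when sigma is known: xbar +/- z * sigma / sqrt(n)."""
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

lo, hi = z_ci(xbar=185, sigma=30, n=20)
print(round(lo, 3), round(hi, 3))   # 171.852 198.148
```

Passing a different critical value for `z` (e.g. 2.5758) gives the corresponding interval at another confidence level.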
Do not use other statements to interpret a specific CI, such as 'there is 95% probability' or '95% chance', etc., since these are common misinterpretations!

VIDEO (explaining the concept of confidence intervals, using a statistical applet)

Confidence intervals with other confidence levels

For a 99% CI, we would have Pr(−k < Z < k) = 0.99, and from tables we get k = 2.5758. So, the 99% CI is

   [X̄ − 2.5758 σ/√n, X̄ + 2.5758 σ/√n].

In general, the 100(1 − α)% CI is

   [X̄ − z(α/2) σ/√n, X̄ + z(α/2) σ/√n],

where z(α/2) represents the point with Pr(Z > z(α/2)) = α/2 for Z ∼ N(0, 1).

Notes about confidence levels and CIs:

• Higher confidence makes the interval wider (for the same sample), e.g. the 95% CI uses critical value 1.96 while the 99% CI uses 2.5758, making the 99% CI wider than the 95% CI.
• So, to be more confident that we have included the true value of µ, we need to widen the interval.

Confidence intervals and sample size

• Increasing the sample size makes the interval narrower.
• That is, collecting more information allows us to be more precise about the value of µ.

Example: Suppose cholesterol levels in a population are known to be normally distributed with standard deviation σ = 30 and unknown mean µ. Find what the sample size would have to be if we want a 95% confidence interval of width 10.

The interval must be [X̄ − 5, X̄ + 5], so the sample size n must satisfy

   z(0.025) × σ/√n = 1.96 × 30/√n = 5 ⇒ n = (1.96 × 30/5)² = 138.2976.

Since the sample size must be a whole number at least this large, we round up: we would have to sample 139 people.

QUIZ

3.2 Confidence interval for µ when variance is unknown

More realistically, we usually don't know the value of σ, and we have to estimate it. We know that the sample variance

   s² = (1/(n − 1)) Σ_{i=1}^{n} (Xi − X̄)²

is an unbiased estimator of σ². It can be shown that

• X̄ ∼ N(µ, σ²/n) – this was shown before,
• (n − 1)s²/σ² ∼ χ²n−1 – this will be shown later,
• X̄ and s² are independent random variables – showing this is outside the syllabus of this module.
Using the above and the definition of the t-distribution, we get

   ((X̄ − µ)/(σ/√n)) / √(((n − 1)s²/σ²) / (n − 1)) ∼ tn−1 ⇒ (X̄ − µ)/(s/√n) ∼ tn−1.

Using a procedure similar to the derivation of the confidence interval for µ when σ is known, we can deduce the confidence interval for µ when σ is unknown. The 100(1 − α)% CI is now

   [X̄ − tn−1(α/2) s/√n, X̄ + tn−1(α/2) s/√n],

where tn−1(α/2) represents the point with Pr(T > tn−1(α/2)) = α/2 for T ∼ tn−1.

Notes:

• The formula is the same as when σ was known, except that σ is replaced by s, which means that instead of N(0, 1) we have to use tn−1 distribution tables, to allow for the extra uncertainty.
• The estimated standard error of X̄ is e.s.e.(X̄) = s/√n.
• As the sample size gets bigger, the CI gets narrower, converging to the case of known variance.

Example: Given n = 9, x̄ = 3.12, s² = 1.37, find the 95% confidence interval for µ.

To find the 95% CI, first we use t-tables to find

   tn−1(α/2) = t9−1(0.05/2) = t8(0.025) = 2.306,

and then the 95% CI is

   [X̄ − tn−1(α/2) s/√n, X̄ + tn−1(α/2) s/√n] = [3.12 − 2.306√(1.37/9), 3.12 + 2.306√(1.37/9)]
      = [3.12 − 0.8997, 3.12 + 0.8997] = [2.2203, 4.0197].

Interpretation: We are 95% confident that the population mean is between 2.2203 and 4.0197.

QUIZ

3.3 Confidence intervals for large samples, robustness

Note that the above theory for CIs is for normally distributed random variables X1, . . . , Xn. However, the Central Limit Theorem says that for large enough samples, X̄ is approximately normally distributed, for any population distribution with finite mean and variance. So, provided that n is large, the above CI formulae can be used whatever the distribution of X1, . . . , Xn. However, note that if X1, . . . , Xn are not normal, then the CI is only approximate; the approximation gets better as n gets bigger.
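The t-based interval from the example above can be reproduced in code. Since the Python standard library has no t-distribution, this editorial sketch hard-codes the table value t8(0.025) = 2.306 quoted in the notes (an external library such as scipy could compute it instead):

```python
import math

def t_ci(xbar, s2, n, t_crit):
    """100(1-alpha)% CI for mu when sigma is unknown: xbar +/- t * s / sqrt(n)."""
    margin = t_crit * math.sqrt(s2 / n)   # s / sqrt(n) = sqrt(s^2 / n)
    return xbar - margin, xbar + margin

lo, hi = t_ci(xbar=3.12, s2=1.37, n=9, t_crit=2.306)   # t8(0.025) from tables
print(round(lo, 4), round(hi, 4))   # 2.2203 4.0197
```

Note that only the critical value changes relative to the known-variance case: the structure "estimate ± critical value × standard error" is identical.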
We say that the CI formulae are robust to non-normality. This means that even if the normality assumption is not satisfied, provided n is sufficiently large (n ≥ 30), the CI formulae will still give us approximately correct results.

Robustness will be discussed for hypothesis tests in the following chapters. In general, robustness means that even though the assumptions of a test are not exactly fulfilled, provided n is sufficiently large (n ≥ 30), the distribution of the test statistic is approximately correct and so the results of the test should be approximately correct.

4 The Concept of Hypothesis Testing

4.1 Hypothesis testing – terminology and key steps

The goal of hypothesis testing is to assess the evidence provided by the data in favour of some claim about the population parameters. A hypothesis is a statement about a population parameter. The first step of hypothesis testing is to formulate the hypothesis statements – the null hypothesis and the alternative hypothesis.

The null hypothesis, denoted H0, is a statement of no effect or no difference. When performing the hypothesis test, we assess the strength of evidence against the null hypothesis.

The alternative hypothesis, denoted H1, is the statement that we suspect is true instead of H0. If there is evidence against H0, then we may conclude that there is evidence for H1 as the statement explaining our parameter. H1 is usually a statement of a difference.

Important note: H0 and H1 must be stated in terms of population parameters, not their estimators, since hypotheses always refer to some population or model, not a particular outcome.

Example:
   H0: the mean of the population, µ, is equal to 10, that is µ = 10,
   H1: µ is different from 10, that is µ ≠ 10.

There are different types of alternative hypothesis – one-sided and two-sided.

A one-sided or one-tailed test – the population parameter differs from its hypothesised value in a specific direction, that is, the parameter is either larger or smaller than the hypothesised value.
A two-sided or two-tailed test – the population parameter is different from the hypothesised value; the direction is not specified.

Example: If H0 : µ = 10, then

• for a one-sided test, we could have H1 : µ > 10 or H1 : µ < 10,
• for a two-sided test, we have H1 : µ ≠ 10.

The information needed to determine whether H1 is one-sided or two-sided depends on the context and is given in the research question. For example, if we want to test a new drug to see if it is better than the old drug, then a one-sided test would be appropriate. H1 should always be formulated before we have collected or seen the data. It is not appropriate to first look at the data and then formulate H1 to fit your data! If a particular direction for H1 is not clear in advance, then you should use a two-sided alternative.

Example: Formulate the null and alternative hypotheses for the following research question: Is the average monthly rent of a one-bedroom flat in Liverpool more than £600?

Denote the mean monthly rent of all one-bedroom flats in Liverpool by µ. We are comparing µ to the value £600, and we are interested in whether µ is larger than £600, so it is a one-sided test. Our hypotheses are:

   H0 : µ = 600, H1 : µ > 600.

QUIZ

In order to determine how strong our evidence against H0 is, we first calculate a test statistic. The test statistic is a random variable with a known distribution and will depend on the type of statistical test we are carrying out. In practice, we collect the data and calculate the value of the test statistic for the specific data sample, using a formula.

Example: The test statistic for one of the most common tests, the Z-test, is given by

   Z = (X̄ − µ0)/(σ/√n),

where µ0 is the value of the hypothesised population mean. It can easily be seen that, under H0, Z ∼ N(0, 1).

The test statistic is based on the statistic that estimates the parameter that appears in the hypotheses; e.g. in the example above, X̄ estimates µ, whose hypothesised value is µ0.
When H0 is true, we expect the value of the estimate to be close to the parameter value specified by H0. If the value of the estimate is far from the parameter value specified by H0, this gives evidence against H0. To assess how far the estimate is from the parameter, we standardise the estimate. So, test statistics are commonly of the form

   (estimate − hypothesised value) / (standard deviation of the estimate).

The test statistic is used for the probability calculation which indicates the strength of evidence against H0, the so-called p-value.

Definition of p-value: The p-value of a test is the probability, assuming H0 is true, that we would obtain a value of our test statistic as extreme or more extreme than the one we have observed. Extreme here means 'far from what we expect', assuming H0 is true. The direction counting as 'far from what we expect' is determined by H1.

The smaller the p-value we get, the stronger the evidence against H0 given by our data.

In many disciplines, the p-value is then compared to the significance level of the test. The significance level, denoted by α, can be thought of as an evidence threshold.

General rules (for all hypothesis tests):

• If the p-value is less than or equal to this threshold, that is p ≤ α, then we reject the null hypothesis H0 at significance level α, and there is evidence to support the alternative hypothesis H1, based on our data.
• If p > α, then we fail to reject H0 at significance level α, and there is not enough evidence to support H1, based on our data.

When we reject H0 at α, we say the result is statistically significant at this significance level. Most commonly the significance level is set at 0.05, that is, 5%.

QUIZ

The key steps of a hypothesis test are:

1. State the null and alternative hypotheses.
2. Calculate the value of the test statistic.
3. Find a p-value for the test.
4. Decide whether to reject or not to reject the null hypothesis at the significance level α.
5. Interpret your conclusions practically in the context of the question.
Important notes: When we reject H0, it does not mean that H0 is not true; it just means that, based on our data, we found evidence against H0 and so, in turn, we have evidence to support H1. Similarly, if we fail to reject H0, it does not make H0 necessarily true; it just means that, based on our data, we do not have enough evidence against H0. When interpreting results practically, we usually do so with respect to the alternative hypothesis. (See the example below.)

Example: Cholesterol level in a particular population was known to have a mean value µ = 170 in the past. We also know that the population standard deviation of the cholesterol level is σ = 30. A random sample of n = 20 people was taken and their sample mean was 185. Perform a Z-test to determine whether there is evidence that the population mean has increased. Test at the 5% significance level.

First, we state the null and alternative hypotheses. We are interested in whether the population mean of cholesterol level, µ, has increased; that is, we have a specific direction for our alternative, and so it is a one-sided test.

H0 : µ = 170
H1 : µ > 170

Now we compute the value of the test statistic for the Z-test, for our data:

Z = (X̄ − µ0) / (σ/√n) = (185 − 170) / (30/√20) = 2.236.

Now we compute the p-value for this test. The p-value is the probability of observing a value of Z as extreme or more extreme than the observed one, that is 2.236, assuming H0 is true. Since the direction in H1 is '>', the same direction appears in the calculation of the p-value. That means that we have

p = Pr(Z ≥ 2.236), where Z ∼ N(0, 1)
  = 1 − Pr(Z ≤ 2.236)
  ≈ 1 − Pr(Z ≤ 2.24)
  = 1 − 0.98745
  = 0.01255.

Since p = 0.01255 < 0.05, we reject H0 at the 5% level. We can conclude that there is evidence that the mean cholesterol level has increased from 170. The p-value for this test is illustrated in the picture below: it is the blue area.
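As an illustrative check (our own sketch, not part of the notes), this calculation can be reproduced with the Python standard library:

```python
from math import sqrt
from statistics import NormalDist

# Z-test for H0: mu = 170 vs H1: mu > 170, with sigma = 30 known and n = 20
z = (185 - 170) / (30 / sqrt(20))   # test statistic, about 2.236
p = 1 - NormalDist().cdf(z)         # upper-tail p-value Pr(Z >= z)
```

The exact p-value here is about 0.0127, slightly different from the 0.01255 obtained above because the tables round 2.236 to 2.24.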
VIDEO (understanding p-values)

In practice, when performing statistical tests, we use statistical software packages, such as R, SPSS, Minitab, and others. These packages calculate the exact p-value automatically. However, there are methods which can be used to decide whether to reject or not to reject H0 at some level α without finding the exact p-value. We will discuss these methods in the following section.

Even though in many disciplines the significance level α is decided in advance, we do not really need to decide on α in advance. We can just look at the p-value and see how big it is. Then, based on the following rules, we can determine the strength of evidence against H0 and so, in turn, the strength of evidence to support H1:

• If p > 0.10, there is very little evidence against H0, or we can say there is no evidence against H0.
• If 0.05 < p ≤ 0.10, there is weak evidence against H0.
• If 0.01 < p ≤ 0.05, there is evidence against H0.
• If 0.001 < p ≤ 0.01, there is strong evidence against H0.
• If p ≤ 0.001, there is very strong evidence against H0.

4.2 Using critical values for hypothesis testing

As mentioned above, there are methods which can be used to decide whether to reject or not to reject H0 at some level α without finding the exact p-value. A traditional method of hypothesis testing uses a table of known critical values. The test statistic follows a particular distribution, so the value of the test statistic, based on a particular data sample, is compared to critical values obtained from this distribution. The critical values can be found in the tables for the appropriate distribution. These tables lay out the values that the specified distribution exceeds with a given tail probability, corresponding to a particular significance level.
We reject the null hypothesis H0 at the significance level α, depending on H1, following these rules (for most parametric tests):

• If H1: parameter < hypothesised value, then we reject H0 at level α if the value of the test statistic is less than (or equal to) the critical value at α for the lower tail of the relevant distribution.
• If H1: parameter > hypothesised value, then we reject H0 at level α if the value of the test statistic is greater than (or equal to) the critical value at α for the upper tail of the relevant distribution.
• If H1: parameter ≠ hypothesised value, then we reject H0 at level α if the value of the test statistic is less than (or equal to) the critical value at α/2 for the lower tail of the relevant distribution, or greater than (or equal to) the critical value at α/2 for the upper tail.

Note that for a two-sided test, values of the test statistic in either direction count as evidence against the null hypothesis. That is why we look at the critical values in both tails of the distribution, and hence use half of the significance level in each tail.

Note that every test uses a different test statistic, and so the above general rejection rules may differ slightly for some tests. The rules will be stated specifically for every main hypothesis test covered in this module. A derivation of the above rules will be explained for a specific test, the Z-test, covered later.

VIDEO (examples – using critical values for hypothesis testing)

QUIZ

4.3 Type I and Type II errors

When performing a hypothesis test, there are two possible decisions:
• we reject H0, or
• we fail to reject H0.

Remember that we make an inference about population parameters using a particular sample, and we do not know the population parameters. We usually do not know whether our inference is correct or not. In reaching a decision, we may make two types of errors:

• Type I error: We reject H0 when H0 is in fact true.
The probability of a Type I error is denoted by α:

α = Pr(Type I error).

• Type II error: We fail to reject H0 when H0 is in fact false. The probability of a Type II error is denoted by β:

β = Pr(Type II error).

The summary of possibilities is in the table below.

                     H0 is true         H0 is false
Reject H0            Type I error       Correct decision
Fail to reject H0    Correct decision   Type II error

The probability of a Type I error, α, is the significance level. The probability of correctly rejecting H0 is called the power and is calculated as

Power = 1 − β.

Ideally, we want to make both α and β small, but this can only be done by increasing the sample size. The Type I error is considered primary, so α is usually fixed at some small value (such as 0.05). Then we try to make β as small as possible, that is, maximise the power of the test. There is a relationship between α and β: with the sample size fixed, when α decreases, β will increase. More on power (and α, β) will be discussed later.

Example: A new low-cost diagnostic test has been developed that claims to diagnose a particular medical condition in its early stages. The patient is assumed not to have the condition unless there is evidence from the test to contradict this. We will discuss Type I and Type II errors in this context and their consequences.

First, we formulate the hypotheses:
H0: The patient does not have the condition.
H1: The patient has the condition.

A Type I error is when the condition is diagnosed in a patient who does not have the condition. This would result in an expensive treatment being administered, as well as any possible side effects of such treatment. Note that a Type I error is sometimes referred to as a false positive.

A Type II error is when the condition is not diagnosed in a patient who has the condition. This could result in the condition progressing and possibly going unnoticed. Note that a Type II error is sometimes referred to as a false negative.
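The tradeoff noted above (with n fixed, lowering α raises β) can be illustrated numerically. The sketch below is our own, using the cholesterol numbers that appear in the examples of this chapter (µ0 = 170, true mean µ1 = 175, σ = 30, n = 20):

```python
from math import sqrt
from statistics import NormalDist

# Illustrative sketch: beta for an upper-tail Z-test of H0: mu = mu0,
# when the true mean is mu1 (defaults borrowed from the cholesterol examples).
def beta(alpha, mu0=170, mu1=175, sigma=30, n=20):
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    cutoff = mu0 + z_alpha * sigma / sqrt(n)              # reject H0 when X-bar > cutoff
    return NormalDist(mu1, sigma / sqrt(n)).cdf(cutoff)   # Pr(fail to reject | mu1)

# With the sample size fixed, decreasing alpha increases beta:
# beta(0.05) is about 0.816, while beta(0.01) is about 0.943.
```

Equivalently, the power 1 − β drops from about 0.184 to about 0.057 as α is lowered from 0.05 to 0.01.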
5 Tests for a population mean

Tests for a population mean are used to test whether the population mean, µ, has a specified value µ0. The underlying assumption for these tests is that the population is normally distributed. There are two types of tests for a population mean, depending on whether the population variance σ² is known or not.

• If σ² is known, then we use the Z-test.
• If σ² is unknown, then we use the t-test.

Both tests will be discussed in the following sections.

5.1 Z-test

Assumption: The population is normally distributed.

Use: To test whether the population mean, µ, has a specified value µ0, when the population variance σ² is known.

To carry out a Z-test for a population mean we use the following steps:

1. State the hypotheses: the null hypothesis H0 : µ = µ0 and the alternative hypothesis H1 : µ < µ0 for the lower tail, H1 : µ > µ0 for the upper tail, or H1 : µ ≠ µ0 for the two-tailed alternative.

2. Calculate the test statistic

   Z = (X̄ − µ0) / (σ/√n).

3. Compare the value of the test statistic to the critical value in the standard normal distribution tables for the relevant significance level, remembering to halve the significance level for a two-tailed test. Alternatively, compute the p-value.

4. Decide whether to reject or not to reject the null hypothesis at level α. Remember that Z ∼ N(0, 1), and so we use critical values z(α) or z(α/2) as follows:
   • For the alternative H1 : µ < µ0, we reject H0 at level α if Z < −z(α).
   • For the alternative H1 : µ > µ0, we reject H0 at level α if Z > z(α).
   • For the alternative H1 : µ ≠ µ0, we reject H0 at level α if Z < −z(α/2) or Z > z(α/2); in other words, if |Z| > z(α/2).
   If we calculate the exact p-value, then, of course, we use the general rules: if p ≤ α, we reject H0 at level α; otherwise we fail to reject H0 at level α.

5. Interpret your conclusion practically in the context of the question.
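The steps above can be sketched in code. This is our own illustration (not part of the notes); the function name and interface are assumptions, and only the standard library is used:

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """Z-test for a population mean with known sigma.
    alternative is 'less', 'greater' or 'two-sided'; returns (Z, p-value)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    phi = NormalDist().cdf
    if alternative == "less":
        p = phi(z)                   # lower-tail probability Pr(Z <= z)
    elif alternative == "greater":
        p = 1 - phi(z)               # upper-tail probability Pr(Z >= z)
    else:
        p = 2 * (1 - phi(abs(z)))    # both tails
    return z, p
```

For instance, `z_test(185, 170, 30, 20, "greater")` recovers the one-sided cholesterol example (Z ≈ 2.236, p ≈ 0.013), and the default two-sided call doubles the tail probability.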
VIDEO (deriving rejection rules)

Example: Cholesterol level in a particular population was known to have a mean value µ = 170 in the past. We also know that the population standard deviation of the cholesterol level is σ = 30. A random sample of n = 20 people was taken and their sample mean was 185. Perform a Z-test to determine whether there is evidence that the population mean has increased. Test at the 5% significance level. You may assume that the cholesterol level is normally distributed.

Note that we solved this problem before, but this time we solve it without calculating the exact p-value; we use critical values.

First, we state the null and alternative hypotheses.

H0 : µ = 170
H1 : µ > 170

Now we compute the value of the test statistic for the Z-test, for our data:

Z = (X̄ − µ0) / (σ/√n) = (185 − 170) / (30/√20) = 2.236.

In the tables we look up the critical value at the 5% level, that is z(0.05) = 1.6449, and compare it with the value of the test statistic, 2.236. Since 2.236 > 1.6449, we reject H0 at the 5% level. Note that since our alternative is H1 : µ > 170, we used the rejection rule for H1 : µ > µ0, which is that we reject H0 at level α if Z > z(α). There is evidence that the mean cholesterol level has increased from 170.

Important note: In this module, you will not usually be given a significance level. When using tables of critical values, the aim is to find the lowest significance level at which we can reject H0 and the highest level at which we cannot reject H0. You do that by comparing the value of the test statistic with the relevant critical values. Once you identify these levels, you will know an interval which contains the p-value, and so you will be able to assess the strength of evidence according to the general rules. Always include the strength of evidence when interpreting your conclusion.

VIDEO (Example – when α not given)

Example: Cholesterol level in a particular population was known to have a mean value µ = 170 in the past.
We also know that the population standard deviation of the cholesterol level is σ = 30. A random sample of n = 20 people was taken and their sample mean was 185. Perform a Z-test to determine whether there is evidence that the population mean differs from 170. You may assume that the cholesterol level is normally distributed.

First, we state the null and alternative hypotheses. This time, we want to test whether the population mean of cholesterol level, µ, differs from 170. There is no specific direction for our alternative, and so it is a two-sided test.

H0 : µ = 170
H1 : µ ≠ 170

Now we compute the value of the test statistic for the Z-test, for our data (already computed previously):

Z = (X̄ − µ0) / (σ/√n) = (185 − 170) / (30/√20) = 2.236.

In the tables we look for critical values z. We want to find the lowest significance level at which we can reject H0 and the highest level at which we cannot reject H0. In the tables we find the critical value z(0.013) = 2.2262, which is smaller than the value of our test statistic 2.236, and the critical value z(0.012) = 2.2571, which is bigger than 2.236. Since 2.2262 < 2.236 < 2.2571, we can reject H0 at the 2.6% level, but not at the 2.4% level. (Do not forget that since this is a two-sided test, we need to multiply the significance level for the critical values by 2; e.g. since we use z(0.013), α is 2 × 1.3% = 2.6%.) This means that the p-value is in (0.024, 0.026), which is contained in (0.01, 0.05]. Hence we can conclude that there is evidence that the mean cholesterol level is different from 170.

As an exercise, try to calculate the exact p-value for the test.

QUIZ

5.2 t-test

Assumption: The population is normally distributed. The t-test is robust to non-normality. This means that if the population is not normally distributed, it is still OK to use the t-test, provided the sample size n ≥ 30.

Use: To test whether the population mean, µ, has a specified value µ0, when the population variance σ² is unknown.
To carry out a t-test for a population mean we use the following steps:

1. State the hypotheses: the null H0 : µ = µ0 and the alternative H1 : µ < µ0 for the lower tail, H1 : µ > µ0 for the upper tail, or H1 : µ ≠ µ0 for the two-tailed alternative.

2. Calculate the test statistic

   T = (X̄ − µ0) / (s/√n),

   where s is the sample standard deviation, that is s = √( (1/(n−1)) Σᵢ (Xi − X̄)² ). Note that since σ is unknown, an estimated value is used instead in the formula, namely the sample standard deviation s.

3. Compare the value of the test statistic to the critical value in the t-distribution tables for the relevant significance level, remembering to halve the significance level for a two-tailed test. Alternatively, compute the p-value.

4. Decide whether to reject or not to reject the null hypothesis at level α. As shown before, T ∼ t_{n−1}, and so we use critical values t_{n−1}(α) or t_{n−1}(α/2) as follows:
   • For the alternative H1 : µ < µ0, we reject H0 at level α if T < −t_{n−1}(α).
   • For the alternative H1 : µ > µ0, we reject H0 at level α if T > t_{n−1}(α).
   • For the alternative H1 : µ ≠ µ0, we reject H0 at level α if T < −t_{n−1}(α/2) or T > t_{n−1}(α/2); in other words, if |T| > t_{n−1}(α/2).
   If we calculate the exact p-value, then, of course, we use the general rules: if p ≤ α, we reject H0 at level α; otherwise we fail to reject H0 at level α.

5. Interpret your conclusion practically in the context of the question.

Note: If you are not given α, find the lowest level at which we can reject H0 and the highest level at which we cannot reject H0, following the rules in step 4 for using critical values.

Example: Cholesterol level in a particular population was known to have a mean value µ = 170 in the past. A random sample of n = 20 people was taken; their sample mean was 155.3 and sample standard deviation was 29.3. Perform an appropriate test to determine whether there is evidence that the population mean is different from 170. You may assume that the cholesterol level is normally distributed.
First, we state the null and alternative hypotheses.

H0 : µ = 170
H1 : µ ≠ 170

Note that since σ is not known, we use the t-test. We compute the value of the test statistic for the t-test, for our data:

T = (X̄ − µ0) / (s/√n) = (155.3 − 170) / (29.3/√20) = −2.244.

In the tables we look for the critical values of t_19. At the 5% level, we find t_19(0.025) = 2.093, and at the 2% level, we have t_19(0.01) = 2.539. Comparing these with the absolute value of the test statistic, |−2.244|, we have 2.093 < 2.244 < 2.539. This means that we can reject H0 at the 5% level, but not at the 2% level. So the p-value is in (0.02, 0.05). We can conclude that there is evidence that the mean cholesterol level is different from 170.

VIDEO (example)

QUIZ

6 More on Hypothesis Testing

6.1 Power and Sample Size Calculation

In this section we will demonstrate, through an example, how to calculate the power of a test.

Example: Cholesterol level in a particular population was known to be normally distributed with mean 170 and population standard deviation σ = 30. A random sample of 20 people was taken and their cholesterol level was measured. First, we want to determine how big the sample mean cholesterol level would have to be to conclude that there had been an increase in the mean cholesterol level of the population at the 5% significance level.

We state the hypotheses first. We know that µ = 170 and we are interested in when there is an increase in the mean cholesterol level of the population, which tells us that H1 is one-sided.

H0 : µ = 170
H1 : µ > 170

Note that the Z-test is required here, as σ is known. "To conclude that there had been an increase in the mean cholesterol level of the population at the 5% significance level" means, in other words, that we have to find when we can reject H0 at the 5% significance level.
We use the rejection rule for the given alternative, which is Z > z(α), and so in our case we reject H0 if

(X̄ − 170) / (30/√20) > z(0.05) = 1.6449,

that is, if

X̄ > 1.6449 × 30/√20 + 170 = 181.034.

The sample mean X̄ would have to be bigger than 181.034 in order to reject H0 at the 5% significance level.

Next we want to calculate the power of this test. The power is the probability of correctly rejecting H0; in other words, it is the probability that we reject H0 when H1 is true. Mathematically written, it is

Power = Pr(Reject H0 | H1)
      = Pr( X̄ > 181.034 | µ > 170 )
      = Pr( (X̄ − µ)/(30/√20) > (181.034 − µ)/(30/√20) | µ > 170 )
      = Pr( Z > (181.034 − µ)/(30/√20) | µ > 170 ), where Z ∼ N(0, 1).

However, we cannot work out this value, because the alternative hypothesis µ > 170 is not specific enough. So, the value of µ needs to be specified. For example, one can ask what the power of the test is to detect an increase of 5 mg/dl from 170. Now we know that the increase is 5, which means that H1 is specified to be µ = 175. So, we get

Power = Pr( Z > (181.034 − µ)/(30/√20) | µ = 175 )
      = Pr( Z > (181.034 − 175)/(30/√20) )
      = Pr(Z > 0.899)
      = 1 − Pr(Z < 0.899)
      ≈ 1 − Φ(0.90)
      = 1 − 0.8159
      = 0.1841.

The power of the test is 18.41%.

How will the power change if in fact µ = 181? Then we get

Power = Pr( Z > (181.034 − µ)/(30/√20) | µ = 181 )
      = Pr( Z > (181.034 − 181)/(30/√20) )
      = Pr(Z > 0.005)
      = 1 − Pr(Z < 0.005)
      ≈ 1 − Φ(0.01)
      = 1 − 0.5040
      = 0.496.

This time the power is 49.6%. So, we can observe that the power increases as µ gets further away from 170. This means that the larger the true population mean (above 170) is, the greater the probability of correctly rejecting H0.

VIDEO (details of the calculations in the example above)
VIDEO (relationship between α, β and power)
QUIZ

Example (sample size calculation): Cholesterol level in a particular population was known to be normally distributed with mean 170 and population standard deviation σ = 30.
The null hypothesis H0 : µ = 170 is to be tested against the alternative H1 : µ > 170 at the 5% significance level. What would the sample size have to be if we want a 75% probability of detecting a real increase of 5?

Since the hypotheses are stated, we start with the rejection rule for our case. We reject H0 at the 5% level if

(X̄ − 170) / (30/√n) > z(0.05) = 1.6449,

that is, if

X̄ > 1.6449 × 30/√n + 170.

We want a 75% probability of detecting a real increase of 5; that is, a power of 75% when µ = 175. If µ = 175, then X̄ ∼ N(175, 30²/n), so

Power = Pr(Reject H0 | µ = 175)
 0.75 = Pr( X̄ > 1.6449 × 30/√n + 170 | µ = 175 )
 0.75 = Pr( (X̄ − 175)/(30/√n) > 1.6449 − 5/(30/√n) | µ = 175 )
 0.75 = 1 − Φ( 1.6449 − 5√n/30 ).

We require

Φ(1.6449 − 5√n/30) = 0.25
Φ(5√n/30 − 1.6449) = 0.75
5√n/30 − 1.6449 = 0.6745
n = ( (1.6449 + 0.6745) × 30/5 )² = 193.666.

So, we would need a sample of 194 people. Note that algebraic manipulations and values from the tables were used in the above calculations. For a more detailed explanation of the calculations, watch the video on Möbius.

VIDEO (details of the calculations in the example above)

6.2 Practical Significance vs Statistical Significance

We will explore how the sample size affects statistical significance, and the consequences for practical meaning. Let us consider a two-sided Z-test, H0 : µ = µ0 versus H1 : µ ≠ µ0 for some specified value µ0. We know that we reject H0 if

|X̄ − µ0| / (σ/√n) > z(α/2)  ⇔  |X̄ − µ0| > σ × z(α/2) / √n.

Note that for a fixed significance level α, as n increases, the difference |X̄ − µ0| which is big enough for us to reject H0 becomes smaller. This means that when the sample size is large, even small differences of x̄ from µ0 can be statistically significant.
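The rejection threshold σ × z(α/2)/√n can be computed directly for any sample size; the sketch below is our own illustration, and the function name is an assumption:

```python
from math import sqrt
from statistics import NormalDist

def rejection_threshold(sigma, n, alpha=0.05):
    """Smallest |x-bar - mu0| that is statistically significant
    in a two-sided Z-test at level alpha."""
    z_half = NormalDist().inv_cdf(1 - alpha / 2)   # z(alpha/2); 1.96 for alpha = 0.05
    return z_half * sigma / sqrt(n)
```

The threshold shrinks in proportion to 1/√n, so quadrupling the sample size halves the difference needed for significance.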
Example: Given σ = 30 and α = 0.05, we get:

• For n = 20: we reject H0 if |X̄ − µ0| > 1.96 × 30/√20 = 13.148.
• For n = 320: we reject H0 if |X̄ − µ0| > 1.96 × 30/√320 = 3.287.

We can see that when n = 20, the difference between x̄ and µ0 must be more than 13.148 for us to reject H0. However, when n = 320, the difference between x̄ and µ0 needed to reject H0 becomes much smaller (3.287). So, with a very large sample, we may reject H0 when there is only a very small difference between the sample mean x̄ and the hypothesised population mean µ0. We are statistically confident that a difference exists, but it may be so small that it is of no practical importance. In other words, the difference may not be practically significant, even if it is statistically significant. When the difference is large enough to be meaningful in real life, we may conclude that the results are practically significant. However, it may be subjective to determine what is meaningful, and it may depend on the context.

Example: A hypothesis test was performed to determine whether there is a difference between women's shoe sizes and the actual foot sizes for which the specific shoe size fits comfortably. For a random sample of 10 000 women with UK shoe size 6 (which corresponds to 24.1 cm), the sample mean of foot sizes was 24.0 cm. The test indicated that there is a statistically significant difference between women's UK shoe size 6 and their actual foot size. However, this difference is not practically significant, because in real life we would comfortably wear size 6 even if it is slightly bigger than our actual foot size.

6.3 Confidence Intervals and Hypothesis Testing

Confidence intervals can be used for hypothesis testing in the case of two-sided tests. However, using confidence intervals for hypothesis testing is only practical when a significance level α is given in advance. In particular, a two-sided test at level α can be performed using a confidence interval with confidence level 1 − α.
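This rule can be written as a one-line check; a minimal sketch of our own (the function name is an assumption):

```python
def reject_via_ci(mu0, ci):
    """Two-sided test at level alpha via the 1 - alpha confidence interval:
    reject H0: mu = mu0 exactly when mu0 lies outside the interval."""
    lower, upper = ci
    return not (lower <= mu0 <= upper)
```

For example, a hypothesised mean lying outside a 95% confidence interval is rejected at the 5% level, while any value inside the interval is not.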
Example: Using the previous example about cholesterol level, where n = 20, x̄ = 185, σ = 30 (units of mg/dl), we want to test H0 : µ = 170 versus H1 : µ ≠ 170. We previously computed the 95% confidence interval: [171.852, 198.148]. Since the hypothesised value 170 is not in the confidence interval, we are 95% confident that µ is not equal to 170, and so we reject H0 at the 5% significance level (corresponding to 95% confidence). If we were to test H0 : µ = 172 versus H1 : µ ≠ 172, then we could not reject H0 at the 5% level, because 172 is in the 95% confidence interval for µ.

General rule: The 1 − α confidence interval consists of all values µ0 such that H0 : µ = µ0 is not rejected at level α in a two-sided test. And so, when the value µ0 is outside the 1 − α confidence interval, H0 : µ = µ0 is rejected at level α in a two-sided test.

QUIZ

7 Two Sample Tests for Means

Two-sample tests for means are used when we want to compare two samples to assess whether their population means differ. There are two main types of tests for the comparison of two means:

• paired test – used when samples are paired or matched, meaning that an observation in one sample has a corresponding value in the other sample. This often arises when there are two observations on the same individual, e.g. the same patient has blood pressure taken before and after taking a drug. Another situation is when observations are matched in pairs, e.g. heights of identical twins.
• independent two-sample test – used when samples are independent, meaning that observations in the two samples are not related in any way, e.g. heights of men and women.

We need to use the context of the experiment to determine whether the samples are paired or independent.

7.1 Paired t-test

In the following we set up the theoretical background for this test. Let us suppose n people have their blood pressure recorded before and after a treatment for hypertension.
We introduce the following notation:
• sample 'before': x1, . . . , xn,
• sample 'after': y1, . . . , yn,
• for each patient i, the change in blood pressure is di = yi − xi,
• the mean blood pressure before treatment is µx; the mean after is µy.

Our goal is to determine whether the treatment has succeeded in lowering blood pressure; that is, we test H0 : µy = µx versus H1 : µy < µx. Denoting µd = µy − µx, we want to test

H0 : µd = 0 versus H1 : µd < 0,

and this can be done using the (one-sample) t-test, by finding the mean and standard deviation of the sample of differences, denoted by d̄ and sd, and then using these in the test statistic

T = d̄ / (sd/√n) ∼ t_{n−1}.

Similarly, we could test against H1 : µd > 0 or H1 : µd ≠ 0.

The assumption for the paired t-test is: the differences d1, . . . , dn are independent of each other and are a sample from a normal distribution with mean µd = µy − µx and (unknown) variance σd². Note that this test is correct as long as the differences are normally distributed and independent of each other, even if the individual samples are not from a normal distribution.

To check this assumption, you can draw a histogram of the differences d1, . . . , dn and see whether the histogram has a shape similar to that of a normal distribution. However, this is an informal way to assess whether a sample comes from a normal distribution, and sometimes it is very difficult to judge this by just looking at the histogram. There exists a formal statistical test to assess this, and it will be discussed later.

As with the one-sample t-test, the paired t-test is robust to non-normality. This means that if the differences are not from a normal distribution, it is still OK to use the paired t-test, provided the sample size n ≥ 30.

Let us summarise how to conduct a paired t-test. To carry out a paired t-test we use the following steps:

1.
State the hypotheses: the null H0 : µd = 0, where µd is the mean of the population from which the differences are sampled, and the alternative H1 : µd < 0, H1 : µd > 0, or H1 : µd ≠ 0.

2. Calculate the differences between each pair of observations, along with the mean and standard deviation of these differences, d̄ and sd.

3. Calculate the test statistic

   T = d̄ / (sd/√n).

4. Compare the value of the test statistic to the critical value in the t-distribution tables for the relevant significance level, remembering to halve the significance level for a two-tailed test. Alternatively, compute the p-value.

5. Decide whether to reject or not to reject the null hypothesis at level α. As shown before, T ∼ t_{n−1}, and so we use critical values t_{n−1}(α) or t_{n−1}(α/2) as follows:
   • For the alternative H1 : µd < 0, we reject H0 at level α if T < −t_{n−1}(α).
   • For the alternative H1 : µd > 0, we reject H0 at level α if T > t_{n−1}(α).
   • For the alternative H1 : µd ≠ 0, we reject H0 at level α if |T| > t_{n−1}(α/2).
   If we calculate the exact p-value, then, of course, we use the general rules: if p ≤ α, we reject H0 at level α; otherwise we fail to reject H0 at level α.

6. Interpret your conclusion practically in the context of the question.

Note: If you are not given α, find the lowest level at which we can reject H0 and the highest level at which we cannot reject H0, following the rules in step 5 for using critical values.

Example: Two different analytical methods can be used to determine the percentage of iron in chemical compounds. Do they give, on average, different results?

Compound no. (i)   % iron, Method A (xi)   % iron, Method B (yi)   Difference di = yi − xi
 1                 13.6                    13.5                    −0.1
 2                 12.9                    12.6                    −0.3
 3                 12.2                    12.2                     0.0
 4                  9.9                    10.1                     0.2
 5                  3.9                     3.7                    −0.2
 6                  5.4                     5.1                    −0.3
 7                  8.0                     7.9                    −0.1
 8                  8.8                     8.7                    −0.1
 9                 12.0                    11.6                    −0.4
10                  5.1                     5.1                     0.0

First note that the samples are paired, since we have two measurements for each compound; for each compound i we have the iron percentage measured by Method A and by Method B.
As the next step we should check whether the assumption for the test is satisfied, that is, whether the differences are from a normal distribution. We construct a histogram of the differences (using R). We can see that the histogram looks reasonably symmetrical, and not too different from the shape of the normal distribution. It seems that the differences come from a normal distribution.

Since we are interested in whether the two methods give different percentages of iron, we use a two-sided test. So, we test H0 : µd = 0 versus H1 : µd ≠ 0.

For the differences we compute d̄ and sd:

d̄ = (1/10) Σ di = −0.13,
sd = √( (1/9) Σ (di − d̄)² ) = 0.176698.

The test statistic is

T = −0.13 / (0.176698/√10) = −2.327,

which we compare with critical values obtained from the t_{n−1} = t_9 distribution. Using the tables, we find:
• for a two-sided test at the 5% level, the critical value is t_9(0.025) = 2.262,
• for a two-sided test at the 2% level, the critical value is t_9(0.01) = 2.821.

Since 2.821 > |−2.327| > 2.262, we reject H0 at the 5% level but not at the 2% level. So the p-value is in (0.02, 0.05). There is evidence that the two methods give different percentage iron measurements.

Let us compute the exact p-value. The test statistic is T = −2.327, so the p-value is

p = Pr(t_9 ≤ −2.327 or t_9 ≥ 2.327)
  = 2 × (1 − Pr(t_9 ≤ 2.327))
  = 2 × (1 − 0.9775163)    (using R to find Pr(t_9 ≤ 2.327))
  = 0.0449674 ≈ 0.045.

Since 0.02 < p < 0.05, this confirms that we reject H0 at the 5% level (but not at the 2% level, or even the 4% level).

VIDEO (example)

Confidence interval for the difference of means

Adapting the formula for the CI for the population mean (with unknown σ²), we get the formula for the 100(1 − α)% confidence interval for µd:

[ d̄ − t_{n−1}(α/2) sd/√n ,  d̄ + t_{n−1}(α/2) sd/√n ].

Example: Find the 95% confidence interval for µd in the previous example. Using the above formula, the 95% CI for µd is

d̄ ± t_9(0.025) sd/√n = −0.13 ± 2.262 × 0.176698/√10 = [−0.2564, −0.0036].
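As an illustrative check (our own sketch, not part of the notes), the whole calculation can be reproduced from the raw differences with the Python standard library:

```python
from math import sqrt
from statistics import mean, stdev

# Differences d_i = y_i - x_i for the ten compounds in the example above
d = [-0.1, -0.3, 0.0, 0.2, -0.2, -0.3, -0.1, -0.1, -0.4, 0.0]

n = len(d)
d_bar, s_d = mean(d), stdev(d)      # -0.13 and about 0.1767
T = d_bar / (s_d / sqrt(n))         # paired t statistic, about -2.327

# 95% CI for mu_d using the tabulated critical value t_9(0.025) = 2.262
half_width = 2.262 * s_d / sqrt(n)
ci = (d_bar - half_width, d_bar + half_width)   # about (-0.2564, -0.0036)
```

Note that `stdev` uses the n − 1 divisor, matching the sample standard deviation sd defined above.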
If we were asked to test H0 : µd = 0 versus H1 : µd ≠ 0 at the 5% level, we could use the above CI to reach a conclusion. We can see that this CI does not contain zero, which means that we can reject H0 at the 5% level, thus confirming evidence of a non-zero difference between Methods A and B.

QUIZ

7.2 Independent two sample tests

Now we will discuss tests to compare two independent samples coming from different populations, to determine whether the population means differ. We introduce the following notation:
• Sample 1: X11, . . . , X1n1
• Sample 2: X21, . . . , X2n2
• Sample sizes n1, n2, with sample means X̄1, X̄2.

The assumptions for the independent two-sample t-test are:
• The samples are independent, and each sample comes from a normal distribution with the same variance; that is, mathematically, X1i ∼ N(µ1, σ²) and X2i ∼ N(µ2, σ²) independently.
• Note: we are assuming the same variance.

We want to test H0 : µ1 = µ2 versus some alternative, e.g. H1 : µ1 ≠ µ2. Since all observations are independent, we have

X̄1 − X̄2 ∼ N( µ1 − µ2, σ²/n1 + σ²/n2 ).

(Note that here we are using the known fact that a linear combination of independent normally distributed random variables is normally distributed.) This means that the sampling distribution of X̄1 − X̄2 has mean µ1 − µ2 and standard error σ√(1/n1 + 1/n2).

Exercise: Show that E[X̄1 − X̄2] = µ1 − µ2 and Var[X̄1 − X̄2] = σ²/n1 + σ²/n2.

As before, there are two cases of the test, depending on whether σ is known or not.

7.2.1 Known common standard deviation – independent two-sample Z-test

Using a method similar to that for deriving the confidence interval for µ, we can find the formula for the confidence interval for the difference of means µ1 − µ2. A 100(1 − α)% confidence interval for µ1 − µ2 is

(X̄1 − X̄2) ± z(α/2) σ√(1/n1 + 1/n2),  that is,  (X̄1 − X̄2) ± z(α/2) × s.e.(X̄1 − X̄2).
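As a sketch of this interval and the associated standardised statistic (our own illustration; the function name and interface are assumptions):

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(xbar1, xbar2, sigma, n1, n2, conf=0.95):
    """Two-sample Z statistic and CI for mu1 - mu2 with known common sigma.
    Returns (Z, (lower, upper))."""
    se = sigma * sqrt(1 / n1 + 1 / n2)            # s.e. of X1-bar - X2-bar
    z = (xbar1 - xbar2) / se                      # standardised under H0: mu1 = mu2
    zc = NormalDist().inv_cdf((1 + conf) / 2)     # critical value z(alpha/2)
    diff = xbar1 - xbar2
    return z, (diff - zc * se, diff + zc * se)
```

With the red-blood-cell data used later in this section (means 45.8 and 40.6, σ = 3.0, n1 = 35, n2 = 25), this gives Z ≈ 6.62 and a 95% interval of about (3.66, 6.74), matching the worked example.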
To test $H_0: \mu_1 = \mu_2$ versus a one-sided or two-sided alternative, we use the two-sample Z-statistic
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}.$$
Note that under $H_0$, this statistic has the standard normal distribution $N(0,1)$.

Example: We want to compare the concentration of red blood cells in men and women. A random sample of men and a random sample of women were taken and their concentrations of red blood cells were measured. The following summary statistics were obtained:

             Males   Females
Sample mean   45.8      40.6
Sample s.d.    2.8       2.9
Sample size     35        25

Assume that the concentrations of red blood cells in men and women are each normally distributed, and suppose it is known that $\sigma = 3.0$. Find the 95% confidence interval for the difference between the population mean concentrations in men and women. Determine whether the concentration of red blood cells in men and women differs.

Let us denote the statistics for men with subscript 1 and for women with subscript 2. The 95% CI for $\mu_1 - \mu_2$ is
$$(45.8 - 40.6) \pm 1.96 \times 3.0\sqrt{\frac{1}{35} + \frac{1}{25}} = [3.6603,\; 6.7397].$$
We are 95% confident that the difference between the population mean concentrations of red blood cells in men and women is between 3.6603 and 6.7397.

If we were asked to test at the 5% significance level, we could use the above interval to test our hypothesis. Since zero is not in the confidence interval, at the 5% level we reject $H_0: \mu_1 = \mu_2$ in favour of $H_1: \mu_1 \neq \mu_2$. This gives evidence that the population means differ.

However, since we were not given $\alpha$, we look for the lowest level at which we can reject $H_0$, by calculating the test statistic and comparing it to critical values. We test $H_0: \mu_1 = \mu_2$ versus $H_1: \mu_1 \neq \mu_2$. The test statistic is
$$Z = \frac{45.8 - 40.6}{3\sqrt{\frac{1}{35} + \frac{1}{25}}} = 6.6193.$$
We compare this with critical values obtained from the standard normal tables. From tables, $z(0.000005) = 4.4172$, which corresponds to the smallest level available in our tables. Since $z(0.000005) = 4.4172 < 6.6193$, we can reject $H_0$ at the 0.001% level.
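As a hedged cross-check of the example above (the notes themselves work from normal tables and R), the Z statistic and confidence interval can be computed in Python with scipy:

```python
from math import sqrt
from scipy import stats

# Summary statistics from the red-blood-cell example above
xbar1, xbar2 = 45.8, 40.6   # sample means (males, females)
n1, n2 = 35, 25             # sample sizes
sigma = 3.0                 # known common standard deviation

se = sigma * sqrt(1 / n1 + 1 / n2)        # s.e. of the difference in means
z_stat = (xbar1 - xbar2) / se             # two-sample Z statistic
z_crit = stats.norm.ppf(0.975)            # z(0.025) = 1.96
ci = (xbar1 - xbar2 - z_crit * se, xbar1 - xbar2 + z_crit * se)
p_val = 2 * stats.norm.sf(abs(z_stat))    # two-sided p-value

print(round(z_stat, 4), [round(x, 4) for x in ci], p_val)
```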
This means that the p-value is less than 0.001, suggesting that there is very strong evidence that the mean concentrations of red blood cells in men and women are different.

7.2.2 Unknown common standard deviation – independent two-sample t-test

Usually we do not know $\sigma$, and so we have to estimate it. If it is reasonable to assume both populations have the same variance (we show how to check this in the following chapter), we use the pooled estimate of variance:
$$s_p^2 = \frac{\sum_{i=1}^{n_1}\left(X_{1i} - \bar{X}_1\right)^2 + \sum_{i=1}^{n_2}\left(X_{2i} - \bar{X}_2\right)^2}{n_1 + n_2 - 2} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.$$
- $s_p^2$ is a better estimate of the variance than the individual sample variances when assuming equality of variances.
- $s_p^2$ is a weighted average of the individual sample variances, with weights $n_1 - 1$ and $n_2 - 1$.
- $s_p^2$ is an unbiased estimator of $\sigma^2$.
- The estimate of the standard deviation is $\hat{\sigma} = s_p$.
- The estimated standard error of $\bar{X}_1 - \bar{X}_2$ is $s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ (just replace $\sigma$ with its estimate $s_p$ in $\text{s.e.}(\bar{X}_1 - \bar{X}_2)$).

A $100(1-\alpha)\%$ CI for $\mu_1 - \mu_2$ is
$$\left(\bar{X}_1 - \bar{X}_2\right) \pm t_{n_1+n_2-2}\!\left(\tfrac{\alpha}{2}\right)s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \quad \text{that is,} \quad \left(\bar{X}_1 - \bar{X}_2\right) \pm t_{n_1+n_2-2}\!\left(\tfrac{\alpha}{2}\right) \times \text{e.s.e.}\!\left(\bar{X}_1 - \bar{X}_2\right).$$
To test $H_0: \mu_1 = \mu_2$ versus an alternative, we use the two-sample t-statistic
$$T = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}.$$
Under $H_0$, this statistic has a t-distribution with $n_1 + n_2 - 2$ degrees of freedom. This test is called the independent two-sample t-test.

VIDEO (proof that $T$ follows $t_{n_1+n_2-2}$)

Let us summarise how to conduct an independent two-sample t-test. To carry out an independent two-sample t-test we use the following steps:

1. State the hypotheses: the null $H_0: \mu_1 = \mu_2$, and the alternative $H_1: \mu_1 < \mu_2$, $H_1: \mu_1 > \mu_2$, or $H_1: \mu_1 \neq \mu_2$.
2. For each sample calculate the sample mean and the sample variance, and then the pooled sample variance $s_p^2$ given by
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.$$
3. Calculate the test statistic
$$T = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}.$$
4.
Compare the value of the test statistic to the critical value in the $t_{n_1+n_2-2}$ distribution tables for the relevant significance level, remembering to halve the significance level for a two-tailed test. Alternatively, compute the p-value.
5. Decide whether to reject or not to reject the null hypothesis at $\alpha$. As shown before, $T \sim t_{n_1+n_2-2}$, and so we use critical values $t_{n_1+n_2-2}(\alpha)$ or $t_{n_1+n_2-2}(\alpha/2)$ as follows:
   - For the alternative $H_1: \mu_1 < \mu_2$, we reject $H_0$ at $\alpha$ if $T < -t_{n_1+n_2-2}(\alpha)$.
   - For the alternative $H_1: \mu_1 > \mu_2$, we reject $H_0$ at $\alpha$ if $T > t_{n_1+n_2-2}(\alpha)$.
   - For the alternative $H_1: \mu_1 \neq \mu_2$, we reject $H_0$ at $\alpha$ if $|T| > t_{n_1+n_2-2}(\alpha/2)$.
   If we calculate the exact p-value, then, of course, we use the general rules: if $p \le \alpha$, we reject $H_0$ at $\alpha$; otherwise we fail to reject $H_0$ at $\alpha$.
6. Interpret your conclusion practically in the context of the question.

Note: If you are not given $\alpha$, find the lowest level at which we can reject $H_0$ and the highest level at which we cannot reject $H_0$, following the rules in step 5 for using critical values.

Robustness of the independent two-sample t-test

Recall that the assumptions for the independent two-sample t-test are that the samples are independent and that each sample comes from a normal distribution with the same variance.
- The independent two-sample t-test is robust to non-normality. This means that if the samples are not from a normal distribution, it is still OK to use the test, provided the sample sizes are $n_1 \ge 30$ and $n_2 \ge 30$.
- The independent two-sample t-test with pooled variance estimate $s_p^2$ is reasonably robust to unequal population variances, at least when the sample sizes $n_1, n_2$ are similar.
- There is another form of the independent two-sample t-test for comparing means when the population variances are not equal – Welch's test (not covered in this module).

Example: We use the previous example about the concentration of red blood cells in men and women. However, this time we do not know $\sigma$.
Recall that $n_1 = 35$, $n_2 = 25$, $\bar{X}_1 = 45.8$, $\bar{X}_2 = 40.6$, $s_1 = 2.8$, $s_2 = 2.9$. Assume that the concentrations of red blood cells in men and women are each normally distributed and that the variances are equal. Find the 95% confidence interval for the difference between the population mean concentrations in men and women. Determine whether the concentration of red blood cells in men and women differs.

First we calculate the pooled sample variance:
$$s_p^2 = \frac{34 \times 2.8^2 + 24 \times 2.9^2}{35 + 25 - 2} = 8.0759 \quad \Rightarrow \quad s_p = 2.8418.$$
So, the 95% confidence interval for $\mu_1 - \mu_2$ is
$$(45.8 - 40.6) \pm t_{58}(0.025) \times 2.8418\sqrt{\frac{1}{35} + \frac{1}{25}} \approx 5.2 \pm 2.000 \times 2.8418\sqrt{\frac{1}{35} + \frac{1}{25}} = [3.712,\; 6.688].$$
Note: Since our tables don't include $t_{58}$, we use the closest available value, which is $t_{60}(0.025) = 2.000$.

We are 95% confident that the difference between the population mean concentrations of red blood cells in men and women is between 3.712 and 6.688. If we were asked to test at the 5% significance level, we could use the above interval to test our hypothesis. Since zero is not in the confidence interval, at the 5% level we reject $H_0: \mu_1 = \mu_2$ in favour of $H_1: \mu_1 \neq \mu_2$. This gives evidence that the population means differ.

To test $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$, we compute the test statistic:
$$T = \frac{45.8 - 40.6}{2.8418\sqrt{\frac{1}{35} + \frac{1}{25}}} = 6.988.$$
Since $t_{58}$ is not available in our tables, we use the closest available value, which is $t_{60}(0.0005) = 3.460$. Since $T > t_{60}(0.0005)$, we reject $H_0$ at the 0.1% level. That is, $p < 0.001$, and so there is very strong evidence of a difference in mean red blood cell concentrations between males and females.

VIDEO (example)

QUIZ

7.3 Paired versus independent tests

Using the earlier paired-samples example on iron content, let us show what happens if we (incorrectly) treat the samples as independent.
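Before working through the comparison, the pooled two-sample computation from summary statistics (used in the red-blood-cell example above, and again below) can be sketched in Python; this is a cross-check, not part of the notes' R workflow. Note that it uses the exact $t_{58}$ critical value rather than the $t_{60}$ table approximation, so the interval differs from the notes' in the third decimal place.

```python
from math import sqrt
from scipy import stats

# Summary statistics from the red-blood-cell example above
xbar1, s1, n1 = 45.8, 2.8, 35
xbar2, s2, n2 = 40.6, 2.9, 25

df = n1 + n2 - 2
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df   # pooled variance
sp = sqrt(sp2)
ese = sp * sqrt(1 / n1 + 1 / n2)                   # estimated standard error

t_stat = (xbar1 - xbar2) / ese                     # two-sample t statistic
t_crit = stats.t.ppf(0.975, df)                    # exact t_58(0.025)
ci = (xbar1 - xbar2 - t_crit * ese, xbar1 - xbar2 + t_crit * ese)
p_val = 2 * stats.t.sf(abs(t_stat), df)            # two-sided p-value

print(round(sp, 4), round(t_stat, 3), [round(x, 3) for x in ci])
```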
We test $H_0: \mu_A = \mu_B$ versus $H_1: \mu_A \neq \mu_B$, where $\mu_A$ is the population mean iron percentage for Method A and $\mu_B$ is the population mean iron percentage for Method B. For these data (iron content of compounds measured by Methods A and B), we find the following summary statistics:

                 Method A   Method B
Sample mean          9.18       9.05
Sample variance    12.368     12.356
Sample size            10         10

The pooled sample variance is
$$s_p^2 = \frac{9 \times 12.368 + 9 \times 12.356}{18} = 12.3605.$$
The two-sample t-statistic is
$$T = \frac{9.18 - 9.05}{\sqrt{12.3605}\,\sqrt{\frac{1}{10} + \frac{1}{10}}} = 0.0827.$$
This is compared with critical values obtained from the $t_{10+10-2} = t_{18}$ distribution. From tables, we find the critical value $t_{18}(0.05) = 1.734$. Since $1.734 > T$, we cannot reject $H_0$ at the 10% level. (In fact, we cannot reject $H_0$ even at the 80% level.) We conclude that there is no evidence that the two methods give different percentage iron measurements.

However, when we took the pairing into account, we did find evidence to reject $H_0$ at the 5% level. So, in this example, if we performed an independent t-test instead of a paired t-test, we would get a completely different conclusion.

The confidence interval would also be different. Treating the samples as independent, the 95% confidence interval for $\mu_A - \mu_B$ is
$$0.13 \pm t_{18}(0.025)\sqrt{12.3605}\,\sqrt{\frac{1}{10} + \frac{1}{10}} = 0.13 \pm 2.101 \times \sqrt{12.3605}\,\sqrt{0.2} = [-3.173,\; 3.433].$$
Taking the pairing into account, the 95% confidence interval was $[-0.2564, -0.0036]$, which is much narrower than $[-3.173, 3.433]$.

We can see that pairing provides better information, so it is important to recognise when samples are paired, so that an appropriate test can be used.

8 Inference for Variances

8.1 Inference for variance of a single population

For a single population, we estimate $\sigma^2$ by the sample variance $s^2$. We have shown before that $s^2$ is unbiased for $\sigma^2$. Before we discuss inference for the variance, we first have to find the sampling distribution of the sample variance $s^2$.

Sampling distribution of sample variance

Suppose $X_1, \ldots, X_n$ is a random sample from $N(\mu, \sigma^2)$; that is, $X_1, \ldots, X_n$ are independent and $X_i \sim N(\mu, \sigma^2)$ for all $i$. We want to show that $\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$.

Proof that $\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$: We know $Z_i = \frac{X_i - \mu}{\sigma} \sim N(0,1)$ for each $i$, with $Z_1, \ldots, Z_n$ independent. Hence
$$\sum_{i=1}^n Z_i^2 = \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} \sim \chi^2_n$$
by the definition of the chi-squared distribution. Notice that
$$\begin{aligned}
\sum_{i=1}^n (X_i - \mu)^2 &= \sum_{i=1}^n \left(X_i - \bar{X} + \bar{X} - \mu\right)^2 \\
&= \sum_{i=1}^n \left(X_i - \bar{X}\right)^2 + 2\sum_{i=1}^n \left(X_i - \bar{X}\right)\left(\bar{X} - \mu\right) + \sum_{i=1}^n \left(\bar{X} - \mu\right)^2 \\
&= \sum_{i=1}^n \left(X_i - \bar{X}\right)^2 + 2\left(\bar{X} - \mu\right)\sum_{i=1}^n \left(X_i - \bar{X}\right) + n\left(\bar{X} - \mu\right)^2 \\
&= \sum_{i=1}^n \left(X_i - \bar{X}\right)^2 + 2\left(\bar{X} - \mu\right)\left(n\bar{X} - n\bar{X}\right) + n\left(\bar{X} - \mu\right)^2 \\
&= \sum_{i=1}^n \left(X_i - \bar{X}\right)^2 + n\left(\bar{X} - \mu\right)^2.
\end{aligned}$$
Dividing both sides of the final result by $\sigma^2$, we get
$$\frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 = \frac{1}{\sigma^2}\sum_{i=1}^n \left(X_i - \bar{X}\right)^2 + \frac{n}{\sigma^2}\left(\bar{X} - \mu\right)^2.$$
We already know that $\frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu)^2 \sim \chi^2_n$. We also know $\bar{X} \sim N(\mu, \sigma^2/n)$, and so $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$. Squaring this random variable and using the definition of the $\chi^2$-distribution (with one random variable only), we have
$$\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}\right)^2 \sim \chi^2_1, \quad \text{that is,} \quad \frac{n}{\sigma^2}\left(\bar{X} - \mu\right)^2 \sim \chi^2_1.$$
Next, we use the following results:
- $s^2$ and $\bar{X}$ are independent.
- If $U, V$ are independent random variables with $U + V \sim \chi^2_{a+b}$ and $V \sim \chi^2_a$, then $U \sim \chi^2_b$.

Proofs of these results are omitted, since they are complicated and outside the syllabus of this module. Notice that
$$\frac{(n-1)s^2}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^n \left(X_i - \bar{X}\right)^2,$$
and so far we have found
$$\underbrace{\frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu)^2}_{\sim \chi^2_n} = \frac{(n-1)s^2}{\sigma^2} + \underbrace{\frac{n}{\sigma^2}\left(\bar{X} - \mu\right)^2}_{\sim \chi^2_1}.$$
Now applying the above result with $U = \frac{(n-1)s^2}{\sigma^2}$ and $V = \frac{n}{\sigma^2}\left(\bar{X} - \mu\right)^2$, we get that $\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$.

Confidence interval for the population variance

Now we use the distribution of $s^2$ to work out a confidence interval for $\sigma^2$.
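Before doing so, the distributional result just proved can be sanity-checked by simulation; a minimal Python sketch (the values $n = 10$, $\mu = 5$, $\sigma = 2$ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, reps = 10, 2.0, 20000

# Simulate (n-1)s^2/sigma^2 many times
samples = rng.normal(loc=5.0, scale=sigma, size=(reps, n))
s2 = samples.var(axis=1, ddof=1)        # sample variances
stat = (n - 1) * s2 / sigma**2

# A chi^2_{n-1} random variable has mean n-1 and variance 2(n-1)
print(stat.mean(), stat.var())
```

The simulated mean and variance should land close to $n-1 = 9$ and $2(n-1) = 18$, as the $\chi^2_{n-1}$ distribution predicts.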
For the $100(1-\alpha)\%$ confidence interval we have
$$\Pr\left(\chi^2_{n-1}\!\left(1 - \tfrac{\alpha}{2}\right) \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{n-1}\!\left(\tfrac{\alpha}{2}\right)\right) = 1 - \alpha$$
$$\Rightarrow \Pr\left(\frac{1}{\chi^2_{n-1}(1-\alpha/2)} \ge \frac{\sigma^2}{(n-1)s^2} \ge \frac{1}{\chi^2_{n-1}(\alpha/2)}\right) = 1 - \alpha$$
$$\Rightarrow \Pr\left(\frac{(n-1)s^2}{\chi^2_{n-1}(\alpha/2)} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{n-1}(1-\alpha/2)}\right) = 1 - \alpha.$$
So, the $100(1-\alpha)\%$ confidence interval for $\sigma^2$ is
$$\left[\frac{(n-1)s^2}{\chi^2_{n-1}(\alpha/2)},\; \frac{(n-1)s^2}{\chi^2_{n-1}(1-\alpha/2)}\right],$$
and the $100(1-\alpha)\%$ confidence interval for $\sigma$ is
$$\left[\sqrt{\frac{(n-1)s^2}{\chi^2_{n-1}(\alpha/2)}},\; \sqrt{\frac{(n-1)s^2}{\chi^2_{n-1}(1-\alpha/2)}}\right].$$

VIDEO (deriving the confidence interval for $\sigma^2$)

Example: Cholesterol level in a particular population was known to be normally distributed. A random sample of 20 people was taken and their cholesterol levels were measured. The sample mean was 182 and the sample standard deviation was 29.3. Find the 90% confidence interval for the population standard deviation of cholesterol level.

We are given $n = 20$, $\bar{x} = 182$, $s = 29.3$. First, we use tables to look up the relevant critical values: $\chi^2_{19}(0.05) = 30.14$ and $\chi^2_{19}(0.95) = 10.12$. Using the formula, the 90% CI for $\sigma^2$ is
$$\left[\frac{(20-1) \times 29.3^2}{\chi^2_{19}(0.05)},\; \frac{(20-1) \times 29.3^2}{\chi^2_{19}(0.95)}\right] = \left[\frac{19 \times 29.3^2}{30.14},\; \frac{19 \times 29.3^2}{10.12}\right] = [541.1848,\; 1611.7895].$$
So, the 90% CI for $\sigma$ is
$$\left[\sqrt{541.1848},\; \sqrt{1611.7895}\right] = [23.263,\; 40.147].$$
Interpretation: We are 90% confident that the population standard deviation of cholesterol level is between 23.263 and 40.147.

Notice that the confidence interval for $\sigma$ is not symmetric about $s = 29.3$, because the $\chi^2$-distribution is skewed.

QUIZ

8.2 Comparison of two population variances

For the independent two-sample t-test, we assumed both populations had the same (unknown) variance. In this section we discuss how we can check this. We use the notation introduced previously:
- Sample 1: $X_{11}, \ldots, X_{1n_1}$
- Sample 2: $X_{21}, \ldots, X_{2n_2}$
- with $X_{1i} \sim N(\mu_1, \sigma_1^2)$ and $X_{2i} \sim N(\mu_2, \sigma_2^2)$ independently.

We want to test whether the two population variances are the same; that is, we test $H_0: \sigma_1^2 = \sigma_2^2$ versus $H_1: \sigma_1^2 \neq \sigma_2^2$.
Equivalently, we can write $H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1$ versus $H_1: \frac{\sigma_1^2}{\sigma_2^2} \neq 1$.

For each sample, we know (from the one-sample theory in the previous section) that
$$\frac{(n_1 - 1)s_1^2}{\sigma_1^2} \sim \chi^2_{n_1-1} \quad \text{and} \quad \frac{(n_2 - 1)s_2^2}{\sigma_2^2} \sim \chi^2_{n_2-1} \quad \text{independently}.$$
Using the definition of the F-distribution, we have
$$\frac{\frac{(n_1-1)s_1^2}{\sigma_1^2}\Big/(n_1 - 1)}{\frac{(n_2-1)s_2^2}{\sigma_2^2}\Big/(n_2 - 1)} \sim F_{n_1-1,\,n_2-1} \quad \Rightarrow \quad \frac{s_1^2/s_2^2}{\sigma_1^2/\sigma_2^2} \sim F_{n_1-1,\,n_2-1}.$$

Confidence interval for the ratio of variances

We use the above result to find the confidence interval for the ratio of variances $\sigma_1^2/\sigma_2^2$. For the $100(1-\alpha)\%$ CI for $\sigma_1^2/\sigma_2^2$ we have:
$$\Pr\left(F_{n_1-1,\,n_2-1}(1-\alpha/2) \le \frac{s_1^2/s_2^2}{\sigma_1^2/\sigma_2^2} \le F_{n_1-1,\,n_2-1}(\alpha/2)\right) = 1 - \alpha$$
$$\Rightarrow \Pr\left(\frac{s_2^2}{s_1^2}\,F_{n_1-1,\,n_2-1}(1-\alpha/2) \le \frac{\sigma_2^2}{\sigma_1^2} \le \frac{s_2^2}{s_1^2}\,F_{n_1-1,\,n_2-1}(\alpha/2)\right) = 1 - \alpha$$
$$\Rightarrow \Pr\left(\frac{s_1^2}{s_2^2}\,\frac{1}{F_{n_1-1,\,n_2-1}(\alpha/2)} \le \frac{\sigma_1^2}{\sigma_2^2} \le \frac{s_1^2}{s_2^2}\,\frac{1}{F_{n_1-1,\,n_2-1}(1-\alpha/2)}\right) = 1 - \alpha.$$
The critical values $F_{m,n}(1-\alpha/2)$ can be found using
$$F_{m,n}(1-\alpha/2) = \frac{1}{F_{n,m}(\alpha/2)}.$$
(See tutorial sheet 5 for a proof of this.) So, the $100(1-\alpha)\%$ CI for $\sigma_1^2/\sigma_2^2$ is therefore
$$\left[\frac{s_1^2}{s_2^2}\,\frac{1}{F_{n_1-1,\,n_2-1}(\alpha/2)},\; \frac{s_1^2}{s_2^2}\,F_{n_2-1,\,n_1-1}(\alpha/2)\right].$$
For example, notice that $F_{6,7}(0.95)$ does not appear in our tables. However, it is the same as $\frac{1}{F_{7,6}(0.05)}$, where $F_{7,6}(0.05)$ is in our tables.

VIDEO (deriving the confidence interval for $\sigma_1^2/\sigma_2^2$)

Example: For two plants, A and B, we randomly choose 7 pods of plant A and 6 pods of plant B, and count the number of seeds in each pod. The following summary statistics were obtained:

Plant A: $n_1 = 7$, $\sum_{i=1}^{n_1}(x_{1i} - \bar{x}_1)^2 = 5.0141$
Plant B: $n_2 = 6$, $\sum_{i=1}^{n_2}(x_{2i} - \bar{x}_2)^2 = 0.6584$

Assume that the number of seeds per pod is normally distributed for both plants. We want to test at the 10% level whether the variances of the numbers of seeds per pod are the same for both plants. Use an appropriate confidence interval to test this.

We test $H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1$ versus $H_1: \frac{\sigma_1^2}{\sigma_2^2} \neq 1$, where $\sigma_1^2$ and $\sigma_2^2$ are the population variances of the seed counts per pod for plants A and B, respectively.
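As a sketch, the interval formula derived above can be scripted in Python with scipy; the seed-count summary statistics are plugged in for illustration (the function name `var_ratio_ci` is ours, not from the notes).

```python
from scipy import stats

def var_ratio_ci(s1_sq, s2_sq, n1, n2, alpha=0.10):
    """CI for sigma1^2/sigma2^2 based on the F distribution."""
    ratio = s1_sq / s2_sq
    # stats.f.ppf takes a lower-tail probability, so ppf(1 - a/2) is the
    # upper critical value F(alpha/2) in the notes' notation
    lower = ratio / stats.f.ppf(1 - alpha / 2, n1 - 1, n2 - 1)
    upper = ratio * stats.f.ppf(1 - alpha / 2, n2 - 1, n1 - 1)
    return ratio, (lower, upper)

# Seed-count summary statistics from the plant example above
s1_sq = 5.0141 / 6   # sample variance, plant A (n1 = 7)
s2_sq = 0.6584 / 5   # sample variance, plant B (n2 = 6)
ratio, ci = var_ratio_ci(s1_sq, s2_sq, 7, 6, alpha=0.10)
print(round(ratio, 4), [round(x, 3) for x in ci])
```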
First we find the point estimate of the ratio of variances:
$$\frac{s_1^2}{s_2^2} = \frac{\frac{1}{n_1-1}\sum_{i=1}^{n_1}(x_{1i} - \bar{x}_1)^2}{\frac{1}{n_2-1}\sum_{i=1}^{n_2}(x_{2i} - \bar{x}_2)^2} = \frac{5.0141/6}{0.6584/5} = 6.3463.$$
Now we find the relevant critical values:
$$F_{n_1-1,\,n_2-1}(0.05) = F_{6,5}(0.05) = 4.950, \qquad F_{n_2-1,\,n_1-1}(0.05) = F_{5,6}(0.05) = 4.387.$$
Using the formula, the 90% CI for $\sigma_1^2/\sigma_2^2$ is
$$\left[\frac{6.3463}{4.950},\; 6.3463 \times 4.387\right] = [1.282,\; 27.841].$$
The interval does not contain 1. This means that the hypothesis $H_0: \sigma_1^2 = \sigma_2^2$ is rejected at the 10% level in a two-sided test. This suggests that there is evidence, albeit weak, that the population variances of the numbers of seeds per pod are not equal for the two plants.

QUIZ

F-test for comparison of two population variances

To test $H_0: \sigma_1^2 = \sigma_2^2$ versus an alternative, we use the test statistic derived at the beginning of this chapter:
$$\frac{s_1^2/s_2^2}{\sigma_1^2/\sigma_2^2} \sim F_{n_1-1,\,n_2-1}.$$
Under $H_0$ we have $\frac{\sigma_1^2}{\sigma_2^2} = 1$, and so the test statistic is
$$F = \frac{s_1^2}{s_2^2}.$$
This test statistic is compared with critical values obtained from the F-distribution on $(n_1 - 1, n_2 - 1)$ degrees of freedom. The test for comparison of two population variances is often called the F-test.

Assumptions: The samples are independent and both samples come from normally distributed populations.

To carry out an F-test for comparison of two population variances we use the following steps:

1. State the hypotheses: the null $H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1$, and the alternative $H_1: \frac{\sigma_1^2}{\sigma_2^2} < 1$, $H_1: \frac{\sigma_1^2}{\sigma_2^2} > 1$, or $H_1: \frac{\sigma_1^2}{\sigma_2^2} \neq 1$.
2. For each sample calculate the sample variances $s_1^2$ and $s_2^2$.
3. Calculate the test statistic $F = \frac{s_1^2}{s_2^2}$.
4. Compare the value of the test statistic to the critical value in the $F_{n_1-1,\,n_2-1}$ distribution tables for the relevant significance level, remembering to halve the significance level for a two-tailed test. Alternatively, compute the p-value.
5. Decide whether to reject or not to reject the null hypothesis at $\alpha$.
As shown before, $F \sim F_{n_1-1,\,n_2-1}$, and so we use critical values from the $F_{n_1-1,\,n_2-1}$-distribution as follows:
   - For the alternative $H_1: \frac{\sigma_1^2}{\sigma_2^2} < 1$, we reject $H_0$ at $\alpha$ if $F < F_{n_1-1,\,n_2-1}(1-\alpha)$.
   - For the alternative $H_1: \frac{\sigma_1^2}{\sigma_2^2} > 1$, we reject $H_0$ at $\alpha$ if $F > F_{n_1-1,\,n_2-1}(\alpha)$.
   - For the alternative $H_1: \frac{\sigma_1^2}{\sigma_2^2} \neq 1$, we reject $H_0$ at $\alpha$ if $F < F_{n_1-1,\,n_2-1}(1-\alpha/2)$ or $F > F_{n_1-1,\,n_2-1}(\alpha/2)$.
   If we calculate the exact p-value, then, of course, we use the general rules: if $p \le \alpha$, we reject $H_0$ at $\alpha$; otherwise we fail to reject $H_0$ at $\alpha$.
6. Interpret your conclusion practically in the context of the question.

Notes:
- $F_{n_1-1,\,n_2-1}(1-\alpha)$ is not in our tables, so we look up $F_{n_2-1,\,n_1-1}(\alpha)$ and then use the relation $F_{n_1-1,\,n_2-1}(1-\alpha) = \frac{1}{F_{n_2-1,\,n_1-1}(\alpha)}$.
- For the two-sided test, if we put the larger sample variance in the numerator of the test statistic $F = \frac{s_1^2}{s_2^2}$, then the rejection rule simplifies: we reject $H_0$ at $\alpha$ if $F > F_{n_1-1,\,n_2-1}(\alpha/2)$. This is because placing the larger sample variance in the numerator forces the F-test into an upper-tailed test, which uses the critical values in the upper tail of the F-distribution that can be found directly in our tables.
- If you are not given $\alpha$, find the lowest level at which we can reject $H_0$ and the highest level at which we cannot reject $H_0$, following the rules in step 5 for using critical values.

Sensitivity of the F-test for equality of variances:
- Unlike the t-tests for means, the F-test for equality of variances is not robust.
- The F-test for equality of variances is extremely sensitive to non-normality of the populations, whatever the sample size.
- That is why the F-test and confidence intervals for the ratio of variances are not very useful in practice.

Example: We use the previous example about seeds per pod for two plants. This time we want to use the F-test to determine whether the variances of the numbers of seeds per pod are the same for both plants. We are not given the significance level.
We test $H_0: \frac{\sigma_1^2}{\sigma_2^2} = 1$ versus $H_1: \frac{\sigma_1^2}{\sigma_2^2} \neq 1$, where $\sigma_1^2$ and $\sigma_2^2$ are the population variances of the seed counts per pod for plants A and B, respectively. We have calculated the test statistic before: $F = 6.3463$, where the larger sample variance was in the numerator. The relevant critical values from the available tables are
$$F_{6,5}(0.05) = 4.950, \quad F_{6,5}(0.025) = 6.978, \quad F_{6,5}(0.01) = 10.672.$$
We can see that $4.950 < F = 6.3463 < 6.978$, and so we can reject $H_0$ at the 10% level, but not at the 5% level. So $0.05 < p < 0.10$. There is evidence, albeit weak, that the population variances of the numbers of seeds per pod are not equal for the two plants.

VIDEO (example)

QUIZ

9 Normal probability plot and normality test

In this chapter we discuss how to check whether a sample is from a normal distribution. To check normality, we can plot a histogram of the sample and see whether it looks roughly like the symmetrical bell shape of the normal distribution. However, it may be difficult to judge this by eye. A better method is to construct a normal probability plot for the sample and/or to perform a normality test.

Normal probability plot

Consider a sample of data $x_1, x_2, \ldots, x_n$ and order the observations from smallest to largest: $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Now we consider the sample cumulative distribution function. Recall that the cumulative distribution function (for a population) is $F(x) = \Pr(X \le x)$. The sample cumulative distribution function (cdf) is
$$F_n(x) = \text{proportion of observations less than or equal to } x.$$
So, the proportion of observations less than or equal to $x_{(i)}$ is $\frac{i}{n}$. To avoid 'end effects', we use the values $\frac{i}{n+1}$ instead of $\frac{i}{n}$. (If we used $n$ instead of $n+1$ in the denominator, then the estimated value of the sample cdf at $x_{(n)}$ would always be 1. Then the final point in the normal probability plot would be $\left(x_{(n)}, \Phi^{-1}(1)\right)$. But since the normal distribution is continuous, $\Phi(x) \to 1$ as $x \to +\infty$. So $\Phi^{-1}(1) = \infty$, which we cannot plot.)
To get a graph of the sample cdf, we plot the points
$$\left(x_{(1)}, \tfrac{1}{n+1}\right),\; \left(x_{(2)}, \tfrac{2}{n+1}\right),\; \ldots,\; \left(x_{(n)}, \tfrac{n}{n+1}\right).$$
If the data are from a normal distribution, this should have the shape of the normal cdf $\Phi(x)$. Recall the shape of $\Phi(x)$ (an S-shaped curve rising from 0 to 1).

However, looking at the shape of the sample cdf, it is difficult to judge whether it looks like the normal curve $\Phi(x)$. So instead, we transform the vertical axis and plot the points
$$\left(x_{(1)}, \Phi^{-1}\!\left(\tfrac{1}{n+1}\right)\right),\; \left(x_{(2)}, \Phi^{-1}\!\left(\tfrac{2}{n+1}\right)\right),\; \ldots,\; \left(x_{(n)}, \Phi^{-1}\!\left(\tfrac{n}{n+1}\right)\right).$$
This plot is called the normal probability plot. If the data are from a normal population, the points should form (approximately) a straight line. Systematic deviations from a straight line indicate non-normal data. We could produce a normal probability plot using normal tables and graph paper, but it would be very time consuming. So, in practice, normal probability plots are produced using statistical software.

Example: Let us construct the normal probability plot for the sample:
12.1, 12.5, 12.8, 12.4, 11.3, 10.3, 10.4, 11.2, 9.5, 11.7, 11.2
We use R to do it. We can see that the points are reasonably close to a straight line, so it seems reasonable to assume that the data come from a normal distribution.

Normality test

To formally test whether data come from a normal distribution, we perform a normality test, which gives us a p-value. There are several normality tests; in this module we use the Anderson-Darling test for normality. The normality test tests $H_0$: data normal against $H_1$: not normal. So, if the p-value is bigger than 0.1, this indicates that there is no evidence against normality, and we may conclude that our data appear to come from a normal distribution.

Note: We will not cover the theory behind the normality test, since it is complicated and outside the syllabus of this module. You just need to know what the normality test is used for, how to perform it using R, and how to interpret the results.
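The notes use R for both the plot and the test; a rough Python equivalent using scipy is sketched below with the sample above. `probplot` returns the probability-plot points and their correlation `r` (close to 1 suggests a straight line), while `anderson` returns the Anderson-Darling statistic with tabulated critical values rather than a p-value.

```python
from scipy import stats

x = [12.1, 12.5, 12.8, 12.4, 11.3, 10.3, 10.4, 11.2, 9.5, 11.7, 11.2]

# Normal probability plot: osm are theoretical normal quantiles, osr the
# ordered sample values; r is the correlation of the plotted points
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")

# Anderson-Darling test: compare the statistic with the critical values
res = stats.anderson(x, dist="norm")
print(round(r, 3), res.statistic, res.critical_values, res.significance_level)
```

Here the statistic falls below even the most lenient tabulated critical value, i.e. there is no evidence against normality, in agreement with the conclusion from R.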
Example: For the above sample, using R, we find that the p-value for the Anderson-Darling test for normality is $p = 0.738$. We cannot reject $H_0$ at the 10% level, or the 20% level, or even the 70% level. There is no evidence that the data do not come from a normal distribution. So, we can conclude that the data appear to come from a normal distribution.

VIDEO (example)

QUIZ

10 Comparison of k population means: ANOVA

In this chapter we discuss a statistical methodology for comparing the means of several populations. This method is called analysis of variance, or simply ANOVA.

10.1 Introducing the ANOVA Model

We motivate the discussion with the following real-world example.

ANOVA Example

The table below shows activity levels (measured in non-standard units) of rats on various doses of THC (tetrahydrocannabinol, the major active ingredient of marijuana).

Placebo  0.1 µg  0.5 µg  1 µg  2 µg
     30      60      71    33    36
     27      42      50    78    27
     52      48      38    71    60
     38      52      59    58    51
     20      28      65    35    29
     26      93      58    35    34
      8      32      74    46    24
     41      46      67    32    17
     49      63      61    50    49

        Placebo   0.1 µg   0.5 µg     1 µg     2 µg
n_i          10        9        9        8        9
x̄_i     34.000   51.556   60.333   48.500   36.444
s_i     14.298   19.340   11.068   18.323   14.293

We want to investigate whether different doses of THC have different effects on the activity levels of rats. In other words, we want to know whether the population means of the activity levels differ. Note that we have $k = 5$ groups. We denote the mean activity levels of the placebo group and of the groups with doses of 0.1 µg, 0.5 µg, 1 µg and 2 µg by $\mu_1, \mu_2, \mu_3, \mu_4, \mu_5$ respectively. So, we want to test
$$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 \quad \text{versus} \quad H_1: \mu_i \text{ not all equal}.$$
For a first informal assessment of the data, we can use boxplots, which help us see the within-group variation.

VIDEO (main idea behind ANOVA)

ANOVA model

Now we introduce the ANOVA model, which provides a convenient way to set out the notation and assumptions that are the foundation for the data analysis. When analysing data, we look for an overall pattern and deviations from it.
This can be expressed by the general equation
$$\text{data} = \text{fit} + \text{residual}.$$
For example, the statistical model for a random sample of observations $X_1, \ldots, X_n$ from a single normally distributed population with mean $\mu$ and variance $\sigma^2$ is given by
$$X_i = \mu + \epsilon_i, \quad \text{where } \epsilon_i \sim N(0, \sigma^2).$$
The fit part is represented by $\mu$, and the residual part by $\epsilon_i$, which represents the deviation of the data from the fit (due to random variation). The unknown parameters $\mu$ and $\sigma^2$ in this model can be estimated by the sample mean and the sample variance; the $\epsilon_i$ are estimated by the residuals $e_i = x_i - \bar{x}$.

The ANOVA model is derived in a similar way.
- We have $k$ independent samples, from different normally distributed populations.
- Observations, or response values, are denoted in the following way:
  Group 1: $Y_{11}, Y_{12}, \ldots, Y_{1n_1}$
  Group 2: $Y_{21}, Y_{22}, \ldots, Y_{2n_2}$
  ...
  Group k: $Y_{k1}, \ldots, Y_{kn_k}$
- We have $n_i$ observations from group $i$, for $i = 1, 2, \ldots, k$.
- Note that $Y_{ij}$ is response $j$ from group $i$.
- We assume that $Y_{ij} \sim N(\mu_i, \sigma^2)$ for parameters $\mu_1, \ldots, \mu_k$ and $\sigma^2$.
- Note that we assume a common variance $\sigma^2$ for all groups – this is called the homoscedastic assumption.
- The population mean for group $i$ is denoted by $\mu_i$. Our goal is to test whether these population means differ.

A standard form of the ANOVA model is
$$Y_{ij} = \mu_i + \epsilon_{ij} \quad \text{for } i = 1, 2, \ldots, k \text{ and } j = 1, 2, \ldots, n_i,$$
where $\mu_1, \ldots, \mu_k$ are parameters and the $\epsilon_{ij}$ are independent random errors with $\epsilon_{ij} \sim N(0, \sigma^2)$ for each $i, j$. This model is sometimes called the means model.

There is an alternative formulation of the model, given by
$$Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \quad \text{for } i = 1, 2, \ldots, k \text{ and } j = 1, 2, \ldots, n_i,$$
where $\mu$ is the overall mean and $\alpha_i = \mu_i - \mu$ is the effect of the $i$th group, or the effect of the $i$th treatment. This model is sometimes called the effects model.

In what follows, we consider the means ANOVA model.
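The means model is easy to simulate, which is a useful way to see what its assumptions mean in practice. A minimal Python sketch follows; the group means, group sizes and common $\sigma$ are arbitrary illustrative choices (loosely inspired by the THC example), not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(1)

mu = [34.0, 51.6, 60.3, 48.5, 36.4]   # illustrative group means mu_i
n = [10, 9, 9, 8, 9]                  # group sizes n_i
sigma = 15.0                          # common standard deviation (homoscedastic)

# Y_ij = mu_i + eps_ij with eps_ij ~ N(0, sigma^2), independently
groups = [m + sigma * rng.standard_normal(ni) for m, ni in zip(mu, n)]

# The group sample means estimate the mu_i
mu_hat = [g.mean() for g in groups]
print([round(m, 2) for m in mu_hat])
```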
Next, we estimate the parameters in the ANOVA model, which is necessary for the derivation of the statistical method for comparing several population means.

Estimating parameters in the ANOVA model – Least Squares estimation

Our aim is to find estimates of the $\mu_i$ (denoted by $\hat{\mu}_i$) such that the errors are as small as possible. A standard way to do this is the Least Squares method, where we minimise the sum of squared errors. We have
$$S = \sum_{i=1}^k \sum_{j=1}^{n_i} \epsilon_{ij}^2 = \sum_{i=1}^k \sum_{j=1}^{n_i} \left(y_{ij} - \mu_i\right)^2.$$
Then our estimates $\hat{\mu}_i$ are the values which make $S$ as small as possible. To find the minimum point, we differentiate $S$:
$$\frac{dS}{d\mu_i} = \sum_{j=1}^{n_i} -2\left(y_{ij} - \mu_i\right) \quad \text{for all } i.$$
From calculus we know that at a minimum or maximum point these derivatives are all zero. Note that $S$ is a quadratic function, with a unique minimum point and no local maximum point. Setting $dS/d\mu_i = 0$ gives
$$\sum_{j=1}^{n_i} \left(y_{ij} - \mu_i\right) = 0 \text{ for each } i \;\Rightarrow\; \sum_{j=1}^{n_i} y_{ij} - n_i\mu_i = 0 \;\Rightarrow\; \hat{\mu}_i = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij} = \bar{y}_{i\cdot},$$
where $\bar{y}_{i\cdot} = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}$ is the group $i$ sample mean. A dot in a subscript position of $\bar{y}_{i\cdot}$ indicates that we average over all values of that subscript. This kind of notation is commonly used and will appear throughout this chapter.

To sum up, the Least Squares parameter estimates for the ANOVA model are $\hat{\mu}_i = \bar{y}_{i\cdot}$.

For the following discussion, we also define the overall sample mean
$$\bar{y}_{\cdot\cdot} = \frac{1}{N}\sum_{i=1}^k \sum_{j=1}^{n_i} y_{ij}, \quad \text{where } N = n_1 + \cdots + n_k \text{ is the total sample size}.$$

10.2 ANOVA Hypothesis Test

In this section we introduce the theory behind the ANOVA test and how it works. As mentioned before, the ANOVA test is used to test whether the group means $\mu_i$ are all the same or whether some of the means differ. To test this, we investigate whether the differences between the group sample means $\bar{y}_{i\cdot}$ are big enough to be statistically significant. So, we want to see whether the sample means $\bar{y}_{1\cdot}, \bar{y}_{2\cdot}, \ldots, \bar{y}_{k\cdot}$ are very similar, or whether they are very spread out.
The way to measure the spread of a set of numbers is with their variance, but we also want to take the group sizes into account. We now define the quantities which measure the variability, and the average variability, between the group sample means while taking the group sizes into account:
- Between Groups Sum of Squares:
$$SSG = \sum_{i=1}^k n_i\left(\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot}\right)^2$$
- Between Groups Mean Square:
$$MSG = \frac{SSG}{k-1}$$

VIDEO (SSG)

If $MSG$ is big, that suggests that the sample means are too spread out, which indicates real differences between the groups. But how big is big enough to be significant? To answer this, we look at whether $MSG$ is big compared with the variance $\sigma^2$. Since $\sigma^2$ is an unknown parameter, the next step is to estimate $\sigma^2$.

Estimating the variance $\sigma^2$

We have assumed $\sigma^2$ is the same for every group, so the sample variances $s_1^2, s_2^2, \ldots, s_k^2$ all estimate the same population variance $\sigma^2$. To create a better estimate, we 'pool' all these sample variances, just as we did in the case of two independent groups. Recall that for two independent groups, the estimate of $\sigma^2$ is
$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}.$$
We generalise this to $k$ independent groups and use the estimate of $\sigma^2$ given by
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^k (n_i - 1)s_i^2}{N - k}, \quad \text{where } N = n_1 + n_2 + \cdots + n_k.$$
Notice that
$$E\left[\hat{\sigma}^2\right] = \frac{\sum_{i=1}^k (n_i - 1)E[s_i^2]}{N - k} = \frac{\sum_{i=1}^k (n_i - 1)\sigma^2}{N - k} = \sigma^2\,\frac{\left(\sum_{i=1}^k n_i\right) - k}{N - k} = \sigma^2, \quad \text{since } N = \sum_{i=1}^k n_i.$$
That is, $\hat{\sigma}^2$ is an unbiased estimator of $\sigma^2$.

Next we define the quantities which measure the variation, and the average variation, of the data within the groups:
- Error Sum of Squares:
$$SSE = \sum_{i=1}^k \sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_{i\cdot}\right)^2$$
- Error Mean Square:
$$MSE = \frac{SSE}{N - k}$$
- These are also known as the 'Within Groups' SS and MS.

VIDEO (SSE)

Similarly, we define the Total Sum of Squares as
$$SST = \sum_{i=1}^k \sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_{\cdot\cdot}\right)^2 = (N-1)s_T^2, \quad \text{where } s_T^2 = \frac{1}{N-1}\sum_{i=1}^k \sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_{\cdot\cdot}\right)^2.$$
It can be shown (this is not required in this module) that
$$SST = SSE + SSG.$$
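The decomposition $SST = SSE + SSG$ is easy to check numerically. A minimal Python sketch on simulated groups (arbitrary seed, means and group sizes chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
groups = [rng.normal(m, 5.0, size=n) for m, n in [(10, 8), (12, 10), (15, 7)]]

y_all = np.concatenate(groups)
grand = y_all.mean()                                          # overall mean

ssg = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)   # between groups
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)        # within groups
sst = ((y_all - grand) ** 2).sum()                            # total

print(round(ssg + sse, 6), round(sst, 6))
```

Up to floating-point rounding, `ssg + sse` equals `sst` for any data, not just normal samples, since the identity is purely algebraic.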
Note that $MSE$ is in fact equal to $\hat{\sigma}^2$: the sample variance of group $i$ is
$$s_i^2 = \frac{1}{n_i - 1}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_{i\cdot}\right)^2,$$
so our variance estimate is
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^k (n_i - 1)s_i^2}{N - k} = \frac{\sum_{i=1}^k \sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_{i\cdot}\right)^2}{N - k} = MSE.$$
We also have $\hat{\sigma} = \sqrt{MSE}$.

The question now becomes: is $MSG$ big compared with $MSE$? $MSG$ is in fact another estimator of $\sigma^2$ (shown below), and so we actually compare two different estimators of the variance – hence the name 'analysis of variance'. To answer the question, we need the distributions of $MSE$ and $MSG$. It can be shown (not required) that
- $\frac{SSE}{\sigma^2} \sim \chi^2_{N-k}$;
- if $H_0$ is true, then $\frac{SST}{\sigma^2} \sim \chi^2_{N-1}$ and $\frac{SSG}{\sigma^2} \sim \chi^2_{k-1}$.

Expectations of MSE and MSG

Recall that the expectation of a $\chi^2$-distributed random variable is equal to its degrees of freedom. Then we have
$$E[MSE] = E\left[\frac{\sigma^2}{\sigma^2} \cdot \frac{SSE}{N-k}\right] = \frac{\sigma^2}{N-k}\,E\left[\frac{SSE}{\sigma^2}\right] = \frac{\sigma^2}{N-k}\,(N-k) = \sigma^2.$$
If $H_0$ is true, then also
$$E[MSG] = E\left[\frac{\sigma^2}{\sigma^2} \cdot \frac{SSG}{k-1}\right] = \frac{\sigma^2}{k-1}\,E\left[\frac{SSG}{\sigma^2}\right] = \frac{\sigma^2}{k-1}\,(k-1) = \sigma^2.$$
That is, if $H_0$ is true, both $MSE$ and $MSG$ are unbiased estimators of $\sigma^2$, and so we expect their ratio to be around 1. If $H_0$ is not true, $MSE$ still estimates $\sigma^2$, but the differences between groups make $MSG$ bigger, so we expect $MSG/MSE > 1$. So, if $MSG$ is (significantly) large compared with $MSE$, this indicates evidence of differences between the groups. To make this precise, we have to find the distribution of $MSG/MSE$, which is the final step in deriving the ANOVA test.

Distribution of MSG/MSE

So far we know:
$$\frac{SSG}{\sigma^2} \sim \chi^2_{k-1} \quad \text{and} \quad \frac{SSE}{\sigma^2} \sim \chi^2_{N-k} \quad \text{independently}.$$
Using the definition of the F-distribution,
$$\frac{\frac{SSG}{\sigma^2}\Big/(k-1)}{\frac{SSE}{\sigma^2}\Big/(N-k)} \sim F_{k-1,\,N-k} \quad \Rightarrow \quad \frac{MSG}{MSE} \sim F_{k-1,\,N-k}.$$
Conclusion: if the ratio $MSG/MSE$ is big compared with the $F_{k-1,\,N-k}$ critical values, that is evidence of real differences between the groups. So $MSG/MSE$ is the test statistic for the ANOVA test. We summarise the steps of the ANOVA test in the following section.
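Under $H_0$, the claims $E[MSG] = \sigma^2$ and $MSG/MSE \sim F_{k-1,\,N-k}$ can be sanity-checked by simulation; a sketch in Python (illustrative choices: $k = 5$ balanced groups of size 9, $\sigma = 1$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
k, n, sigma, reps = 5, 9, 1.0, 20000
N = k * n

msg_vals, ratios = [], []
for _ in range(reps):
    y = rng.normal(0.0, sigma, size=(k, n))   # H0 true: all group means equal
    means = y.mean(axis=1)
    msg = n * ((means - y.mean()) ** 2).sum() / (k - 1)
    mse = ((y - means[:, None]) ** 2).sum() / (N - k)
    msg_vals.append(msg)
    ratios.append(msg / mse)

# E[MSG] should be sigma^2, and the 95th percentile of MSG/MSE should
# match the F_{k-1, N-k} critical value
print(np.mean(msg_vals), np.quantile(ratios, 0.95), stats.f.ppf(0.95, k - 1, N - k))
```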
ANOVA table

Since many calculations are needed before we arrive at the test statistic MSG/MSE, it is convenient to organise them in an ANOVA table. A general layout of the ANOVA table:

Source   DF      SS    MS            F
Groups   k − 1   SSG   SSG/(k − 1)   MSG/MSE
Error    N − k   SSE   SSE/(N − k)
Total    N − 1   SST

where SST = SSE + SSG, DF denotes degrees of freedom, SS denotes sum of squares, MS denotes mean square, k is the number of groups, N = n₁ + ⋯ + n_k is the total sample size, and SSG, SSE, SST are calculated using the formulas introduced earlier.

ANOVA hypothesis test

Use: to test whether several population means are all equal or whether some population means differ.

Assumptions: the groups are independent, the responses in each group come from normally distributed populations, and all groups have a common variance.

To carry out the ANOVA test we use the following steps:

1. State the hypotheses: the null H0 : µ1 = µ2 = ⋯ = µk versus the alternative H1 : the µi are not all equal.
2. Construct the ANOVA table, including the test statistic F = MSG/MSE.
3. Compare the value of the test statistic to the critical value in the F_{k−1, N−k} distribution tables for the relevant significance level. Note that the ANOVA test is a one-sided test. Alternatively, compute the p-value.
4. Decide whether or not to reject the null hypothesis at level α. As shown before, F ∼ F_{k−1, N−k}, so we use the critical values F_{k−1, N−k}(α) as follows: reject H0 at level α if F > F_{k−1, N−k}(α); otherwise we cannot reject H0 at level α. If we calculate the exact p-value then, of course, we use the general rule: if p ≤ α we reject H0 at level α, otherwise we fail to reject H0 at level α.
5. Interpret your conclusion practically, in the context of the question.

Notes:

• If you are not given α, find the lowest level at which H0 can be rejected and the highest level at which it cannot be rejected, following the rules in step 4 for using critical values.
• The ANOVA test is one-sided because it compares two estimators of variance, MSG and MSE: we test whether MSG is significantly bigger than MSE, which is why the test is one-sided.

10.3 ANOVA Examples and Terminology

10.3.1 ANOVA Example 1

We now solve the example from the beginning of this chapter about activity levels of rats on different doses of THC. We denote the mean activity levels of the placebo group and of the groups given doses of 0.1 µg, 0.5 µg, 1 µg and 2 µg by µ1, µ2, µ3, µ4, µ5 respectively. So we want to test H0 : µ1 = µ2 = µ3 = µ4 = µ5 versus H1 : the µi are not all equal. There are k = 5 groups, with the following summary statistics for each group:

       Placebo   0.1 µg    0.5 µg    1 µg      2 µg
n_i    10        9         9         8         9
x̄_i    34.0000   51.5556   60.3333   48.5000   36.4444
s_i    14.2984   19.3398   11.0680   18.3225   14.2926

Using all the responses, we compute the overall sample mean y·· = 45.8444. Next we compute SSG and SSE, and find SST using SST = SSG + SSE.

SSG = Σ_{i=1}^{k} n_i (y_i· − y··)²
    = 10 × (34.0000 − 45.8444)² + 9 × (51.5556 − 45.8444)² + 9 × (60.3333 − 45.8444)²
      + 8 × (48.5000 − 45.8444)² + 9 × (36.4444 − 45.8444)²
    = 4437.4701

For SSG we have k − 1 = 4 degrees of freedom.

SSE = Σ_{i=1}^{k} (n_i − 1) s_i²
    = 9 × 14.2984² + 8 × 19.3398² + 8 × 11.0680² + 7 × 18.3225² + 8 × 14.2926²
    = 9796.4514

For SSE we have N − k = 45 − 5 = 40 degrees of freedom. We present the calculations in an ANOVA table:

Source   DF   SS           MS          F
Groups   4    4437.4701    1109.3675   4.5297
Error    40   9796.4514    244.9113
Total    44   14233.9215

We used the following formulas: SST = SSE + SSG, MS = SS/DF, F = MSG/MSE. Now we compare F with critical values from the F_{4, 40} tables. From the tables we find F_{4, 40}(0.10) = 2.091, F_{4, 40}(0.05) = 2.606, F_{4, 40}(0.01) = 3.828. Since F = 4.5297 > F_{4, 40}(0.01) = 3.828, we reject H0 even at the 1% level. There is strong evidence of differences in activity levels of rats on different doses of THC.
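The Example 1 calculations can be reproduced from the group summary statistics alone. This is a Python sketch for illustration (the module itself uses R), with the numbers taken from the summary table above:

```python
# Rebuild the one-way ANOVA table for Example 1 from the group
# summary statistics (n_i, mean_i, sd_i). Python sketch; the module uses R.

n  = [10, 9, 9, 8, 9]
m  = [34.0000, 51.5556, 60.3333, 48.5000, 36.4444]
sd = [14.2984, 19.3398, 11.0680, 18.3225, 14.2926]

k = len(n)
N = sum(n)
grand = sum(ni * mi for ni, mi in zip(n, m)) / N   # overall mean y..

SSG = sum(ni * (mi - grand) ** 2 for ni, mi in zip(n, m))
SSE = sum((ni - 1) * si ** 2 for ni, si in zip(n, sd))
MSG = SSG / (k - 1)
MSE = SSE / (N - k)
F = MSG / MSE

print(round(grand, 4), round(SSG, 2), round(SSE, 2), round(F, 4))
# Matches the table above: SSG ≈ 4437.47, SSE ≈ 9796.45, F ≈ 4.53.
```

Working from the rounded summary statistics reproduces the table values to rounding accuracy.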
VIDEO (example)

10.3.2 Terminology and further notes

We introduce the standard terminology used in relation to ANOVA:

• Factor: a variable by which observations are classified, e.g. "dose of THC".
• Level: a particular value of a factor, e.g. "0.1 µg".
• Response: a quantitative variable which may be affected by the factor, e.g. "activity level".
• The aim of ANOVA is to test whether factors affect the response.
• Balanced design: the same number of observations in each sample. E.g. the ANOVA in the example about the activity levels of rats is unbalanced, since n1 = 10, n2 = 9, …

Examples:

1. In a greenhouse, the yield of tomatoes may depend upon three factors:
   – Variety
   – Type of soil
   – Type of fertilizer

• "Variety" may be "Yellow Pygmy", "Plum" or "Cherry" — a factor with three levels.
• "Type of soil" may be "Acid" or "Alkaline" — a factor with two levels.
• "Type of fertilizer" may be "Natural" or "Synthetic" — a factor with two levels.
• The response is "yield" (kg).

2. The amount of rot in a potato after 5 days' storage may depend upon the temperature and oxygen level during storage.

               Oxygen
Temp     2%    6%    10%
         13    10    15
10°C     11     4     3
          7
         26    22    20
16°C     19    18    24
         24            8

• Two factors: Temperature, Oxygen.
• The factor "Temperature" has two levels: 10°C, 16°C.
• The factor "Oxygen" has three levels: 2%, 6%, 10%.
• Response: amount of rot.
• From the table we can see that this is an unbalanced design.

Further notes about ANOVA:

• The factor is always treated as qualitative.
• In the example about activity levels of rats, dosage is a continuous variable, but we ignore the actual values (0, 0.1, 0.5, 1, 2) and treat the data as coming from 5 groups, which could just as well be called "Group A" to "Group E".
• Factors: qualitative.
• Response: quantitative.
• In this module we only deal with one factor at a time – called one-way ANOVA.
• When k = 2, ANOVA gives exactly the same answer as an independent two-sample t-test. This is because if T ∼ t_{n−2}, then F = T² ∼ F_{1, n−2}.
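The last note, that for k = 2 the ANOVA F statistic equals the square of the two-sample pooled t statistic, is easy to check numerically. A Python sketch with invented two-group data (the module itself uses R):

```python
# Check of the note above: for k = 2 groups, the one-way ANOVA F statistic
# equals the square of the two-sample pooled t statistic.
# Python sketch with invented data; the module uses R.
import math

g1, g2 = [1, 2, 3, 4], [3, 5, 7]
n1, n2 = len(g1), len(g2)

def var(g):
    """Unbiased sample variance with divisor n - 1."""
    m = sum(g) / len(g)
    return sum((x - m) ** 2 for x in g) / (len(g) - 1)

# Two-sample t statistic with pooled variance estimate s_p^2.
sp2 = ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)
T = (sum(g1)/n1 - sum(g2)/n2) / math.sqrt(sp2 * (1/n1 + 1/n2))

# One-way ANOVA F statistic with k = 2 (so MSG has 1 degree of freedom).
grand = (sum(g1) + sum(g2)) / (n1 + n2)
SSG = n1 * (sum(g1)/n1 - grand) ** 2 + n2 * (sum(g2)/n2 - grand) ** 2
MSE = sp2                      # with k = 2, MSE is the pooled variance
F = (SSG / 1) / MSE

print(round(F, 6), round(T ** 2, 6))   # identical values
```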
10.3.3 ANOVA Example 2

Below are data on energy consumption from households in different regions of the USA. We want to determine whether there are differences in mean energy consumption between the three regions.

        Northeast   Midwest   South
        15          17        11
        10          12        7
        13          18        9
        14          13        13
y_i·    13          15        10
s_i²    4.6667      8.6667    6.6667

We denote the mean energy consumption for Northeast, Midwest and South by µ1, µ2, µ3, respectively. So we want to test H0 : µ1 = µ2 = µ3 versus H1 : the µi are not all equal. In this case we have one factor, Region, with three levels (k = 3): Northeast, Midwest, South. This is a balanced design, n1 = n2 = n3 = 4, so there are N = 12 observations in total.

Ignoring the groups and treating the data as one sample of size N = 12, we calculate y·· = 12.6667 and s_T² = 10.0606. Now we calculate two of the sums of squares, and obtain the third using SST = SSG + SSE.

SST = (N − 1)s_T² = 11 × 10.0606 = 110.6666
SSE = Σ_{i=1}^{3} (n_i − 1)s_i² = 3 × 4.6667 + 3 × 8.6667 + 3 × 6.6667 = 60.0003

We construct the ANOVA table:

Source   DF   SS         MS        F
Groups   2    50.6663    25.3332   3.8000
Error    9    60.0003    6.6667
Total    11   110.6666

We compare F = 3.8000 with the F_{2, 9} tables. From the tables, we get the critical values F_{2, 9}(0.10) = 3.006, F_{2, 9}(0.05) = 4.256, F_{2, 9}(0.01) = 8.022. Since F_{2, 9}(0.10) = 3.006 < F = 3.8000 < F_{2, 9}(0.05) = 4.256, we can reject H0 at the 10% level, but not at the 5% level. There is some, but weak, evidence of differences in mean energy consumption between the three regions.

QUIZ

10.4 ANOVA Assumptions and How to Check Them

In this section we discuss how to check the ANOVA assumptions, and the robustness of the ANOVA test (if the assumptions do not hold, can we still use it?).

ANOVA assumptions

1. The samples are independent.
2. The responses are normally distributed, Y_ij ∼ N(µ_i, σ²).
3. Each group has the same variance.

Considering the ANOVA model Y_ij = µ_i + ε_ij, note that these assumptions are equivalent to saying that the errors are independent and normally distributed with the same variance, that is ε_ij ∼ N(0, σ²).
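Returning to Example 2, the ANOVA table can also be built from the raw observations rather than the summary statistics. A Python sketch for illustration (the module's own software is R), using the data from the table above:

```python
# One-way ANOVA for the energy-consumption data, computed from the raw
# observations. Python sketch for illustration; the module uses R.

groups = {
    "Northeast": [15, 10, 13, 14],
    "Midwest":   [17, 12, 18, 13],
    "South":     [11, 7, 9, 13],
}

data = [y for g in groups.values() for y in g]
N, k = len(data), len(groups)
grand = sum(data) / N                      # overall mean y..

SST = sum((y - grand) ** 2 for y in data)
SSG = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
SSE = SST - SSG                            # using SST = SSG + SSE
F = (SSG / (k - 1)) / (SSE / (N - k))
print(round(F, 4))                         # 3.8, as in the ANOVA table
```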
Assumption of independence

To see whether the groups are independent, we have to look at the context/design of the experiment. To ensure that the groups are independent, one needs to set up the experiment and randomise the sample selection correctly.

Assumption of equality of variances

There is an informal test (rule of thumb) to check this assumption: provided the largest standard deviation is less than twice the smallest standard deviation, methods based on the assumption of equal variances/standard deviations will be OK to use.

Example: We check informally whether the assumption of equal variances is satisfied for the example about the activity levels of rats on THC. We have:

• The largest standard deviation is s2 = 19.3398 and the smallest is s3 = 11.0680.
• The ratio is 19.3398/11.0680 = 1.7474 < 2, so it seems that the assumption of equal variances is satisfied.

Example: For the example about energy consumption of USA households, we have:

• The largest variance is s2² = 8.6667 and the smallest is s1² = 4.6667.
• The ratio of standard deviations is √8.6667/√4.6667 = 1.3628 < 2, so it seems that the assumption of equal variances is satisfied.

There are also more formal tests of the assumption of equal variances: Bartlett's test and Levene's test. These can be carried out by statistical software.

• Bartlett's test assumes that the data are normally distributed.
• Levene's test does not require normality.

We won't cover the theory of these formal tests in any more detail; just make sure you know what they are used for, how to perform them in R and how to interpret the results. (See Tutorial 8 to learn how to perform Bartlett's test and Levene's test in R.)

Assumption of normality

To check normality, we can plot a histogram for each group separately and see whether they look roughly like the symmetric bell shape of the normal distribution.
However, this is difficult to judge by eye; a better way is to draw a normal probability plot and perform a normality test for each group. In practice, instead of looking at normal probability plots of each group, we usually look at the normal probability plot of the residuals rather than the raw data. (Note that we assume the errors are normally distributed, and the residuals are estimates of the errors.)

Example: For the activity levels of rats on THC, let us check the normality assumption. Using R, we draw the normal probability plot of the residuals. We can see that the points are reasonably close to a straight line. Using R, we find the p-value for the normality test, which is p = 0.722. Recall that this tests H0 : data normal against H1 : data not normal. With p = 0.722, we cannot reject H0 at the 10% level, or the 20% level, or even the 70% level. There is no evidence against normality; it seems that the assumption of normality is satisfied.

Robustness of the ANOVA F-test

If the assumptions of normality or equal variances do not hold, the ANOVA F-test still gives approximately the correct answer.

• The ANOVA F-test is very robust against non-normality.
  – A large sample size helps – the Central Limit Theorem then implies that the sample means are approximately normal.
  – It is most robust for a balanced design (equal group sizes).
• The ANOVA F-test is very robust against heterogeneity of variances (variances not all equal).
  – It is most robust for a balanced design.

11 Post Hoc Tests

The ANOVA test only tells us whether the group means appear to be the same or whether some means differ. If there are significant differences between means, it does not tell us exactly where those differences lie. If we want to know which of the group means differ, we have to perform further tests. We can look at the individual means and box plots for an informal assessment. However, this is not a formal statistical test of differences between groups!
The formal statistical tests of differences between groups (performed after the ANOVA test shows significant differences between means) are called post-hoc tests.

11.1 Fisher's Least Significant Differences

Example: Energy Consumption

For the energy consumption data, we found weak evidence (10%) of differences. Looking at the boxplots, it appears that only the South region has considerably lower energy consumption, and seems to differ from Northeast and Midwest. To test this formally, we could compare each pair of regions, that is, perform 3 separate t-tests. The means are y1· = 13, y2· = 15 and y3· = 10, so the differences are

y1· − y2· = −2;  y1· − y3· = 3;  y2· − y3· = 5.

In an independent two-sample t-test, the standard deviation is estimated by the pooled estimate s_p. But here we have estimated the standard deviation as √MSE = √6.6667 = 2.582. MSE is based on all available observations, whereas s_p would use only the observations from two groups at a time. We know that SSE/σ² follows a χ²-distribution on N − k degrees of freedom. So we use the t-statistic

T = (y_i· − y_j·) / (√MSE √(1/n_i + 1/n_j)),

which follows a t-distribution with N − k degrees of freedom. The test statistics for the energy consumption data are

T12 = −2 / (2.582 √(1/2)) = −1.095;  T13 = 3 / (2.582 √(1/2)) = 1.643;  T23 = 5 / (2.582 √(1/2)) = 2.739.

We compare these with critical values from the t9-distribution: t9(0.05) = 1.833, t9(0.025) = 2.262, t9(0.01) = 2.821.

• For groups 1 and 2, we test H0 : µ1 = µ2 vs H1 : µ1 ≠ µ2. Since |−1.095| < t9(0.05) = 1.833, we cannot reject H0 at the 10% level. In other words, the difference between groups 1 and 2 is not significant even at 10%.
• For groups 1 and 3, we test H0 : µ1 = µ3 vs H1 : µ1 ≠ µ3. Since |1.643| < t9(0.05) = 1.833, we cannot reject H0 at the 10% level. In other words, the difference between groups 1 and 3 is not significant even at 10%.
• For groups 2 and 3, we test H0 : µ2 = µ3 vs H1 : µ2 ≠ µ3.
Since t9(0.025) = 2.262 < |2.739| < t9(0.01) = 2.821, we can reject H0 at the 5% level, but not at the 2% level. In other words, the difference between groups 2 and 3 is significant at 5%, but not at 2%. (Don't forget that we multiply the significance level of the critical values by 2, since we test whether group means are different, so these are two-sided tests.)

It appears that the reason we detect any differences between energy consumption in the three regions is the difference between Midwest (group 2) and South (group 3). There is no evidence of differences between the mean energy consumption of Northeast and Midwest, or Northeast and South. There is evidence of a difference between the mean energy consumption of Midwest and South. The post-hoc test performed above is called Fisher's Least Significant Differences (LSD).

11.2 Multiple tests

One should be careful when performing multiple tests, as performing many tests increases the overall significance level. If we perform K tests on independent samples, each at level α, then the probability of failing to reject H0 in all the tests is (1 − α)^K, and the probability of rejecting H0 in at least one of these tests (i.e. of making a Type I error) is

α_K = 1 − (1 − α)^K.

Values of α_K for different α and K:

K     α = 0.1   α = 0.05   α = 0.01
2     0.190     0.098      0.020
3     0.271     0.143      0.030
5     0.410     0.226      0.049
10    0.651     0.401      0.096
50    0.995     0.923      0.395

We can see that as K increases, the overall significance level α_K increases. Note that if there are k groups in ANOVA, then the number of pairs of means to compare is

K = (k choose 2) = k!/(2!(k − 2)!) = k(k − 1)/2.

For example, for 4 groups we need 6 tests, and for 10 groups we need 45 tests.

Corrections

To ensure that the overall test is at the required significance level, the significance level of the individual tests is usually adjusted. There are many different methods of 'correcting' the significance level.
The most common ones are

• the Bonferroni correction,
• Tukey's HSD (Honestly Significant Difference) test,
• Scheffé's test.

We will discuss only the first two here. These methods are available in most statistical packages.

11.3 Bonferroni correction

The significance level for each individual test is set as α* = α/K. If each individual test is performed at level α*, then

Pr(Type I error in at least one of the K tests) ≤ Σ_{i=1}^{K} Pr(Type I error in one test) = Σ_{i=1}^{K} α* = Σ_{i=1}^{K} α/K = α.

So in general the overall significance level is smaller than α.

Example: If α = 0.05 and there are K = 10 tests, then α* = 0.05/10 = 0.005 and 1 − (1 − α*)^K = 1 − 0.995¹⁰ = 0.0489 < 0.05.

Example: If α = 0.01 and there are K = 20 tests, then α* = 0.01/20 = 0.0005 and 1 − (1 − α*)^K = 1 − 0.9995²⁰ = 0.00995 < 0.01.

Example (Fisher's LSD with Bonferroni correction): For the energy consumption example, we compared all the regions in pairs. The test statistics for the independent-sample t-tests were

T12 = −1.095;  T13 = 1.643;  T23 = 2.739.

We performed 3 pair-wise comparisons. Applying the Bonferroni correction, we multiply the significance levels by 3 to adjust the overall significance level.

• For groups 1 and 2, we found that the difference is not significant even at 10%. Using the Bonferroni correction, the difference between groups 1 and 2 is not significant even at 30%.
• For groups 1 and 3, we found that the difference is not significant even at 10%. Using the Bonferroni correction, the difference between groups 1 and 3 is not significant even at 30%.
• For groups 2 and 3, we found that the difference is significant at 5%, but not at 2%. Using the Bonferroni correction, the difference between groups 2 and 3 is significant at the 15% level, but not at the 6% level.
There may be weak evidence of a difference between groups 2 and 3; however, we cannot identify this using the tables, since we would need the critical value at the 10/(2 × 3) = 1.6667% level, which is not in our tables. (We divided 10 by 2 to account for the two-sided test, and by 3 for the Bonferroni correction.) Using R, we can find the critical value t9(0.016667) = 2.5096. Since T23 = 2.739 > 2.5096, the difference between groups 2 and 3 (Midwest and South) is significant at the 10% level, but not at 6%.

There is no evidence of differences between the mean energy consumption of Northeast and Midwest, or Northeast and South. There is weak evidence of a difference between the mean energy consumption of Midwest and South.

11.4 Tukey's HSD

Tukey's Honestly Significant Difference (HSD) is the default method used in pair-wise comparisons in R. This method deals with the problem of inflating the overall significance level effectively, within the design of the method. R outputs the p-values for all pair-wise comparisons, already adjusted.

Note: In this module you don't need to know the theory behind Tukey's HSD. However, you need to know what Tukey's HSD is used for, how to perform it using R, and how to interpret the results.

Example: We perform Tukey's HSD for the energy consumption example, using R. (See Tutorial Sheet 8 for how to perform this test in R.) From R we get:

• For groups 1 and 2, the p-value is 0.5402.
• For groups 1 and 3, the p-value is 0.2779.
• For groups 2 and 3, the p-value is 0.0543.

We can see that the difference between groups 1 and 2 is not significant even at 54%, the difference between groups 1 and 3 is not significant even at 27%, and the difference between groups 2 and 3 is significant at 10%, but not at 5%. There is no evidence of differences between the mean energy consumption of Northeast and Midwest, or Northeast and South. There is weak evidence of a difference between the mean energy consumption of Midwest and South.
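The pairwise-comparison computations of this chapter can be reproduced in a few lines: Fisher's LSD t-statistics from MSE, the inflation of the overall level over K tests, and the Bonferroni adjustment. A Python sketch for illustration (the module performs these analyses in R), using the energy-consumption numbers from the text:

```python
# Post-hoc calculations for the energy-consumption example:
# (i) Fisher's LSD t-statistics from the common variance estimate MSE,
# (ii) the overall level alpha_K over K tests, (iii) Bonferroni correction.
# Python sketch; the module uses R.
import math

means = [13, 15, 10]        # Northeast, Midwest, South
ns = [4, 4, 4]
MSE = 6.6667                # from the ANOVA table; df = N - k = 9

def lsd_t(i, j):
    """Fisher's LSD t-statistic for groups i and j (0-based indices)."""
    se = math.sqrt(MSE) * math.sqrt(1 / ns[i] + 1 / ns[j])
    return (means[i] - means[j]) / se

T12, T13, T23 = lsd_t(0, 1), lsd_t(0, 2), lsd_t(1, 2)
print(round(T12, 3), round(T13, 3), round(T23, 3))   # -1.095 1.643 2.739

def overall_level(alpha, K):
    """alpha_K = 1 - (1 - alpha)^K for K independent tests at level alpha."""
    return 1 - (1 - alpha) ** K

print(round(overall_level(0.05, 10), 3))             # 0.401, as in the table

def bonferroni_overall(alpha, K):
    """Achieved overall level when each test is run at alpha* = alpha / K."""
    return 1 - (1 - alpha / K) ** K

print(round(bonferroni_overall(0.05, 10), 4))        # 0.0489 < 0.05
```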
Theory behind Tukey's HSD (not required)

If you are interested, this is how Tukey's HSD works (the theory of this method is not required in this module):

• Consider the ordered group means µ̂(1), …, µ̂(k) (µ̂(1) is the smallest mean and µ̂(k) the largest).
• Define the test statistic

T(1),(k) = (µ̂(k) − µ̂(1)) / [√(MSE/2) × √(1/n(1) + 1/n(k))],

where n(i) is the number of observations in the group corresponding to the ith smallest mean µ̂(i).
• This statistic is then compared with tables of q_{k, N−k} – the studentised range distribution with k and N − k degrees of freedom.
• This distribution is the distribution of the difference between the maximum and minimum of a set of i.i.d. normal random variables, normalised by an estimate of the standard deviation.
• If the statistic T(1),(k) is not significant, there is no point in comparing the other pairs.
• If it is significant, we proceed to compare µ̂(1) and µ̂(k−1), by calculating T(1),(k−1) and comparing it with q_{k−1, N−k}.

VIDEO

12 Simple Linear Regression

12.1 Simple Linear Regression Model and Estimating Parameters

Simple linear regression studies the relationship between a quantitative variable Y and a single quantitative variable x, where Y is the dependent or response variable and x is the independent or explanatory variable.

Example: The data below show eight people's cholesterol readings one and two days after a non-fatal heart attack.

Patient number (i)    1    2    3    4    5    6    7    8
Day 1 reading (x_i)   270  236  210  142  280  272  160  220
Day 2 reading (y_i)   218  234  214  116  200  276  146  182

First, we draw a scatterplot to assess whether there is a relationship between the two variables. If the points lie reasonably close to a straight line, we can fit a linear regression model. We know that a straight-line relationship is y = β0 + β1x, where β1 is the slope and β0 the intercept. But clearly the points are not exactly on a straight line, so we need to allow for random variation.
Since Y is continuous, the simplest assumption is that for each value of x, the individual responses Y are normally distributed with mean β0 + β1x. In other words, we assume that the means of the responses Y all lie on a line

µ_Y = E(Y) = β0 + β1x.

This line is called the population regression line. For each observed value x_i of the explanatory variable, we can write the mean of the response Y_i as E(Y_i) = β0 + β1x_i. Of course, not every observed response will equal the mean E(Y_i); there will be some variation, or error (denoted ε_i). So we can write

Y_i = β0 + β1x_i + ε_i,

which is the simple linear regression model.

VIDEO

The simple linear regression model

The simple linear regression model is

Y_i = β0 + β1x_i + ε_i  for i = 1, 2, …, n,

where

• ε_1, …, ε_n are independent with ε_i ∼ N(0, σ²). The variables ε_i are called the random errors.
• β0, β1 are unknown parameters to be estimated:
  – β0 is the intercept,
  – β1 is the slope.
• Y is the response variable (random), with observed values y_1, …, y_n.
• x_1, …, x_n are observed values of the explanatory variable (not random).

Note that Y_i ∼ N(β0 + β1x_i, σ²).

Once we have drawn a scatterplot, if we decide the points look scattered about a straight line, the next step is to find the best straight line. That is, we need to estimate β0 and β1 in the linear regression model. We do this using the Least Squares estimation method (as for the ANOVA model).

Least Squares estimation of the parameters β0, β1

We define the Sum of Squared Deviations as

S = Σ_{i=1}^{n} (y_i − β0 − β1x_i)²,

and our aim is to find the estimates β̂0, β̂1 which make S as small as possible. We differentiate S:

∂S/∂β0 = −2 Σ_{i=1}^{n} (y_i − β0 − β1x_i)
∂S/∂β1 = −2 Σ_{i=1}^{n} x_i (y_i − β0 − β1x_i)

Note that S is a quadratic function of β0, β1 with a unique minimum point and no local maximum point.
To find the minimum point, we set the derivatives to zero:

Σ_{i=1}^{n} (y_i − β0 − β1x_i) = 0    (1)
Σ_{i=1}^{n} x_i (y_i − β0 − β1x_i) = 0    (2)

and solve these two linear simultaneous equations in the two unknowns β0, β1. First, from equation (1), we get

(Σ_{i=1}^{n} y_i) − nβ̂0 − β̂1 (Σ_{i=1}^{n} x_i) = 0
⇒ nȳ − nβ̂0 − nx̄β̂1 = 0
⇒ ȳ − β̂0 − β̂1x̄ = 0
⇒ β̂0 = ȳ − β̂1x̄,

where x̄ = (1/n)Σx_i and ȳ = (1/n)Σy_i. Next, from equation (2), we obtain

(Σ_{i=1}^{n} x_iy_i) − β̂0 (Σ_{i=1}^{n} x_i) − β̂1 (Σ_{i=1}^{n} x_i²) = 0,

and we substitute for β̂0:

(Σ x_iy_i) − (ȳ − β̂1x̄)(Σ x_i) − β̂1 (Σ x_i²) = 0.

We collect the terms in β̂1 on the right-hand side:

(Σ x_iy_i) − ȳ (Σ x_i) = β̂1 [(Σ x_i²) − x̄ (Σ x_i)],

and multiply through by n:

n Σ x_iy_i − (Σ y_i)(Σ x_i) = β̂1 [n Σ x_i² − (Σ x_i)²].

We solve for β̂1 to get:

β̂1 = [n Σ x_iy_i − Σx_i Σy_i] / [n Σ x_i² − (Σ x_i)²].

To sum up, the Least Squares parameter estimates for Simple Linear Regression are

β̂1 = [n Σ x_iy_i − Σx_i Σy_i] / [n Σ x_i² − (Σ x_i)²],  β̂0 = ȳ − β̂1x̄,

where x̄ = (1/n)Σx_i and ȳ = (1/n)Σy_i.

Note: The formula for β̂0 simply says that the sample mean point (x̄, ȳ) lies on the fitted line, that is, ȳ = β̂0 + β̂1x̄.

Example: We now find the fitted line for the data in our example about cholesterol readings after a heart attack. The cholesterol data give the summary statistics

Σx_i = 1790, Σy_i = 1586, Σx_i² = 419244, Σy_i² = 332148, Σx_iy_i = 369968.

We have n = 8, so that x̄ = 223.75 and ȳ = 198.25. Substituting into the formula for the slope, we get

β̂1 = (8 × 369968 − 1790 × 1586) / (8 × 419244 − 1790²) = 0.806155,

and hence

β̂0 = 198.25 − 0.806155 × 223.75 = 17.873.

The fitted line is ŷ = 17.873 + 0.806x. That is,

Predicted Day 2 cholesterol = 17.873 + 0.806 × Day 1 cholesterol.

Point prediction

In the example about cholesterol readings, suppose we want to predict the Day 2 cholesterol for a patient with Day 1 reading x0 = 210. The Day 2 cholesterol reading for this patient is a random variable, Y0.
The mean is E[Y0] = β0 + β1x0, but we cannot use this as our prediction because it involves the unknowns β0, β1. So, to predict, we use

Ŷ0 = β̂0 + β̂1x0.

Recall that β̂0 = 17.873 and β̂1 = 0.806, so for the patient with x0 = 210 we have

ŷ0 = 17.873 + 0.806 × 210 = 187.133.

Ŷ0 is referred to as the fitted value at x0.

Fitted values and residuals

Notice that in our original cholesterol sample, patient 3 had Day 1 cholesterol reading x3 = 210. So for this patient we would predict Day 2 cholesterol ŷ3 = 187.133 (the fitted value). But in fact the Day 2 reading was y3 = 214. The difference is called the residual,

e3 = y3 − ŷ3 = 214 − 187.133 = 26.867.

Similarly, we can compute the residual e_i = y_i − ŷ_i for each patient i = 1, 2, …, 8.

QUIZ

12.2 Distributions of β̂0, β̂1

Estimating the parameters β0, β1 is useful, but not enough. We may also want to find confidence intervals for β0, β1 and to test hypotheses such as H0 : β1 = 0. For this, we need to know the distributions of β̂0, β̂1. Recall that β̂0, β̂1 are random variables, and a different sample would give different values for them. Recall that the estimate of the slope is

β̂1 = [n Σ x_iy_i − Σx_i Σy_i] / [n Σ x_i² − (Σ x_i)²],

and notice that the randomness is all in the y_i values – anything involving the x_i can be treated as a constant. Define the non-random constants

a_j = (n x_j − Σx_i) / [n Σ x_i² − (Σ x_i)²]  for j = 1, 2, …, n,

and observe that we can now write

β̂1 = Σ_{j=1}^{n} a_j Y_j.

VIDEO

Expectations of β̂0, β̂1

First, we look at the expectation of β̂1:

E[β̂1] = E[Σ a_j Y_j] = Σ a_j E[Y_j] = Σ a_j (β0 + β1x_j) = β0 (Σ a_j) + β1 (Σ a_j x_j).

Now

a_j = (n x_j − Σx_i) / [n Σ x_i² − (Σ x_i)²] = (x_j − x̄) / (Σ x_i² − n x̄²),

and so

Σ_{j=1}^{n} a_j = Σ_{j=1}^{n} (x_j − x̄) / (Σ x_i² − n x̄²) = [(Σ_j x_j) − n x̄] / (Σ x_i² − n x̄²) = 0.

Also,

Σ_{j=1}^{n} a_j x_j = Σ_{j=1}^{n} [(x_j − x̄) x_j] / (Σ x_i² − n x̄²) = [Σ_j x_j² − x̄ (Σ_j x_j)] / (Σ x_i² − n x̄²) = (Σ_j x_j² − n x̄²) / (Σ x_i² − n x̄²) = 1,

and hence

E[β̂1] = β0 × 0 + β1 × 1 = β1.

The estimator β̂1 is unbiased for β1.
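As an aside, the cholesterol-example computations above (the least-squares fit, the prediction at x0 = 210 and the residual for patient 3) can be checked numerically. A Python sketch for illustration (the module uses R):

```python
# Least-squares fit, point prediction and residual for the cholesterol
# example, using the closed-form estimates derived above.
# Python sketch for illustration; the module's software is R.

x = [270, 236, 210, 142, 280, 272, 160, 220]   # Day 1 readings
y = [218, 234, 214, 116, 200, 276, 146, 182]   # Day 2 readings
n = len(x)

Sx, Sy = sum(x), sum(y)                         # 1790, 1586
Sxy = sum(xi * yi for xi, yi in zip(x, y))      # 369968
Sxx = sum(xi * xi for xi in x)                  # 419244

b1 = (n * Sxy - Sx * Sy) / (n * Sxx - Sx ** 2)  # slope, about 0.806155
b0 = Sy / n - b1 * Sx / n                       # intercept, about 17.873

# Fitted value at x0 = 210 and the residual for patient 3 (y3 = 214).
y_hat = b0 + b1 * 210
print(round(b1, 6), round(b0, 3), round(214 - y_hat, 3))
```

With the full-precision slope, the fitted value at 210 is about 187.17 and the residual about 26.8; the text's 187.133 and 26.867 come from using the rounded slope 0.806.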
Next, consider the expectation of β̂0:

E[β̂0] = E[Ȳ − β̂1x̄] = E[Ȳ] − x̄ E[β̂1] = E[Ȳ] − x̄β1.

Now

E[Ȳ] = (1/n) Σ_{i=1}^{n} E[Y_i] = (1/n) Σ_i (β0 + β1x_i) = (1/n)(nβ0 + β1 Σx_i) = (1/n)(nβ0 + n x̄ β1) = β0 + x̄β1,

and hence

E[β̂0] = β0 + x̄β1 − x̄β1 = β0.

So the estimator β̂0 is unbiased for β0.

Expectation of the point prediction

Notice that

E[Ŷ0] = E[β̂0 + β̂1x0] = E[β̂0] + x0 E[β̂1] = β0 + β1x0 = E[Y0].

Our prediction Ŷ0 has the correct mean. (This is similar to being unbiased, except that the Y0 we are trying to predict is a random variable, not a parameter.)

Distribution of β̂1

Recall that for appropriate constants a_j we have β̂1 = Σ_{j=1}^{n} a_j Y_j, and we have shown that β̂1 is unbiased for β1. The variables Y_1, …, Y_n are independent and normally distributed. It follows that β̂1 is normally distributed, with mean β1. It remains to find Var[β̂1].

Variance of β̂1:

Var[β̂1] = Var[Σ_{j=1}^{n} a_j Y_j] = Σ_{j=1}^{n} a_j² Var[Y_j] = Σ_{j=1}^{n} a_j² σ² = σ² Σ_{j=1}^{n} a_j².

Recall

a_j = (n x_j − Σx_i) / [n Σ x_i² − (Σ x_i)²] = (x_j − x̄) / (Σ x_i² − n x̄²).

So

Σ_{j=1}^{n} a_j² = Σ_{j=1}^{n} (x_j − x̄)² / (Σ x_i² − n x̄²)² = [Σ_j (x_j − x̄)²] / (Σ x_i² − n x̄²)².

Now notice that

Σ_{j=1}^{n} (x_j − x̄)² = Σ_j (x_j² − 2x_j x̄ + x̄²) = (Σ x_j²) − 2x̄ (Σ x_j) + n x̄² = (Σ x_j²) − 2n x̄² + n x̄²
⇒ Σ_{j=1}^{n} (x_j − x̄)² = Σ_{j=1}^{n} x_j² − n x̄².

Hence

Σ a_j² = (Σ_j x_j² − n x̄²) / (Σ x_i² − n x̄²)² = 1 / (Σ_i x_i² − n x̄²).

We have now shown that

Var[β̂1] = σ² / (Σ_i x_i² − n x̄²).

Putting this together with the other results, we have

β̂1 ∼ N(β1, σ² / (Σ_i x_i² − n x̄²)).

If we knew the value of σ, we could use this to work out confidence intervals for the slope parameter β1, and to test hypotheses such as H0 : β1 = 0, using normal tables. We don't know σ, so we will need to estimate it (discussed later). We will then be able to work out confidence intervals and hypothesis tests using t-tables, instead of normal tables, to allow for the estimated standard deviation. Before this, we will work out the distribution of the intercept estimator β̂0.
Distribution of β̂0

Recall that

β̂0 = Ȳ − β̂1x̄ = (1/n) Σ_{i=1}^{n} Y_i − x̄ Σ_{i=1}^{n} a_i Y_i = Σ_{i=1}^{n} (1/n − a_i x̄) Y_i.

That is, β̂0 is a linear combination of the independent normal variables Y_i, and so β̂0 is itself normally distributed. We have shown that β̂0 is unbiased for β0, that is, E[β̂0] = β0. We still need to find Var[β̂0].

Variance of β̂0: We proceed similarly as for Var[β̂1]:

Var[β̂0] = Var[Σ_{i=1}^{n} (1/n − a_i x̄) Y_i]
         = Σ_{i=1}^{n} (1/n − a_i x̄)² Var[Y_i]
         = σ² Σ_{i=1}^{n} (1/n − a_i x̄)²
         = σ² Σ_{i=1}^{n} (1/n² − 2a_i x̄/n + a_i² x̄²)
         = σ² [n × (1/n²) − (2x̄/n) Σ_i a_i + x̄² Σ_i a_i²]
         = σ² [1/n − (2x̄/n) × 0 + x̄² × 1/(Σ_i x_i² − n x̄²)]
         = σ² [1/n + x̄² / (Σ_i x_i² − n x̄²)].

We have now shown

β̂0 ∼ N(β0, σ² [1/n + x̄² / (Σ_i x_i² − n x̄²)]).

12.3 Confidence Intervals for β0, β1

We have shown that the distributions of the estimators β̂0, β̂1 each have variance proportional to the unknown parameter σ². To compute confidence intervals or carry out hypothesis tests for β0, β1, we need to estimate σ².

Estimating the error variance σ²

In a similar way to ANOVA, we define the Error Sum of Squares

SSE = Σ_{i=1}^{n} (y_i − ŷ_i)² = Σ_{i=1}^{n} (y_i − β̂0 − β̂1x_i)².

It can be shown that SSE/σ² ∼ χ²_{n−2}. (Showing this is quite challenging and requires material outside the syllabus of this module.) Recall that the expectation of a chi-squared random variable equals its degrees of freedom, so

E[SSE/σ²] = n − 2.

Hence we define the Mean Square Error as

MSE = SSE / (n − 2).

Then we have

E[MSE] = E[SSE/(n − 2)] = E[(σ²/σ²) · SSE/(n − 2)] = [σ²/(n − 2)] E[SSE/σ²] = [σ²/(n − 2)] · (n − 2) = σ².

So σ̂² = MSE is an unbiased estimator of σ². The standard deviation estimate is σ̂ = √MSE.

Inference in linear regression

We have already shown that β̂1 ∼ N(β1, σ²/(Σx_i² − n x̄²)), so that

(β̂1 − β1) / [σ / √(Σx_i² − n x̄²)] ∼ N(0, 1),

and SSE/σ² ∼ χ²_{n−2}. It can be shown that SSE is independent of β̂0, β̂1 (outside the syllabus of this module).
Using the definition of the t-distribution, we get

{(β̂1 − β1) / [σ / √(Σx_i² − n x̄²)]} / √[(SSE/σ²) / (n − 2)] ∼ t_{n−2}.

Simplifying this expression, we get

(β̂1 − β1) / √[MSE / (Σx_i² − n x̄²)] ∼ t_{n−2}.

Recalling that σ̂ = √MSE, the estimated standard error of β̂1 is

e.s.e.(β̂1) = σ̂ / √(Σx_i² − n x̄²).

Going through the same calculations for β̂0, we get

(β̂0 − β0) / e.s.e.(β̂0) ∼ t_{n−2}  with  e.s.e.(β̂0) = σ̂ √[1/n + x̄²/(Σx_i² − n x̄²)].

These results will be used later for hypothesis testing in linear regression, and the estimated standard errors of β̂1 and β̂0 are used in confidence intervals.

Confidence intervals for β0, β1

Following the approach to deriving confidence intervals discussed in earlier chapters and using the above results, we deduce the confidence interval formulas for β0, β1. The 100(1 − α)% confidence interval for β1 is

β̂1 ± t_{n−2}(α/2) × e.s.e.(β̂1).

The 100(1 − α)% confidence interval for β0 is

β̂0 ± t_{n−2}(α/2) × e.s.e.(β̂0).

Notes:

• There are n − 2 degrees of freedom because we have n observations but had to estimate 2 parameters, β0 and β1, which reduces the degrees of freedom by 2.
• For calculation, a useful alternative formula for SSE is

SSE = Σ_{i=1}^{n} y_i² − (β̂0 Σ_{i=1}^{n} y_i + β̂1 Σ_{i=1}^{n} x_iy_i).

Example: The data below give the sale price y and floor area x for each of 10 houses sold during 1988 in Oxford, Ohio.

Sale price y ($1000)   Floor area x (square feet)
53.5                   1008
49.0                   1290
50.5                   860
49.9                   912
52.0                   1204
55.0                   1204
80.5                   1764
86.0                   1600
69.0                   1255
46.0                   864

Find the fitted line and the predicted sale price for a house of floor area x = 1000. Find also the 95% confidence interval for β1. The summary statistics are as follows:

Σx_i = 11961, Σy_i = 591.4, Σx_i² = 15143957, Σy_i² = 36786, Σx_iy_i = 740846.

Least Squares estimates:

β̂1 = [n Σx_iy_i − (Σx_i)(Σy_i)] / [n Σx_i² − (Σx_i)²] = (10 × 740846 − 11961 × 591.4) / (10 × 15143957 − 11961²) = 0.03997
β̂0 = ȳ − β̂1x̄ = 59.14 − 0.03997 × 1196.1 = 11.332

The line of best fit is ŷ = 11.332 + 0.03997x.
For a house of floor area x = 1000 square feet, the predicted sale price is

    ŷ = 11.332 + 0.03997 × 1000 = 51.302.

That is, the predicted sale price is $51302.

Interpretation of β̂1: Each 1 sq ft increase in floor area corresponds to an increase in expected sale price of 0.03997 × $1000 = $39.97.

To estimate the error variance, we calculate

    SSE = Σ yi² − ( β̂0 Σ yi + β̂1 Σ xi yi )
        = 36786 − (11.332 × 591.4 + 0.03997 × 740846) = 472.641,
    MSE = SSE/(n − 2) = 472.641/8 = 59.080,
    σ̂ = √MSE = √59.08 = 7.686.

For the slope parameter, the estimated standard error is

    e.s.e.(β̂1) = σ̂ / √(Σ xi² − n x̄²) = 7.686 / √(15143957 − 10 × 1196.1²) = 0.008399.

The 95% confidence interval for β1 is

    β̂1 ± t(n−2)(0.025) × e.s.e.(β̂1) = 0.03997 ± t8(0.025) × 0.008399
                                     = 0.03997 ± 2.306 × 0.008399
                                     = [0.0206, 0.0593].

Note: This confidence interval for β1 does not include zero, so if we were to test H0 : β1 = 0 against H1 : β1 ≠ 0 at the 5% significance level, we would reject H0 in favour of H1. There is evidence that floor area does have some effect upon sale price. Since β̂1 > 0, it appears that greater floor area is associated with higher sale price.

Exercise: Compute the 95% confidence interval for the intercept β0.

Example – further discussion: What is the predicted sale price for a house of floor area x = 5000 square feet? Using the fitted model, we get

    ŷ = 11.332 + 0.03997 × 5000 = 211.182.

So the predicted sale price is $211182. However, we should be very wary of using this figure. The observed data range from x = 860 sq ft up to x = 1764 sq ft, and x = 5000 is well outside this range. Even if the line ŷ = 11.332 + 0.03997x fits the observed data very well, one should not extrapolate a long way outside the observed range of x values, since the trend in the data may change for values outside our range.

12.4 Confidence Intervals for a Mean Response and Prediction Intervals

Suppose we use observed data (xi, yi) for i = 1, 2, . . .
, n to fit a regression line, and then we want to predict the response for a new value x0 of the explanatory variable. Recall that:

• we have Y0 = β0 + β1 x0 + ε0 with E[ε0] = 0,
• the mean response at x0 is E[Y0] = β0 + β1 x0,
• the predicted response is Ŷ0 = β̂0 + β̂1 x0.

The mean response β0 + β1 x0 is not a random variable; it is an unknown constant, so we can work out a confidence interval for it. The actual response Y0 is a random variable.

Confidence interval for a mean response

We showed before that Ŷ0 = β̂0 + β̂1 x0 is an unbiased estimator of the mean response β0 + β1 x0. In the same way as for β̂0 and β̂1, we can show that Ŷ0 is normally distributed, and we can work through some algebra to find a formula for its variance. It turns out that

    Var[Ŷ0] = σ² h00   where   h00 = 1/n + (x0 − x̄)²/(Σ xi² − n x̄²).

(Showing this is quite complicated and outside the syllabus of this module.) We need to estimate σ, using σ̂ = √MSE as before. Then a 100(1 − α)% CI for the mean response at x0 is

    [ Ŷ0 − t(n−2)(α/2) √(MSE × h00), Ŷ0 + t(n−2)(α/2) √(MSE × h00) ].

Example: We use the data on sale price y versus floor area x to find the 95% confidence interval for the mean response at x0 = 1000.

For a house with floor area x0 = 1000 sq ft, we found the predicted sale price ŷ0 = 51.302. Our estimate of the mean sale price of all houses with floor area x0 = 1000 is therefore also 51.302 thousand dollars. For these data, we also found MSE = 59.080. Now we calculate

    h00 = 1/n + (x0 − x̄)²/(Σ xi² − n x̄²) = 1/10 + (1000 − 1196.1)²/(15143957 − 10 × 1196.1²) = 0.1459.

The 95% confidence interval for the mean response at x0 = 1000 is

    51.302 ± t8(0.025) √(59.080 × 0.1459) = 51.302 ± 2.306 × √(59.080 × 0.1459) = [44.532, 58.072].

We are 95% confident that the mean sale price of houses of floor area 1000 square feet is between $44532 and $58072.

But what if we want to predict the sale price of one particular house of floor area x0 = 1000? A useful prediction should include a margin of error to indicate its accuracy.
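As an illustration only, the mean-response calculation above can be reproduced in Python, reusing the quantities already computed for the house price example (the fitted values and MSE are taken from the worked example, and t8(0.025) = 2.306 from tables):

```python
import numpy as np

# Quantities from the house price example in the notes
n, xbar = 10, 1196.1
sxx = 15143957 - n * xbar**2        # sum(x_i^2) - n*xbar^2
mse = 59.080
b0, b1 = 11.332, 0.03997
t8 = 2.306                          # t_8(0.025) from tables

x0 = 1000
y0_hat = b0 + b1 * x0               # predicted response at x0
h00 = 1 / n + (x0 - xbar)**2 / sxx  # leverage-type quantity
half = t8 * np.sqrt(mse * h00)      # half-width of the CI
print(y0_hat - half, y0_hat + half)  # approx (44.53, 58.07)
```

This reproduces the interval [44.532, 58.072] from the example up to rounding.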
The interval used to predict an individual observation is called a prediction interval.

Prediction intervals

Suppose we want to predict the response Y0 corresponding to x0. We have Y0 = β0 + β1 x0 + ε0. So far, we have worked out a confidence interval for the mean response β0 + β1 x0. If we want an interval for the particular response Y0, we need to add in the extra uncertainty from ε0. That is, the mean is unchanged (since E[ε0] = 0), but we add an extra σ² to the variance term (since Var[ε0] = σ²). The relevant variance becomes

    σ² (1 + h00).

Estimating σ by √MSE, a 100(1 − α)% prediction interval (PI) for a response at x0 is

    [ Ŷ0 − t(n−2)(α/2) √(MSE × (1 + h00)), Ŷ0 + t(n−2)(α/2) √(MSE × (1 + h00)) ].

Example: We want to find the 95% prediction interval for the sale price of an individual house with floor area x0 = 1000. We found before the predicted sale price ŷ0 = 51.302, MSE = 59.080 and h00 = 0.1459. So the 95% prediction interval is

    51.302 ± t8(0.025) √(59.080 × (1 + 0.1459)) = 51.302 ± 2.306 × √(59.080 × 1.1459) = [32.328, 70.276].

That is, for an individual house of floor area 1000 sq ft, we are 95% confident that the sale price will be between $32328 and $70276.

Notes about confidence and prediction intervals

• The PI for an individual observation is wider than the CI for a mean response.
• Averaging reduces uncertainty/variability.
• The width of the CI and PI depends on x0 only through the value of h00.
• Recall h00 = 1/n + (x0 − x̄)²/(Σ xi² − n x̄²), and notice that x0 appears only in the term (x0 − x̄)².
• Hence the further away from x̄ we go, the bigger h00 gets, and the wider the interval (CI or PI) becomes.
• Our prediction becomes more uncertain as we move away from the mean x̄ of our original observed data.

Example: Using the data in the previous example, compute a 99% PI for the sale price of (a) a house of floor area 1200 sq ft, (b) a house of floor area 1800 sq ft.

(a) With x0 = 1200, we have ŷ0 = 11.332 + 0.03997 × 1200 = 59.296.
Now we compute

    h00 = 1/10 + (1200 − 1196.1)²/(15143957 − 10 × 1196.1²) = 1/10 + 15.21/837404.9 = 0.1 + 0.0000182 = 0.1000182.

The 99% PI is

    59.296 ± t8(0.005) √(59.080 × (1 + 0.1000182)) = 59.296 ± 3.355 × √(59.080 × 1.1000182) = [32.249, 86.343].

(b) With x0 = 1800, we have ŷ0 = 11.332 + 0.03997 × 1800 = 83.278. Now we compute

    h00 = 1/10 + (1800 − 1196.1)²/(15143957 − 10 × 1196.1²) = 1/10 + 364695.21/837404.9 = 0.1 + 0.435506 = 0.535506.

The 99% PI is

    83.278 ± t8(0.005) √(59.080 × (1 + 0.535506)) = 83.278 ± 3.355 × √(59.080 × 1.535506) = [51.323, 115.233].

• For a house of floor area 1200 sq ft, close to the middle of the observed data (x̄ = 1196.1), we are 99% confident the sale price will be between $32249 and $86343, an interval of width $54094.
• For a house of floor area 1800 sq ft, far from the middle of the observed data, we are 99% confident the sale price will be between $51323 and $115233, an interval of width $63910.
• The second interval is wider, indicating more uncertainty.
• Further from the bulk of the data, we cannot be so confident about our prediction.
• Note also: a 99% PI will be wider than a 95% PI, because we want to be more certain of including the true answer.
• Suppose we want to predict the sale price of a house of floor area 5000 sq ft.
  – We could work out the PI; it would be wider than the PI at x = 1800, because we are even further from x̄.
  – But notice that the largest x-value in the observed data set was x = 1764.
  – As discussed before, it is not sensible to extrapolate well beyond the range of the observed data, so we shouldn't really trust any computed estimate or PI at x = 5000 anyway.

12.5 Hypothesis Testing in Linear Regression

In this section, we discuss how to perform tests to formally check whether the slope is non-zero, positive, or negative, and so whether there really is a linear relationship between x and y. We also discuss a test for the intercept. We have already seen that

    (β̂1 − β1) / e.s.e.(β̂1) ~ t(n−2)   and   (β̂0 − β0) / e.s.e.
(β̂0) ~ t(n−2), where

    e.s.e.(β̂1) = σ̂ / √(Σ xi² − n x̄²),   e.s.e.(β̂0) = σ̂ √( 1/n + x̄²/(Σ xi² − n x̄²) ).

So far we have used these results to work out confidence intervals; we can also use them for hypothesis testing.

Test for the slope

To test H0 : β1 = c versus an alternative H1 : β1 < c, H1 : β1 > c, or H1 : β1 ≠ c, for any given constant c, we compute the test statistic

    T = (β̂1 − c) / e.s.e.(β̂1)

and compare it with t(n−2) tables. The rejection rules are the same as for a one-sample t-test, only with the test statistic stated above.

• For the alternative H1 : β1 < c, we reject H0 at level α if T < −t(n−2)(α).
• For the alternative H1 : β1 > c, we reject H0 at level α if T > t(n−2)(α).
• For the alternative H1 : β1 ≠ c, we reject H0 at level α if T < −t(n−2)(α/2) or T > t(n−2)(α/2); in other words, if |T| > t(n−2)(α/2).

Remember to interpret your conclusion practically in the context of the question.

Test for the intercept

To test H0 : β0 = c versus an alternative H1 : β0 < c, H1 : β0 > c, or H1 : β0 ≠ c for any given constant c, we compute the test statistic

    T = (β̂0 − c) / e.s.e.(β̂0)

and compare with t(n−2) tables.

• For the alternative H1 : β0 < c, we reject H0 at level α if T < −t(n−2)(α).
• For the alternative H1 : β0 > c, we reject H0 at level α if T > t(n−2)(α).
• For the alternative H1 : β0 ≠ c, we reject H0 at level α if |T| > t(n−2)(α/2).

Remember to interpret your conclusion practically in the context of the question.

Example: For the house price data, we want to formally test whether sale price increases as floor area increases. That is, we test H0 : β1 = 0 versus H1 : β1 > 0. The test statistic is

    T = (β̂1 − 0) / e.s.e.(β̂1) = 0.03997 / 0.008399 = 4.759.

Now we compare this with t(n−2) tables. The relevant critical values are

    t8(0.05) = 1.860, t8(0.01) = 2.896, t8(0.001) = 4.501, t8(0.0005) = 5.041.

We have t8(0.001) = 4.501 < 4.759 < t8(0.0005) = 5.041. So we can reject H0 at the 0.1% level (but not at the 0.05% level).
That is, we have 0.0005 < p < 0.001. There is very strong evidence that an increase in floor area is associated with an increase in sale price.

Analysis of variance for regression

Another way to formally test whether the slope is non-zero, that is, to test H0 : β1 = 0 versus H1 : β1 ≠ 0, is the ANOVA test for regression. ANOVA also summarises information about the sources of variation in the data, which is useful for model assessment (discussed later).

First we define Sums of Squares, similarly to ANOVA. We define the Total Sum of Squares

    SST = Σ (yi − ȳ)²

with associated degrees of freedom n − 1.

• SST does not depend on the xi and is a measure of the variation of the variable Y around its mean.
• Notice that SST/(n − 1) is the sample variance of y1, y2, . . . , yn (if the mean of the yi is independent of the xi, that is, if β1 = 0).
• This means that under H0 : β1 = 0, SST/σ² ~ χ²(n−1).

We also define the Error Sum of Squares and the Mean Squared Error for linear regression as

    SSE = Σ (yi − ŷi)²,   MSE = SSE/(n − 2),

with associated degrees of freedom n − 2.

• SSE indicates how much of the variation in the observed yi values is left over after the regression model ŷi has been taken into account.
• We know that SSE/σ² ~ χ²(n−2).
• We showed before that MSE is an unbiased estimator of the variance σ².

Finally, we define the Regression Sum of Squares and the Regression Mean Square

    SSR = Σ (ŷi − ȳ)²,   MSR = SSR/1.

• SSR has df = 1 and SSR/σ² ~ χ²(1).
• The fitted values ŷi are predictions from the linear regression model.
• SSR indicates how much of the variation in the data can be explained by the regression relationship.

There is a relationship between the Sums of Squares:

    SST = Σ (yi − ŷi + ŷi − ȳ)² = Σ (yi − ŷi)² + Σ (ŷi − ȳ)² + 2 Σ (yi − ŷi)(ŷi − ȳ).

Using some algebraic simplification, we can show that Σ (yi − ŷi)(ŷi − ȳ) = 0.
Hence we have

    SST = SSR + SSE,

that is,

    Total variation = Variation explained by regression relationship + Unexplained residual (error) variation.

We can draw up an ANOVA table and compute an F-statistic (similar to the standard ANOVA). This is used to test H0 : β1 = 0 versus H1 : β1 ≠ 0.

    Source       DF      SS     MS            F
    Regression   1       SSR    SSR/1         MSR/MSE
    Error        n − 2   SSE    SSE/(n − 2)
    Total        n − 1   SST

with SST = SSE + SSR. We compare F = MSR/MSE with F(1, n−2) tables, since MSR/MSE ~ F(1, n−2). We reject H0 : β1 = 0 if F > F(1, n−2)(α). That is, we reject H0 if the variation explained by the regression relationship is large compared to the variation explained by pure randomness. In other words, we reject H0 if the regression relationship does usefully explain some of the variation in the data.

So, the ANOVA F-test for regression tests whether there really is a linear relationship between x and y, which is equivalent to testing H0 : β1 = 0 against H1 : β1 ≠ 0.

Notes about the ANOVA F-test for regression:

• The ANOVA F-test for regression is a one-sided test, for the same reason as the standard ANOVA test: we compare two estimators of variance, MSR and MSE, and we test whether MSR is significantly bigger than MSE.
• The ANOVA F-test for regression can only be used to test the alternative H1 : β1 ≠ 0, and it is equivalent to the two-sided t-test for the slope. These two tests are equivalent because F(1, n−2) ~ (t(n−2))².
• If you want to test the alternative H1 : β1 > 0 or H1 : β1 < 0, you need to use a one-sided t-test for the slope, as discussed earlier.

Example: For the house price data, we want to formally test whether there is a linear relationship between the floor area of houses and the sale price. We have already computed SSE = 472.641. Now

    SST = Σ (yi − ȳ)² = Σ yi² − n ȳ² = 36786 − 10 × 59.14² = 1810.604.
ANOVA table:

    Source       DF   SS        MS        F
    Regression   1    1337.963  1337.963  22.647
    Error        8    472.641   59.080
    Total        9    1810.604

Relevant critical values: F(1,8)(0.10) = 3.458, F(1,8)(0.05) = 5.318, F(1,8)(0.01) = 11.26. Since 22.647 > F(1,8)(0.01) = 11.26, we can reject H0 at the 1% level. There is strong evidence of a linear relationship between the floor area x and the sale price y (the linear regression explains a significant fraction of the variation in the observed y values). This is the same conclusion as we reached when testing H0 : β1 = 0 against H1 : β1 ≠ 0 using a t-test.

12.6 Model assessment

We might want to assess how useful the model is in explaining the data. This can be assessed informally using R² and formally tested using a t-test for the slope or the ANOVA F-test for regression.

R²

Recall:
• SSR measures how much of the variation in the y values is explained by the regression,
• SST measures the total amount of variation in the y values.

To measure the proportion of the total variation explained by the regression, we can use the ratio

    R² = SSR/SST.

Note that 0 ≤ R² ≤ 1, since 0 ≤ SSR ≤ SST implies 0 ≤ SSR/SST ≤ SST/SST = 1.

The R² value gives a useful quick assessment of how well the linear model fits, but it is not enough on its own to decide whether we have an adequate model for the data.

• If R² is close to 0, discard the linear regression and consider other models.
• If R² is close to 1, linear regression may be a good model (although this is no guarantee).

Example: For the house price data, we have SSR = 1337.963 and SST = 1810.604, so

    R² = 1337.963/1810.604 = 0.739.

We say that the regression explains 73.9% of the variation. This is quite a high percentage, suggesting the linear regression model may be adequate. However, we now need to do further assessments; next we examine residuals.

Residuals

Recall the linear regression model Yi = β0 + β1 xi + εi.
Once the model has been fitted, the residuals are defined by

    ei = yi − ŷi = yi − β̂0 − β̂1 xi

and the Error Sum of Squares is SSE = Σ ei². The Least Squares estimation method chooses β̂0, β̂1 to make SSE as small as possible.

The ei values are estimates of the εi. The standard assumption about the errors is that ε1, . . . , εn are independent with εi ~ N(0, σ²). To check this, we examine the residuals ei:

• Plot a histogram of the residuals to see if it looks like a normal distribution.
• More formally, do a normal probability plot and find the p-value for the normality test.

To see if the model is appropriate, plot the residuals against the fitted values (or plot the residuals against the explanatory variable).

• If the linear regression model is appropriate, there should be no pattern in the plots.

Example: House Sale Prices vs Floor Area

• The fitted line was ŷ = 11.332 + 0.03997x.
• We plot the residuals against the fitted values or against floor area.

The pattern in both plots is the same because of the linear relationship between the fitted values and floor area. It seems the residuals for small xi (or ŷi) are mostly positive, whereas for the mid range of values they are mostly negative. This suggests there might be some pattern in the data not captured by the linear regression. However, there are only 10 observations here, so it is difficult to know whether there really is any pattern.

Key steps in regression

1. Draw a scatterplot of the data; see if it looks like a straight line could usefully describe the relationship.
2. If so, compute the Least Squares parameter estimates β̂0, β̂1.
3. Assess whether the fitted model is adequate by
   • computing the R² value,
   • formally testing H0 : β1 = 0 against H1 : β1 ≠ 0,
   • examining the residuals to see if they are normally distributed,
   • plotting the residuals against the fitted values or explanatory variables to see if there are any patterns.
4.
We may then go on to
   • work out confidence intervals for β0, β1,
   • test other hypotheses, e.g. H0 : β0 = 0 against H1 : β0 ≠ 0,
   • compute prediction intervals for future response values Y0.

12.7 Multiple Linear Regression and Model Transformations

Multiple linear regression

Usually, a response Y depends on many things, not just a single explanatory variable x.

Example: The amount of fuel (tons of coal) required to heat a particular building for a week may depend upon the average temperature during the week, the average wind velocity, and the cloud cover.

    Yi = coal required (tons) during week i
    x1i = average temperature (°C)
    x2i = average wind velocity (mph)
    x3i = cloud cover (percent)

A linear model here would be

    Yi = β0 + β1 x1i + β2 x2i + β3 x3i + εi,

and one may test whether this provides a good fit to the data. In this context, the linear regression ANOVA table allows us to compute an F-value and test H0 : β1 = β2 = β3 = 0 against H1 : β1, β2, β3 not all zero. Alternatively, we can carry out t-tests for each of the parameters separately. For instance, to test whether cloud cover actually has any effect, we would test H0 : β3 = 0 against H1 : β3 ≠ 0.

We can get a rough idea of how well the model fits by computing R² = SSR/SST. R² gives a general idea of how well the model fits the observed data. Note that our testing methods won't reveal whether some strong non-linear relationship is present. We always need to look at scatterplots of yi against x1i, yi against x2i, and so on.

Model transformations

Suppose a linear model is not adequate. For example, in physics/chemistry/biology there are situations where the scientific theory suggests a relationship of the form

    y = a x^b

for unknown parameters a, b, with observed data (xi, yi) for i = 1, . . . , n. Taking logs, the relationship becomes

    log(y) = log(a) + b log(x).

So we define ui = log(xi), vi = log(yi), β0 = log(a) and β1 = b, and then v = β0 + β1 u.
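The log-log transformation above can be sketched in Python as an illustration. The data below are simulated (a = 2 and b = 1.5 are made-up values, not from the notes); fitting a straight line to (ui, vi) recovers the parameters of y = a x^b:

```python
import numpy as np

# Simulated data from y = a * x^b with multiplicative noise;
# a = 2, b = 1.5 are made-up values for illustration.
rng = np.random.default_rng(0)
x = np.linspace(1.0, 50.0, 40)
y = 2.0 * x**1.5 * np.exp(rng.normal(0.0, 0.1, x.size))

u, v = np.log(x), np.log(y)         # transform: v = log(a) + b*u
n = u.size
b1 = (np.sum(u * v) - n * u.mean() * v.mean()) / (np.sum(u**2) - n * u.mean()**2)
b0 = v.mean() - b1 * u.mean()

a_hat, b_hat = np.exp(b0), b1       # back-transform the intercept
print(a_hat, b_hat)  # close to a = 2, b = 1.5
```

Ordinary Least Squares on the transformed data recovers b directly as the slope, and a as the exponential of the intercept.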
We can then apply standard linear regression theory to the transformed data (ui, vi).

Polynomial regression

Suppose a scatterplot suggests a non-linear relationship, e.g. quadratic, and we want to fit the model

    yi = β0 + β1 xi + β2 xi² + εi.

We can define new explanatory variables

    x1i = xi,   x2i = xi²,

and we then have the Multiple Linear Regression model

    yi = β0 + β1 x1i + β2 x2i + εi.

Similarly for fitting cubic, quartic, or higher-order polynomials.

13 Non-parametric Tests

All tests we have discussed so far assume that data are normally distributed. However, not all data are normally distributed. There are a few things to consider when data are not normally distributed:

1. Most tests we have considered are robust to non-normality, so it is OK to use them provided:
   (a) the data are not too drastically far from a normal distribution,
   (b) the sample size is fairly large (usually n ≥ 30).
2. If the data values xi are not normally distributed, it may be that the log(xi) values are normally distributed, or √xi, or some other transformation of xi.
3. If data are not normally distributed, we can use distribution-free (non-parametric) tests.

Transformations – Example

The following data show the wages (dollars per hour) of 50 workers in the USA in 1985.

7.50, 4.25, 24.98, 8.50, 7.00, 5.50, 6.28, 7.61, 4.35, 10.00, 5.65, 20.00, 13.26, 8.85, 15.00, 4.50, 6.10, 3.60, 7.70, 10.00, 18.16, 7.00, 4.35, 10.00, 5.62, 10.00, 7.88, 7.65, 13.98, 12.50, 4.85, 4.25, 5.80, 6.00, 3.35, 3.40, 7.50, 6.50, 7.00, 3.35, 6.50, 6.00, 5.50, 6.75, 6.25, 3.65, 44.50, 3.75, 7.53, 3.50

A histogram of the raw wages shows that the data are clearly very right-skewed. We therefore try transforming the data by taking natural logs. The histogram of ln(Wage) looks a lot closer to normal (although still somewhat right-skewed).

To work out the 95% confidence interval for the mean wage, we use the ln(Wage) data.
The sample mean is 1.96 and the sample standard deviation is 0.5435, giving the 95% CI for the mean of ln(Wage) as

    1.96 ± t49(0.025) × 0.5435/√50 = [1.8056, 2.1145].

(Note that t49(0.025) was found using R.) The 95% CI for the mean wage is then

    [exp(1.8056), exp(2.1145)] = [6.0836, 8.2854].

We are 95% confident that the mean wage for the population is between $6.08 per hour and $8.29 per hour.

If we worked with the raw wage data directly (sample mean $8.474, sample standard deviation $6.822), we would get a different (and wrong) answer. The raw data are not normally distributed, so we cannot use the normal-based formula for a confidence interval.

13.1 The Sign test

If the data are not normally distributed, we can use distribution-free tests, also known as non-parametric tests. We now discuss a simple non-parametric test, the Sign test. The Sign test is used to test whether the population median is equal to some pre-specified value when the data are not normally distributed.

Assumptions for the Sign test:
• Data are independent of each other.
• Data come from some continuous distribution.

The Sign test – two-sided alternative

We consider a single sample x1, . . . , xn and denote by θ the population median. We want to test whether θ is equal to some pre-specified value θ0. That is, we test H0 : θ = θ0 against H1 : θ ≠ θ0.

Test procedure:
• Define
    S+ = number of xi values above θ0,
    S− = number of xi values below θ0,
  and then set S = min{S+, S−}, which is the test statistic.
• Values with xi = θ0 may be ignored. Denote by N the number of observations left after we ignore any with xi = θ0.
• Look up critical values in the table; we reject H0 at level α if S ≤ critical value at α/2.

Reasoning: If the true median is θ0, then S+ and S− should be roughly the same, each being approximately N/2. If either S+ or S− is too small, this is evidence against H0.

• Interpret your conclusion practically in the context of the question.
Example: It is claimed that the median waiting time at a doctor's surgery is 10 minutes. A random sample of waiting times of 12 patients was recorded.

Waiting times (minutes): 12, 1, 18, 2, 18, 40, 23, 11, 12, 18, 25, 20

We test H0 : θ = 10 versus H1 : θ ≠ 10. (That is, θ0 = 10 here.) For these data, we count S− = 2, S+ = 10. Hence S = min{10, 2} = 2. There are no observed times equal to 10 minutes, so N = 12 here.

From the table of critical values, for a two-sided test:
• At the 10% and 5% levels, the critical value is 2, so we reject H0 at the 5% level since S = 2 ≤ 2.
• At the 2% level, the critical value is 1, so we cannot reject H0 at the 2% level since S = 2 > 1.

Hence the p-value satisfies 0.02 < p ≤ 0.05. There is evidence to reject H0, although not strong evidence. There is evidence that the median waiting time differs from 10 minutes.

The Sign test – calculating the exact p-value

It is better if we can work out the p-value exactly, and for the Sign test this is quite straightforward. For any continuous distribution, each observation xi is above the median with probability 0.5 and below the median with probability 0.5 (that is what 'median' means). Observations xi are independent of each other. So if H0 is true, then

    S+ ~ Binomial(N, 0.5),   S− ~ Binomial(N, 0.5).

Recall the p-value is the probability of getting a value as extreme or more extreme than that observed, if the null hypothesis is true. So for the two-sided alternative,

    p = 2 × Pr(Binomial(N, 0.5) ≤ S).

Note the factor of 2: the test is two-sided, and the Binomial(N, 0.5) distribution is symmetrical.

Example: We calculate the exact p-value in the example about the doctor's waiting times. We found N = 12 and S = 2. Then

    p = 2 × Pr(Binomial(12, 0.5) ≤ 2)
      = 2 × ( C(12,0) × 0.5¹² + C(12,1) × 0.5¹² + C(12,2) × 0.5¹² )
      = 2 × (1 + 12 + 66) × 0.5¹²
      = 0.0386.

Using the tables of critical values, we found 0.02 < p ≤ 0.05; we can now be more precise and say that p = 0.0386.
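The exact binomial calculation above is easy to script. As an illustration only, a Python sketch for the waiting-times example:

```python
from math import comb

# Exact two-sided Sign-test p-value: N = 12 non-zero observations, S = 2
N, S = 12, 2
p = 2 * sum(comb(N, k) for k in range(S + 1)) * 0.5**N  # 2 * P(Bin(N, 0.5) <= S)
print(round(p, 4))  # 0.0386
```

This reproduces p = 0.0386 from the worked example.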
There is evidence that the median waiting time differs from 10 minutes.

The Sign test – one-sided alternative

For a one-sided alternative, we use either S+ or S− as appropriate, instead of S, as the test statistic.

• For H0 : θ = θ0 against H1 : θ < θ0, we reject H0 if S+ ≤ critical value.
• For H0 : θ = θ0 against H1 : θ > θ0, we reject H0 if S− ≤ critical value.

A small value of S+ indicates that not many observations lie above θ0, tending to support the hypothesis that the true median is less than θ0. On the other hand, a small value of S− indicates that not many observations lie below θ0, tending to support the hypothesis that the true median is greater than θ0.

Example: For the doctor's waiting time data, we now want to test whether the median waiting time is greater than 10 minutes. That is, we test H0 : θ = 10 versus H1 : θ > 10. The test statistic is S− = 2.

For a one-sided test with N = 12, the table gives 5% and 2.5% critical values of 2. Since S− = 2 ≤ 2, we can reject H0 at these levels. The 1% critical value is 1. Since S− = 2 > 1, we cannot reject H0 at the 1% level. That is, 0.01 < p ≤ 0.025. There is evidence that the median waiting time is greater than 10 minutes.

It is better to compute the exact p-value. For the one-sided test,

    p = Pr(Binomial(12, 0.5) ≤ 2) = 0.0193.

There is (fairly strong) evidence that the median waiting time is greater than 10 minutes.

The Sign test for paired data

More often, we want to compare two samples and see if they differ. Provided the data are paired, we can use the Sign test to compare medians of paired data when the paired differences are not normally distributed. As for the paired-sample t-test, we compute the difference between the two paired observations on each experimental subject, and then apply the one-sample Sign test to the sample of differences. Denoting by θd the median of the differences, we then test H0 : θd = 0 versus an alternative H1 : θd ≠ 0, H1 : θd > 0, or H1 : θd < 0.
Example: To test the fuel economy of Radial tyres (R) versus Belted tyres (B), twelve cars were each driven over the same test course, first with radial tyres and then again with belted tyres fitted. The petrol consumption in miles per litre was recorded each time. The aim is to test whether the new Radial tyres are better than the more traditional Belted tyres. The results were as follows.

    Car (i)   Radial R   Belted B   Difference R − B
    1         4.2        4.1         0.1
    2         4.7        4.9        −0.2
    3         6.6        6.2         0.4
    4         7.0        6.9         0.1
    5         6.7        6.8        −0.1
    6         4.5        4.4         0.1
    7         5.7        5.7         0.0
    8         6.0        5.8         0.2
    9         7.4        6.9         0.5
    10        4.9        4.9         0.0
    11        6.1        6.0         0.1
    12        5.2        4.9         0.3

We want to know if Radial tyres are better, i.e. give more miles per litre, so we test H0 : θd = 0 versus H1 : θd > 0, where d = R − B. For the Sign test statistic, we count how many of the observations go against H1; that is, how many cars have R − B < 0? For these data, we have S− = 2. If H1 is true, we would expect S− to be small; the question is whether it is small enough to be significant. Notice that the number of non-zero differences is N = 10.

We compute the p-value:

    p = Pr(Binomial(10, 0.5) ≤ 2)
      = ( C(10,0) + C(10,1) + C(10,2) ) × 0.5¹⁰
      = (1 + 10 + 45) × 0.5¹⁰
      = 0.0547.

We have 0.05 < p < 0.10, so we can reject H0 at the 10% level, but not at the 5% level. There is only weak evidence that Radial tyres are better than Belted.

13.2 Wilcoxon signed rank test

The Sign test only assumes that the data are independent observations from some continuous distribution. If we assume the data are independent observations from some continuous symmetrical distribution, we can use a more powerful test, the Wilcoxon signed rank test. Like the Sign test, this test can be applied to a single sample, or to compare two samples of paired data. We discuss the Wilcoxon signed rank test for two samples of paired data.

Denote by d1, . . . , dn the sample of differences, and by θd the population median difference.

Assumptions: The sample of paired differences d1, . . .
, dn come from some continuous symmetrical distribution, and the differences are independent of each other.

We test H0 : θd = 0 against H1 : θd ≠ 0. (We will consider one-sided alternatives later.)

Test procedure:
• Compute the sample differences d1, . . . , dn.
• Denote by N the number of observations left after we ignore any where the difference is zero.
• For the non-zero differences, ignoring whether positive or negative, rank their absolute values from smallest to largest.
• Add up the ranks of all the positive differences, denoted W+. Add up the ranks of all the negative differences, denoted W−.
• For the two-sided test, set W = min{W+, W−}, which is the test statistic.
• Look up the critical value in tables; reject H0 at level α if W ≤ critical value at α/2.
• Interpret your conclusion practically in the context of the question.

Example: We use the data from the previous example about fuel consumption for radial and belted tyres. We want to test whether there is a difference in fuel consumption between these two types of tyres. We use the Wilcoxon signed rank test and test H0 : θd = 0 against H1 : θd ≠ 0, where θd is the population median difference between the fuel consumption for radial and belted tyres.

We rank the differences:

    Car (i)   Radial R   Belted B   Difference R − B   Rank
    1         4.2        4.1         0.1               3
    2         4.7        4.9        −0.2               6.5
    3         6.6        6.2         0.4               9
    4         7.0        6.9         0.1               3
    5         6.7        6.8        −0.1               3
    6         4.5        4.4         0.1               3
    7         5.7        5.7         0.0               –
    8         6.0        5.8         0.2               6.5
    9         7.4        6.9         0.5               10
    10        4.9        4.9         0.0               –
    11        6.1        6.0         0.1               3
    12        5.2        4.9         0.3               8

The number of non-zero differences is N = 10.

    Sum of positive ranks: W+ = 3 + 9 + 3 + 3 + 6.5 + 10 + 3 + 8 = 45.5
    Sum of negative ranks: W− = 6.5 + 3 = 9.5
    Test statistic: W = min{45.5, 9.5} = 9.5

Critical values from tables, for the two-sided test:
• At the 10% level, the critical value is 10, so values of 10 or less are significant. Since W = 9.5 < 10, we reject H0 at the 10% level.
• At the 5% level, the critical value is 8, so values of 8 or less are significant. Since W = 9.5 > 8, we do not reject H0 at the 5% level.

We have 0.05 < p ≤ 0.10. There is weak evidence that the fuel consumption for radial and belted tyres is different.

Notes about the Wilcoxon signed rank test

Ties: When ranking the absolute differences, if there are ties, we average the ranks (see the tyres example above).

Arithmetic check: We should always have W+ + W− = 1 + 2 + · · · + N = N(N + 1)/2.

One-sided alternative: For a one-sided alternative, our test statistic is whichever of W+ or W− will be small if H1 is true.
• For H1 : θd < 0, we reject H0 at level α if W+ ≤ critical value at α.
• For H1 : θd > 0, we reject H0 at level α if W− ≤ critical value at α.

A single sample: For a single sample of observations x1, . . . , xn, to test H0 : θ = θ0 against an alternative, we find the differences xi − θ0, i = 1, . . . , n, and then follow the test procedure.

Example: In the tyres example, we originally wanted to see if radial tyres are better than belted. We now test this using the Wilcoxon signed rank test. We test H0 : θd = 0 against H1 : θd > 0. The test statistic is W− = 9.5. Recall N = 10, so the critical values for the one-sided test are:

• At the 5% level, the critical value is 10, so values of 10 or less are significant. Since W− = 9.5 < 10, we reject H0 at the 5% level.
• At the 2.5% level, the critical value is 8, so values of 8 or less are significant. Since W− = 9.5 > 8, we do not reject H0 at the 2.5% level.

We have 0.025 < p ≤ 0.05. There is evidence that radial tyres are better than belted.

Sign test versus Wilcoxon signed rank test

Recall that for the tyres data, the Sign test gave p = 0.0547. For the same data and the same hypotheses H0 : θd = 0 versus H1 : θd > 0, the Wilcoxon signed rank test gives 0.025 < p ≤ 0.05. (In fact, using R, we find p = 0.037.)
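The ranking step in the tyres example can be reproduced programmatically. As an illustration only (assuming SciPy is available; `rankdata` averages tied ranks, matching the rule above):

```python
from scipy.stats import rankdata

# Non-zero paired differences R - B from the tyre example (zeros dropped)
d = [0.1, -0.2, 0.4, 0.1, -0.1, 0.1, 0.2, 0.5, 0.1, 0.3]

ranks = rankdata([abs(di) for di in d])  # ranks of |d|, averaging ties
w_plus = sum(r for r, di in zip(ranks, d) if di > 0)
w_minus = sum(r for r, di in zip(ranks, d) if di < 0)
print(w_plus, w_minus)  # 45.5 9.5
```

This reproduces W+ = 45.5 and W− = 9.5, and the arithmetic check W+ + W− = N(N + 1)/2 = 55 holds.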
The Wilcoxon signed rank test therefore shows stronger evidence than the Sign test that radial tyres are better.

The difference arises because the Sign test only assumes that the data come from some continuous distribution, whereas the Wilcoxon test additionally assumes that this distribution is symmetric. The Sign test statistic uses only the signs of the differences; the Wilcoxon test also takes their magnitudes into account. Because it makes stronger assumptions and uses more information, the Wilcoxon test is more powerful: if H1 is true, the Wilcoxon test has a greater chance than the Sign test of correctly detecting this and rejecting H0 (at a given significance level).

13.3 Wilcoxon rank sum test (Mann-Whitney U-test)

Suppose now that we have two independent samples, which may be of different sizes.

Sample 1: x1, . . . , xn
Sample 2: y1, . . . , ym

Assumptions: both samples come from continuous distributions, the two distributions are similar in shape, and the samples are independent.

Denote by θX the median of population 1 and by θY the median of population 2. We test the null hypothesis H0 : θX = θY against one of the alternatives H1 : θX ≠ θY, H1 : θX > θY, or H1 : θX < θY.

Test procedure:

- Rank all n + m observed values from smallest to largest.
- Add up the ranks of the x values, denoted Rx, and the ranks of the y values, denoted Ry.
- Calculate
  Ux = Rx − n(n + 1)/2,  Uy = Ry − m(m + 1)/2.
- For the two-sided test, the test statistic is U = min{Ux, Uy}.
- For a one-sided test, the test statistic is whichever of Ux and Uy will be small if H1 is true.
- Look up critical values in tables:
  - for H1 : θX ≠ θY, reject H0 at level α if U ≤ the critical value at α/2;
  - for H1 : θX > θY, reject H0 at level α if Uy ≤ the critical value at α;
  - for H1 : θX < θY, reject H0 at level α if Ux ≤ the critical value at α.
Example: To investigate whether a new drug will arrest leukaemia, nine mice were selected from a group that had all reached an advanced stage of the disease. Five mice received the drug; four did not. Survival times (in years) from the time treatment started were as follows.

No treatment (x): 1.6, 0.5, 1.9, 0.8
Treatment (y): 2.1, 5.3, 1.4, 4.6, 0.9

Assume that the data do not follow a normal distribution. Denote the median survival times of untreated and treated mice by θx and θy, respectively.

Null and alternative hypotheses:

- H0: the drug does not arrest leukaemia, i.e. the median survival times are the same for treated and untreated mice: H0 : θx = θy.
- H1: the drug does arrest leukaemia, i.e. the median survival time for treated mice is greater than for untreated mice: H1 : θx < θy.

This is a one-sided alternative. If treated mice survive longer, then the x values will tend to be smaller than the y values, so Rx will be small. The appropriate test statistic is therefore Ux.

We rank the data:

Rank    x     |    y    Rank
 5     1.6    |   2.1    7
 1     0.5    |   5.3    9
 6     1.9    |   1.4    4
 2     0.8    |   4.6    8
              |   0.9    3
Rx = 14       |  Ry = 31

We compute Ux = Rx − n(n + 1)/2 = 14 − (4 × 5)/2 = 4.

From tables, with n = 4, m = 5, for a one-sided test:

- At the 10% level, the critical value is 4, so values of 4 or less are significant. Since Ux = 4 ≤ 4, we reject H0 at the 10% level.
- At the 5% level, the critical value is 2, so values of 2 or less are significant. Since Ux = 4 > 2, we do not reject H0 at the 5% level.

We have 0.05 < p ≤ 0.10. There is only weak evidence that the drug arrests leukaemia; in other words, there is weak evidence that the median survival time for treated mice is greater than for untreated mice. (R gives p = 0.089.)

Notes about the Wilcoxon rank sum test

Ties: When ranking the data, the rule for ties is the same as for the Wilcoxon signed rank test: average the ranks.

Arithmetic check: Rx + Ry = 1 + 2 + · · · + (n + m) = (n + m)(n + m + 1)/2.
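The rank sum procedure can likewise be sketched in Python. Again this is an illustrative sketch rather than course material; the function name is chosen here for clarity.

```python
def rank_sum_U(x, y):
    """Mann-Whitney statistics Ux and Uy for two independent samples.

    All n + m values are ranked together (ties averaged), then
    Ux = Rx - n(n+1)/2 and Uy = Ry - m(m+1)/2.
    """
    combined = sorted(x + y)                   # pooled sample, smallest first

    def avg_rank(v):
        # Average of the 1-based positions at which v occurs
        positions = [i + 1 for i, a in enumerate(combined) if a == v]
        return sum(positions) / len(positions)

    r_x = sum(avg_rank(v) for v in x)          # Rx
    r_y = sum(avg_rank(v) for v in y)          # Ry
    n, m = len(x), len(y)
    u_x = r_x - n * (n + 1) / 2
    u_y = r_y - m * (m + 1) / 2
    return u_x, u_y

# Mice survival data from the example
no_treatment = [1.6, 0.5, 1.9, 0.8]
treatment = [2.1, 5.3, 1.4, 4.6, 0.9]
u_x, u_y = rank_sum_U(no_treatment, treatment)
print(u_x, u_y)   # 4.0 16.0
```

As a check, Ux + Uy = 4 + 16 = 20 = nm, as it should for n = 4, m = 5.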
Terminology: The Wilcoxon rank sum test is also called the Mann-Whitney U-test.

13.4 Parametric versus non-parametric tests

Parametric tests: the Normal (Z) test, t-tests, and F-tests. All are based on the assumption that the data are drawn from a normal distribution.

Non-parametric tests: the Sign test, the Wilcoxon signed rank test, and the Mann-Whitney test. These tests do not require such strong distributional assumptions.

If we know the data are (approximately) normally distributed, it is best to use the appropriate parametric test, because the parametric test has higher power: for a given significance level, provided the data are normally distributed, if H1 is true the parametric test has a greater chance than the non-parametric test of correctly detecting this and rejecting H0.