
Date: 2021-02-25

MT2300

2020/2021

1. Introduction and Preliminaries

Reading

Faraway: Chapter 1

Krzanowski: Chapter 1

Kleinbaum et al: Chapter 4

Frees: Sections 1.1–1.3

Mendenhall et al: Chapter 2

1.1 Motivation

Terminology

We deal with measurements of several variables for each of $n$ experimental units or individuals. The variables are of two types (though the distinction between them is not always rigid in applications): those of primary interest to the investigator and those which might provide supplementary or background information. Variables of the former type are called response, outcome or dependent variables, while those of the latter type are called explanatory, independent or predictor variables. Econometricians also use the terms endogenous and exogenous to distinguish the two types of variables. The explanatory variables are used to predict or to understand the response variables.

Relation between variables, models

We distinguish between a functional relation and a statistical relation. The functional relation between the independent variable $X$ and the dependent variable $Y$ is often expressed as a mathematical formula

$$Y = f(X),$$

and the main feature of this relation is that the observations $(x_i, y_i)$ ($i = 1, \dots, n$) fall directly on the “curve” of the relationship, that is, on the curve $y = f(x)$.

A statistical relation, unlike a functional relation, is not a “perfect” one. Very often explanatory variables are thought of as fixed, and response variables are thought of as random variables with a distribution depending on the explanatory variables. Therefore, for each value $x$ of an explanatory variable the response $Y$ may be supposed to be a random variable with expectation (mean value) $f(x) = E(Y \mid X = x)$, or $E(Y \mid x)$ for short. A statistician may then wish to determine the function $f$ using sample data consisting of pairs $(x_i, y_i)$ ($i = 1, \dots, n$). The function $f$ is called the regression function for regressing $Y$ on $X$, and $X$ is called the regressor.

The regression function $f(x)$ represents the systematic component of the model, which is concerned with overall population features such as expected values. To emphasize the existence of the random component of the model, the response is often written in the form

$$Y = f(x) + \varepsilon,$$

where $f$ is the regression function, that is, the systematic component, and $\varepsilon$ is the random component. In most applications $\varepsilon$ is assumed to be a normal random variable with mean zero and variance $\sigma^2$, that is, $\varepsilon \sim N(0, \sigma^2)$.


Linear statistical model

The systematic component $f$ is often expressed in terms of explanatory variables through a parametric equation. If, for example, it is supposed that

$$f(x) = A + Bx + Cx^2$$

or

$$f(x) = A\,2^x + B$$

or

$$f(x) = A \log x + B,$$

then the problem is reduced to one of identifying a few parameters, here labeled $A$, $B$, $C$. In each of the three forms given above, $f$ is linear in these parameters.

For example, $A\,2^x + B$ can be written as $f(x, \beta) = g(x)^T \beta$, where $g(x)^T = (1, 2^x)$ is known as the transformed input and $\beta^T = (B, A)$ is the vector of model parameters. Similarly, $A \log x + B$ can be written as $g(x)^T \beta$, where the transformed input is $g(x)^T = (1, \log x)$ and the vector of model parameters is again $\beta^T = (B, A)$.

It is the linearity in the parameters that makes the model a linear statistical model!
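To make the role of the transformed input concrete, here is a minimal Python sketch (standard library only) that fits $f(x) = A\,2^x + B$ by least squares. The helper names `g` and `fit_two_params` and the noise-free data are hypothetical and chosen only for illustration; the point is that, although $f$ is not linear in $x$, the fit only involves linear equations in $(B, A)$.

```python
# Transformed input g(x) = (1, 2**x); the model is f(x) = B*1 + A*2**x.
def g(x):
    return (1.0, 2.0 ** x)

def fit_two_params(xs, ys):
    """Least squares for y ~ b0*g0(x) + b1*g1(x) via the 2x2 normal equations."""
    rows = [g(x) for x in xs]
    s00 = sum(r[0] * r[0] for r in rows)
    s01 = sum(r[0] * r[1] for r in rows)
    s11 = sum(r[1] * r[1] for r in rows)
    t0 = sum(r[0] * y for r, y in zip(rows, ys))
    t1 = sum(r[1] * y for r, y in zip(rows, ys))
    det = s00 * s11 - s01 * s01
    b0 = (s11 * t0 - s01 * t1) / det
    b1 = (s00 * t1 - s01 * t0) / det
    return b0, b1  # estimates of (B, A)

xs = [0, 1, 2, 3, 4]
ys = [3.0 * 2.0 ** x + 1.0 for x in xs]   # hypothetical noise-free data, A = 3, B = 1
B_hat, A_hat = fit_two_params(xs, ys)
print(B_hat, A_hat)
```

With noise-free data the fitted values should reproduce the parameters used to generate the data, up to rounding error.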

1.2 The model of measurements, revision of Year 1 Statistics.

Let $\mu$ be an unknown quantity of interest which can be measured with some error. A mathematical (statistical) model for this experiment is specified by the model equation $Y = \mu + \varepsilon$, where $Y$ is the available measurement (observation) and $\varepsilon$ is a random error modelled as a random variable with zero mean, say, normally distributed with variance $\sigma^2$, i.e. $\varepsilon \sim N(0, \sigma^2)$. By properties of the normal distribution, $Y \sim N(\mu, \sigma^2)$. Suppose that we have $n$ measurements $Y_i = \mu + \varepsilon_i$, where the errors $\varepsilon_i \sim N(0, \sigma^2)$ ($i = 1, \dots, n$) are independent. It follows that $Y_1, \dots, Y_n$ are also independent random variables with $Y_i \sim N(\mu, \sigma^2)$. In other words, $Y_1, \dots, Y_n$ is a random sample from a normally distributed population with mean $\mu$ and variance $\sigma^2$, so that the problem of estimating the unknown quantity $\mu$ is the well-known (from Year 1 statistics) problem of estimating the mean of a normal population. The sample mean

$$\bar{Y} = \frac{1}{n}(Y_1 + \dots + Y_n) = \frac{1}{n}\sum_{i=1}^{n} Y_i$$

is usually used as a point estimator of $\mu$.

In MT1300 we briefly stated that there are several general methods of obtaining point estimators. In this course we are going to use one of these methods, namely the method of least squares (LS). To demonstrate the main idea of this method, let us consider the model of measurements. Given observations $Y_1, \dots, Y_n$, define the function

$$S(\mu) = \sum_{i=1}^{n} (Y_i - \mu)^2.$$

The value of $\mu$ that minimises $S(\mu)$ is called the least squares estimator of $\mu$. We can find the point of minimum of $S(\mu)$ by equating to zero the first derivative of $S(\mu)$:

$$S'(\mu) = -2\sum_{i=1}^{n}(Y_i - \mu) = -2\left(\sum_{i=1}^{n} Y_i - n\mu\right) = 0.$$

It is easy to see that the solution of this equation is $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$, the sample mean, and that this is the point of minimum, as the second derivative of $S(\mu)$ at $\mu = \bar{Y}$ is $2n > 0$. There is also a direct way to see that the sample mean is the point of minimum and, hence, the least squares estimator of $\mu$. Indeed,

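$$\begin{aligned}
S(\mu) &= \sum_{i=1}^{n}\left(Y_i^2 - 2Y_i\mu + \mu^2\right) = \sum_{i=1}^{n} Y_i^2 - 2n\mu\bar{Y} + n\mu^2 \\
&= -2n\mu\bar{Y} + n\mu^2 + n\bar{Y}^2 + \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 \\
&= n(\mu - \bar{Y})^2 + \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 \;\ge\; \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2,
\end{aligned}$$

where the inequality becomes an equality if and only if $\mu = \bar{Y}$. Note finally that

$$S(\bar{Y}) = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = (n-1)s^2,$$

where $s^2$ is the sample variance, which is the point estimator of the other model parameter $\sigma^2$ (see the next section).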
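The two facts derived above can be checked numerically. The following Python sketch uses a small hypothetical sample (not data from these notes): it evaluates $S(\mu)$ on a grid around the sample mean and compares $S(\bar{y})$ with $(n-1)s^2$. The grid search is only an illustration, not a substitute for the algebraic argument.

```python
import statistics

def S(mu, ys):
    """Sum of squared deviations of the observations from mu."""
    return sum((y - mu) ** 2 for y in ys)

ys = [4.1, 5.3, 4.8, 5.9, 5.0, 4.4]   # hypothetical measurements
n = len(ys)
y_bar = statistics.fmean(ys)
s2 = statistics.variance(ys)           # sample variance, divisor n - 1

# S(mu) on a grid around the sample mean: no grid value should beat S(y_bar).
grid = [y_bar + k * 0.01 for k in range(-200, 201)]
best_mu = min(grid, key=lambda mu: S(mu, ys))

print(y_bar, best_mu)
print(S(y_bar, ys), (n - 1) * s2)
```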


1.3 Parametric statistical inference, brief revision of Year 1 background

Reading

Krzanowski: Chapter 2

Kleinbaum et al: Chapter 3

Frees: Chapter 2

Mendenhall et al: Chapter 1

Newbold: Chapter 9

The process of making statements about population characteristics/parameters given

only information from samples is known as parametric statistical inference.

Example 1 A mechanical jar filler for filling jars with coffee does not fill every jar with the same quantity. The weight $Y$ of coffee filled in a jar is a random variable which can be assumed to be normally distributed with mean $\mu$ and variance $\sigma^2$, that is, $Y \sim N(\mu, \sigma^2)$.

Suppose that we have a sample of $n$ independent measurements on $Y$ and wish to “identify” the parameters $(\mu, \sigma^2)$ of the population.

The sort of statements that we wish to make about parameters will often fall into one

of the following three categories:

• Point estimation;

• Interval estimation;

• Hypotheses testing.

1.3.1 Point estimation

Point estimation is the aspect of statistical inference in which we wish to find the “best guess” of the true value of a population parameter.

Suppose that $Y_1, Y_2, \dots, Y_n$ is a random sample from the population of interest. Then an estimator of an unknown parameter $\theta$ is some function of the observations $Y_1, Y_2, \dots, Y_n$, that is,

$$\hat{\theta} = \hat{\theta}(Y_1, Y_2, \dots, Y_n)$$

(which is in some sense a “good approximation” to the unknown parameter $\theta$).

Example 1 (continued) A point estimator of $\mu$ in a $N(\mu, \sigma^2)$ population is provided by the sample mean $\bar{Y}$, which is defined by

$$\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{n}(Y_1 + Y_2 + \dots + Y_n).$$

To estimate $\sigma^2$ in a $N(\mu, \sigma^2)$ population we generally use the sample variance $s^2$, defined by

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2.$$

It is easy to see that

$$s^2 = \frac{1}{n-1}\left(\sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2\right).$$
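As a quick check that the two expressions for $s^2$ agree, the sketch below computes both for a small hypothetical sample and compares them with `statistics.variance` from the Python standard library, which also uses the divisor $n - 1$. The function names are illustrative only.

```python
import statistics

def sample_variance_definition(ys):
    """s^2 from the definition: squared deviations from the mean, divided by n - 1."""
    n = len(ys)
    y_bar = sum(ys) / n
    return sum((y - y_bar) ** 2 for y in ys) / (n - 1)

def sample_variance_shortcut(ys):
    """s^2 from the shortcut formula: (sum of squares - n * mean^2) / (n - 1)."""
    n = len(ys)
    y_bar = sum(ys) / n
    return (sum(y * y for y in ys) - n * y_bar ** 2) / (n - 1)

ys = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]   # hypothetical sample
print(sample_variance_definition(ys))
print(sample_variance_shortcut(ys))
print(statistics.variance(ys))
```

In floating-point arithmetic the shortcut formula subtracts two large, nearly equal numbers when the mean is large relative to the spread, so the definition is usually the safer choice in software.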


Properties of Estimators

Let $\hat{\theta} = \hat{\theta}(Y_1, Y_2, \dots, Y_n)$ be an estimator of an unknown parameter $\theta$. To clarify in what sense $\hat{\theta}$ is a “good approximation” to $\theta$ we consider estimators which are (1) unbiased and (2) mean square consistent.

(1) $\hat{\theta}$ is said to be an unbiased estimator of $\theta$ if $E(\hat{\theta}) = \theta$.

Example 1 (continued) $\bar{Y}$ is an unbiased estimator of $\mu$ and $s^2$ is an unbiased estimator of $\sigma^2$.

To check whether we have a sensible estimator we need to ensure that $\hat{\theta}$ is increasingly likely to yield the right answer $\theta$ as the sample size $n$ gets bigger. The mean square error (MSE) of $\hat{\theta}$ is defined to be $E[(\hat{\theta} - \theta)^2]$. Since the MSE of $\hat{\theta}$ is the average squared distance of $\hat{\theta}$ from the true value $\theta$, a good estimator is one with a small MSE.

(2) $\hat{\theta}$ is said to be a mean square consistent estimator of $\theta$ if

$$\mathrm{MSE}(\hat{\theta}) \to 0 \quad \text{as } n \to \infty.$$

Note that $\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + (E(\hat{\theta}) - \theta)^2$, so an unbiased estimator $\hat{\theta}$ is mean square consistent if $\mathrm{Var}(\hat{\theta}) \to 0$ as $n \to \infty$.
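Mean square consistency of the sample mean can be illustrated by simulation. The sketch below is a Monte Carlo illustration under assumed (hypothetical) values of $\mu$ and $\sigma$: it estimates $E[(\bar{Y} - \mu)^2]$ for several sample sizes and prints it next to the theoretical value $\sigma^2/n$. The function name and the number of repetitions are arbitrary choices.

```python
import random

def mse_of_sample_mean(n, mu, sigma, reps, rng):
    """Monte Carlo estimate of E[(Ybar - mu)^2] for normal samples of size n."""
    total = 0.0
    for _ in range(reps):
        y_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        total += (y_bar - mu) ** 2
    return total / reps

rng = random.Random(0)          # fixed seed so the run is repeatable
mu, sigma = 10.0, 2.0           # hypothetical population parameters
results = {n: mse_of_sample_mean(n, mu, sigma, 4000, rng) for n in (5, 20, 80)}
for n, mse in results.items():
    print(n, round(mse, 4), sigma ** 2 / n)   # simulated MSE vs sigma^2 / n
```

The simulated values should shrink roughly in proportion to $1/n$, in line with $\mathrm{Var}(\bar{Y}) = \sigma^2/n$.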

1.3.2 Interval estimation

Point estimation is often not sufficiently informative as it does not say anything about

the error of the estimation procedure. Naturally, if the error is large, then we are less

confident in our estimate. Replacing a point estimator by an interval estimator allows

us to quantify the uncertainty of estimation by specifying a desirable level of confidence,

which is the probability of the interval capturing the true value of the parameter. Such

interval estimators are known as confidence intervals (C.I.).

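Example 1 (continued) To construct a confidence interval for $\mu$ we recall that $\bar{Y} = \sum_{i=1}^{n} Y_i / n$ is a linear combination of independent $N(\mu, \sigma^2)$ random variables and is therefore normally distributed with mean $\mu$ (unbiased) and variance $\sigma^2/n$, that is, $\bar{Y} \sim N(\mu, \sigma^2/n)$. It follows that if $\sigma^2$ is known, then

$$Z = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$

and so

$$P\left(\bar{Y} - z_{\alpha/2}\,\sigma/\sqrt{n} \le \mu \le \bar{Y} + z_{\alpha/2}\,\sigma/\sqrt{n}\right) = 1 - \alpha,$$

where $z_{\alpha/2}$ denotes the upper $\alpha/2$ point of the standard normal distribution, i.e. $P(Z > z_{\alpha/2}) = \alpha/2$. That is, the $(1 - \alpha)100\%$ confidence interval for $\mu$ is given by

$$\left(\bar{Y} - z_{\alpha/2}\,\sigma/\sqrt{n},\ \bar{Y} + z_{\alpha/2}\,\sigma/\sqrt{n}\right).$$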

If $\sigma^2$ is unknown, then we construct our CI based on the following $T$-variable

$$T = \frac{\bar{Y} - \mu}{s/\sqrt{n}} \sim t_{n-1},$$

where $t_{n-1}$ is the $t$-distribution with $n - 1$ degrees of freedom, and

$$P\left(\bar{Y} - t_{n-1,\alpha/2}\,s/\sqrt{n} \le \mu \le \bar{Y} + t_{n-1,\alpha/2}\,s/\sqrt{n}\right) = 1 - \alpha,$$

so that the $(1 - \alpha)100\%$ confidence interval for $\mu$ is given by

$$\left(\bar{Y} - t_{n-1,\alpha/2}\,s/\sqrt{n},\ \bar{Y} + t_{n-1,\alpha/2}\,s/\sqrt{n}\right).$$

Note that while both intervals are centered at $\bar{Y}$, the margin of error $t_{n-1,\alpha/2}\,s/\sqrt{n}$ is a random variable, unlike the margin of error $z_{\alpha/2}\,\sigma/\sqrt{n}$ in the normal CI, which is a non-random quantity (i.e. it does not depend on the sample).
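The interpretation of the confidence level as the probability that the interval captures the true mean can be illustrated by simulation. The sketch below treats the known-variance case only, since the Python standard library provides normal quantiles through `statistics.NormalDist`. The population values, the seed and the helper name `z_interval` are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

def z_interval(y_bar, sigma, n, alpha):
    """(1 - alpha) confidence interval for mu when sigma is known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of N(0, 1)
    margin = z * sigma / math.sqrt(n)
    return y_bar - margin, y_bar + margin

rng = random.Random(1)                        # fixed seed for repeatability
mu, sigma, n, alpha = 50.0, 4.0, 10, 0.05     # hypothetical values
reps = 2000
hits = 0
for _ in range(reps):
    y_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
    lo, hi = z_interval(y_bar, sigma, n, alpha)
    hits += lo <= mu <= hi                    # True counts as 1
coverage = hits / reps
print(coverage)   # should be close to 1 - alpha
```

The margin of error is the same in every repetition because it depends only on $\sigma$, $n$ and $\alpha$; only the centre $\bar{y}$ moves from sample to sample.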

Example 1 (continued) Jars of coffee are labeled as 484 grams in weight. A random sample of ten jars from a production line is opened and weighed accurately. The ten weights are found to be as follows:

483.7 485.6 486.2 486.0 488.1 480.3 485.4 485.2 483.7 483.3

It is assumed that weights of coffee are normally distributed with mean $\mu$ grams and standard deviation $\sigma$.

Find the 95% CI for the true population mean of jar weights.

Using the information provided we find $\bar{y} = \frac{1}{10}(y_1 + \dots + y_{10}) = 484.75$ and $s^2 = \frac{1}{9}\left(\sum_{i=1}^{10} y_i^2 - 10\bar{y}^2\right) = 3.24$ (grams squared). Therefore, the 95% CI for the population mean is

$$484.75 \pm t_{9,0.025}\,\frac{s}{\sqrt{n}} = (483.462,\ 486.038),$$

where the critical value $t_{9,0.025} = 2.262$ is found in the Tables (or using software).
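The arithmetic of the interval can be reproduced in a few lines of Python. This sketch takes the summary statistics and the critical value exactly as quoted in the text rather than recomputing them, so it only checks the final step of the calculation.

```python
import math

# Summary statistics and critical value as quoted in the text.
n = 10
y_bar = 484.75
s = math.sqrt(3.24)        # sample standard deviation, 1.8 grams
t_crit = 2.262             # t_{9, 0.025} from tables

margin = t_crit * s / math.sqrt(n)
ci = (y_bar - margin, y_bar + margin)
print(round(ci[0], 3), round(ci[1], 3))
```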

1.3.3 Hypotheses testing

Often an investigator has a theory about the phenomenon under study, and wishes

to see whether this theory is confirmed by the data that have been collected. The

null hypothesis H0 is, usually, what we are prepared to “go along with” until we obtain

convincing evidence in favour of the alternative hypothesis H1. To conduct a hypothesis

test we need to complete the following steps.

(1) Specify the null and alternative hypotheses.

(2) Choose a test statistic T which is such that

◦ T behaves differently under the null and alternative hypotheses;

◦ the sampling distribution of T is fully specified when H0 is true.

(3) Formulate some decision rule based on the statistic T .

Whatever decision rule is adopted, there is some chance of reaching an erroneous conclusion about the population parameter of interest. One error that could be made, called a Type I error, is the rejection of a true null hypothesis. If the decision rule is such that the probability of rejecting a true null hypothesis is $\alpha$, then $\alpha$ is said to be the significance level of the test. The other possible error, called a Type II error, arises when a false null hypothesis is accepted. Suppose that for a particular decision rule the probability of making such an error is $\beta$. Then the probability of rejecting a false null hypothesis is $1 - \beta$, which is called the power of the test.


              NULL HYPOTHESIS TRUE      NULL HYPOTHESIS FALSE
ACCEPT        Correct decision          Type II error
              Probability = 1 − α       Probability = β
REJECT        Type I error              Correct decision
              Probability = α           Probability = 1 − β
              (significance level)      (power)

Ideally we would like to have the probabilities of both types of error as small as possible. However, in general, once a sample has been taken, any adjustment to the decision rule to reduce the probability $\alpha$ of a Type I error automatically increases the probability $\beta$ of a Type II error. The only way of simultaneously lowering both $\alpha$ and $\beta$ would be to obtain more information about the population, e.g. by taking a larger sample. In practice we usually fix the significance level $\alpha$ (the probability of a Type I error) at a small value such as 0.10, 0.05, 0.025 or 0.01. This then determines the probability $\beta$ of a Type II error (if there is a choice of tests then we prefer the one with the smallest $\beta$, that is, with the highest power $1 - \beta$). For a given significance level, the larger the sample size, the higher the power of the test.
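The trade-off between $\alpha$ and $\beta$, and the effect of the sample size, can be made concrete for a one-sided test of a normal mean with known variance. The sketch below assumes the test rejects when the standardised sample mean exceeds the upper $\alpha$ point of $N(0, 1)$; the values of $\mu_0$, the true mean $\mu_1$ and $\sigma$ are hypothetical, and the function name is illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_one_sided_z(mu0, mu1, sigma, n, alpha):
    """P(reject H0: mu = mu0 in favour of mu > mu0) when the true mean is mu1."""
    c_alpha = STD_NORMAL.inv_cdf(1 - alpha)          # P(Z > c_alpha) = alpha
    shift = (mu1 - mu0) * math.sqrt(n) / sigma       # mean of the test statistic under mu1
    return 1 - STD_NORMAL.cdf(c_alpha - shift)

mu0, mu1, sigma = 0.0, 0.5, 1.0    # hypothetical values
for alpha in (0.10, 0.05, 0.01):
    for n in (10, 40):
        beta = 1 - power_one_sided_z(mu0, mu1, sigma, n, alpha)
        print(alpha, n, round(beta, 3))
```

For a fixed $n$ the printed $\beta$ grows as $\alpha$ is reduced, while for a fixed $\alpha$ it falls as $n$ increases.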

Example 1 (continued) Would you say that the jars are labeled correctly?

The statistical model is already specified: we have a random sample of $n$ observations $Y_1, Y_2, \dots, Y_n$ with $Y_i \sim N(\mu, \sigma^2)$. The objective is to test hypotheses about the unknown population mean.

Consider the problem of testing the simple null hypothesis that the population mean is equal to some specified value $\mu_0$,

$$H_0: \mu = \mu_0,$$

against one of the following three alternative hypotheses:

(i) $H_1: \mu > \mu_0$, (ii) $H_1: \mu < \mu_0$, (iii) $H_1: \mu \ne \mu_0$.

Test of the mean of a normal distribution:

Population variance known

Assume first that the population variance is known. For all three cases, when the null hypothesis is true we have

$$Z = \frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$

If $H_1$ is true then in case (i) the r.v. $Z$ will tend to be larger (in case (ii) $Z$ will tend to be smaller, and in case (iii) the absolute value of $Z$ will tend to be larger) than would be expected for a standard normal random variable. Let us denote by $c_\alpha$ the number for which

$$P\{Z > c_\alpha\} = \alpha,$$

where $Z \sim N(0, 1)$. Then a test with significance level $\alpha$ (probability of a Type I error) is obtained from the decision rule:

(i) For $H_1: \mu > \mu_0$, reject $H_0$ if $\dfrac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} > c_\alpha$.

(ii) For $H_1: \mu < \mu_0$, reject $H_0$ if $\dfrac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} < -c_\alpha$.

(iii) For $H_1: \mu \ne \mu_0$, reject $H_0$ if $\left|\dfrac{\bar{y} - \mu_0}{\sigma/\sqrt{n}}\right| > c_{\alpha/2}$.

Example 1 (continued) Assume that the standard deviation is given as $\sigma = 1.8$ grams.
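One way this case could be worked through is sketched below in Python. It uses the sample size, sample mean and label weight quoted earlier in Example 1; the two-sided alternative $H_1: \mu \ne 484$ and the 5% significance level are assumptions made for this sketch, since the prompt above does not fix them.

```python
import math
from statistics import NormalDist

# n, y_bar and mu0 are quoted earlier in Example 1; alpha and the
# two-sided alternative are assumptions made for this sketch.
n, y_bar, mu0, sigma, alpha = 10, 484.75, 484.0, 1.8, 0.05

z_obs = (y_bar - mu0) / (sigma / math.sqrt(n))
c_half_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # c_{alpha/2}
reject = abs(z_obs) > c_half_alpha
print(round(z_obs, 3), round(c_half_alpha, 3), reject)
```

Under these assumptions the observed statistic lies inside the acceptance region, so the null hypothesis would not be rejected.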

Test of the mean of a normal distribution:

Population variance unknown

Suppose now that the population variance is no longer assumed known. If the sample size is not large, the procedures discussed above are no longer appropriate.

To perform the test we replace $\sigma^2$ by its estimator, the sample variance $s^2$:

$$T = \frac{\bar{Y} - \mu_0}{s/\sqrt{n}}.$$

If the null hypothesis is true then the r.v. $T$ follows a Student's $t$ distribution with $n - 1$ degrees of freedom ($t_{n-1}$). We can now use precisely the same arguments as above, with the Student's $t$ distribution playing the role of the standard normal distribution.

Let us denote by $c_\alpha$ the number for which

$$P\{T > c_\alpha\} = \alpha, \quad \text{where } T \sim t_{n-1}$$

($c_\alpha$ is the $(1 - \alpha)$th quantile, $t_{n-1}(1 - \alpha)$, of the $t_{n-1}$ distribution). Then a test with significance level $\alpha$ (probability of a Type I error) is obtained from the decision rule:

(i) For $H_1: \mu > \mu_0$, reject $H_0$ if $\dfrac{\bar{y} - \mu_0}{s/\sqrt{n}} > c_\alpha$.

(ii) For $H_1: \mu < \mu_0$, reject $H_0$ if $\dfrac{\bar{y} - \mu_0}{s/\sqrt{n}} < -c_\alpha$.

(iii) For $H_1: \mu \ne \mu_0$, reject $H_0$ if $\left|\dfrac{\bar{y} - \mu_0}{s/\sqrt{n}}\right| > c_{\alpha/2}$.

Example 1 (continued) Assume that the standard deviation is unknown.

Test of the mean of a normal distribution:

Large sample sizes

Suppose that we have a random sample of $n$ observations from a population with mean $\mu$ and variance $\sigma^2$. If the sample size $n$ is large ($n \ge 30$), the test procedures developed for the case where the population variance is known can be employed when it is unknown, replacing $\sigma^2$ by the observed sample variance $s^2$. Moreover, these procedures remain approximately valid even if the population distribution is not normal.

P-value

The smallest significance level at which a null hypothesis can be rejected is called the probability value or p-value of the test on the given sample.

The p-value gives the probability of observing a value as extreme as the one we have got, or even more extreme, when the null hypothesis is true. Suppose that the data produce $T_{obs}$ as the value of the test statistic $T$. Then we assume that $H_0$ is true and calculate the probability $p$ of observing a value of $T$ that is as extreme as $T_{obs}$ or more extreme, where ‘extreme’ is determined by the direction of departure of $H_1$ from $H_0$. For example, in the above procedures, if we test $H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$ then the values of $T$ more extreme than $T_{obs}$ in the direction of departure of $H_1$ are those with $T > T_{obs}$. On the other hand, if $H_1: \mu \ne \mu_0$, then “more extreme” means either $T > |T_{obs}|$ or $T < -|T_{obs}|$.

Example 1 (continued) Find the p-value of the test if $\sigma^2$ is assumed to be known.
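A possible calculation for the known-variance case is sketched below, assuming $\sigma = 1.8$ grams together with the sample size, sample mean and label weight quoted in Example 1. Both the one-sided and the two-sided p-values are computed from the standard normal CDF in `statistics.NormalDist`; which of them is relevant depends on the alternative hypothesis chosen.

```python
import math
from statistics import NormalDist

n, y_bar, mu0, sigma = 10, 484.75, 484.0, 1.8   # values quoted in Example 1
z_obs = (y_bar - mu0) / (sigma / math.sqrt(n))

phi = NormalDist().cdf                   # standard normal CDF
p_upper = 1 - phi(z_obs)                 # for H1: mu > mu0
p_two_sided = 2 * (1 - phi(abs(z_obs)))  # for H1: mu != mu0
print(round(p_upper, 4), round(p_two_sided, 4))
```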

In general, to draw conclusions about a test on the basis of the p-value, the following

guidelines are recommended:

1. If p is small (less than 0.01), reject H0.

2. If p is large (greater than 0.10), do not reject H0.

3. If 0.01 < p < 0.10, the significance is borderline: that is, we reject H0 for α = 0.10

but do not reject H0 for α = 0.01.

Note that if we actually do specify α a priori, we reject H0 if p < α.

In this example, the obvious choices of the null and alternative hypotheses are $H_0: \mu = 484$, $H_1: \mu \ne 484$. The significance level 0.05 (that is, 5%) is specified. The standard deviation $\sigma$ is unknown, but from the information provided we can estimate it by the sample standard deviation $s = \sqrt{s^2} = 1.8$ grams (and $s^2$ could be calculated from the statistics $y_1 + \dots + y_{10} = 4847.5$ and $y_1^2 + \dots + y_{10}^2 = 2349850$).

The test statistic is the t-statistic $T = \dfrac{\bar{Y} - \mu_0}{s/\sqrt{n}} \sim t_9$, which has the $t$-distribution with $9 = 10 - 1$ degrees of freedom. Recall from the above that $\bar{y} = 484.75$ and $s^2 = 3.24$. Therefore, for this sample,

$$T_{obs} = \frac{484.75 - 484}{1.8/\sqrt{10}} = 1.318.$$

Decision with the critical values (acceptance/rejection regions).

From the tables of the $t$-distribution we find that $t_{9,0.025} = 2.262$. So the corresponding rejection region (or critical region) is $(-\infty, -2.262) \cup (2.262, \infty)$. Since $-2.262 < 1.318 < 2.262$, we say that at the 5% significance level the data do not provide enough evidence for rejection of the null hypothesis.

Decision with the p-value. $H_1$ is two-sided, so that p-value $= 2(1 - P(T \le |T_{obs}|)) = 2(1 - P(T \le 1.318))$.

$P(T \le 1.318)$ is not explicitly given in the Tables but can be bracketed by the closest available values, $P(T \le 1.3) \le P(T \le 1.318) \le P(T \le 1.4)$, that is, $0.887 \le P(T \le 1.318) \le 0.9025$. This gives a p-value of at least $2(1 - 0.9025) = 0.195$, which is higher than 0.05; hence we do not reject $H_0$.
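Instead of bracketing the probability from tables, the p-value can be approximated numerically. The Python standard library has no $t$-distribution, so the sketch below integrates the $t$ density with composite Simpson's rule; the helper names `t_pdf` and `t_cdf` and the number of integration steps are choices made for illustration, and a statistics package would normally be used instead.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using symmetry and composite Simpson's rule on [0, x]."""
    h = x / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    return 0.5 + total * h / 3

n, y_bar, mu0, s = 10, 484.75, 484.0, 1.8    # values quoted in the text
t_obs = (y_bar - mu0) / (s / math.sqrt(n))
p_value = 2 * (1 - t_cdf(abs(t_obs), n - 1))
print(round(t_obs, 3), round(p_value, 3))
```

The computed p-value should fall between the bounds implied by the table values $0.887$ and $0.9025$, and the conclusion at the 5% level is unchanged.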


Test for the difference between two means: Matched pairs

Consider a different testing situation in which there are $n$ experimental units, each of which generates a pair of observations as a result of some treatment. Thus there is a set of $n$ values before the application of the treatment and a second set of $n$ values after the application of the treatment, i.e.

Experimental unit    1          2          3          ...    n
Before treatment     $y_{11}$   $y_{12}$   $y_{13}$   ...    $y_{1n}$
After treatment      $y_{21}$   $y_{22}$   $y_{23}$   ...    $y_{2n}$
Differences          $d_1$      $d_2$      $d_3$      ...    $d_n$

The single sample $d_1, d_2, \dots, d_n$ is formed from the differences of the paired observations, i.e. $d_i = y_{1i} - y_{2i}$. The objective is to test whether the ‘before’ and ‘after’ populations are the same. Assume that $d_1, d_2, \dots, d_n$ come from a $N(\mu, \sigma^2)$ population; then the procedures developed for the one-sample test can be employed to investigate the null hypothesis $H_0: \mu = 0$, where $\mu = E[D_1 - D_2]$ is the population mean difference of scores before and after the treatment. The three alternative hypotheses are $\mu \ne 0$, $\mu < 0$ or $\mu > 0$.

Example 2 A group of 12 subjects were given a series of tests to assess their memory, concentration and capacity to undertake simple arithmetic and logic computations. Their scores were recorded as Score 1. The same subjects again completed an equivalent series of tests when they were in the fifth week of a slimming diet, and their scores were recorded as Score 2. The results are given in the table below.

Subject    1      2      3      4      5      6      7      8      9      10     11     12
Score 1    60.2   70.7   39.5   40.3   22.5   53.8   62.5   57.1   54     63.9   59.1   67
Score 2    51.6   63.9   43.3   41.2   20.9   47.3   53.6   60.2   44.3   56.7   47.2   72.3

Do the data support the suggestion that dieting reduces mental effectiveness (during the period of dieting)?
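One way to approach this question with the matched-pairs procedure is sketched below. The notes leave the analysis open, so the choice of the one-sided alternative $H_1: \mu > 0$ (scores are lower during the diet) and of the 5% significance level are assumptions of this sketch; the critical value is the upper 5% point of the $t$-distribution with 11 degrees of freedom taken from standard tables.

```python
import math
import statistics

score1 = [60.2, 70.7, 39.5, 40.3, 22.5, 53.8, 62.5, 57.1, 54.0, 63.9, 59.1, 67.0]
score2 = [51.6, 63.9, 43.3, 41.2, 20.9, 47.3, 53.6, 60.2, 44.3, 56.7, 47.2, 72.3]

d = [a - b for a, b in zip(score1, score2)]   # Score 1 minus Score 2
n = len(d)
d_bar = statistics.fmean(d)
s_d = statistics.stdev(d)                      # sample standard deviation, divisor n - 1
t_obs = d_bar / (s_d / math.sqrt(n))           # one-sample t statistic for H0: mu = 0

t_crit = 1.796    # upper 5% point of t with 11 df, from standard tables
print(round(d_bar, 3), round(s_d, 3), round(t_obs, 3), t_obs > t_crit)
```

If the observed statistic exceeds the critical value, the one-sided test rejects $H_0$ at the assumed 5% level, which would support the suggestion that scores are lower during dieting.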

10

2020/2021

1. Introduction and Preliminaries

Reading

Faraway: Chapter 1 Krzanowski: Chapter 1

Kleinbaum et al: Chapter 4

Frees: Sections 1.1–1.3

Mendenhall et al: Chapter 2

1.1 Motivation

Terminology

We deal with measurements of several variables for each of n experimental units or

individuals. The variables are of two types (though the distinction between them is not

always rigid in applications): those of primary interest to the investigator and those

which might provide supplementary or background information. The variables of the

former type are called response, outcome or dependent variables, while those of latter

type are called explanatory, independent or predictor variables. Econometricians also

use the terms endogenous and exogenous to distinguish the two types of variables. The

explanatory variables are used to predict or to understand the response variables.

Relation between variables, models

We distinguish between a functional relation and a statistical relation. The functional

relation between the independent variable X and the dependent variable Y is often ex-

pressed as a mathematical formula

Y = f(X)

and the main feature of this relation is that the observations (xi, yi) (i = 1, . . . , n) fall

directly on the “curve” of the relationship, that is, on the curve y = f(x).

A statistical relation, unlike a functional relation is not a “perfect”one. Very often

explanatory variables are thought of as fixed, and response variables are thought of as

random variables with a distribution depending on the explanatory variables. Therefore,

for each value of an explanatory variable x the response Y may be supposed to be a

random variable with expectation (mean value) f(x) = E(Y |X = x), E(Y |x) in short.

Then a statistician may wish to determine the function f using sample data consisting of

pairs (xi, yi) (i = 1, . . . , n). Function f is called the regression function for regressing Y

on X and X is called the regressor.

The regression function f(x) represents the systematic component of the model. The

systematic component of the model is concerned with overall population features such as

expected values. To emphasize the existence of the random component of the model, the

response is often written in the form

Y = f(x) + ε

where f is the regression function, that is, the systematic component, and ε is the random

component. In most applications ε is a normal random variable with mean zero and

variance σ2 (ε ∼ N (0, σ2)).

1

Linear statistical model

The systematic component f is often expressed in terms of explanatory variables

through a parametric equation. If, for example, it is supposed that

f(x) = A+Bx+ Cx2

or

f(x) = A2x +B

or

f(x) = A log x+B,

then the problem is reduced to one of identifying a few parameters, here labeled as A,B,C.

In each of these three forms for f given above, f is linear in these parameters.

For example, A2x + B can be written as f(x,β) = g(x)Tβ, where g(x)T = (1, 2x) is

known as transformed input and βT = (B,A) is vector of the model parameters. Similarly,

A log(x) + B writes as g(x)Tβ, where transformed input is given by g(x)T = (1, log(x))

and vector of model parameters is still βT = (B,A).

It is the linearity in the parameters that makes the model a linear statistical model!

1.2 The model of measurements, revision of Year 1 Statistics.

Let µ be an unknown quantity of interest which can be measured with some error. A

mathematical (statistical) model for this experiment is specified by the following equation

(the model equation) Y = µ+ ε, where Y is the available measurement (observation) and

ε is a random error modelled as a random variable with zero mean, say, with normal

distribution with variance σ2, i.e. ε ∼ N(0, σ2). By properties of the normal distribution,

we have that Y ∼ N(µ, σ2). Suppose that we have n measurements Yi = µ + εi, where

εi ∼ N(0, σ2), (i = 1, ..., n, ) are independently distributed. It follows that Y1, ..., Yn are

then also independent random variables with Yi ∼ N(µ, σ2). In other words, Y1, ..., Yn is

a random sample from a normally distributed population with mean µ and variance σ2,

so that the problem of estimating an unknown quantity µ is the well known (from the 1st

Year statistics) problem of estimating a population mean of a normal population. The

sample mean Y¯ = 1

n

(Y1 + ...+ Yn) =

1

n

∑n

i=1 Yi is usually used as a point estimator of µ.

In MT1300 we briefly stated that there are several general methods of obtaining point

estimators. In this course we are going to use one of these methods, namely, the method

of the least squares (LS). To demonstrate the main idea of this method, let us consider

the case of the model of measurements. Given observations Y1, ..., Yn define the following

function

S(µ) =

n∑

i=1

(Yi − µ)2.

The value of µ that minimises S(µ) is called the least square estimator of µ. We can find

the point of minimum of S(µ), by equateing to zero the first derivative of S(µ)

S ′(µ) = −2

n∑

i=1

(Yi − µ) = −2

(

n∑

i=1

Yi − nµ

)

= 0.

Now it is easy to see that the solution of the above equation is Y¯ = 1

n

∑n

i=1 Yi, the sample

mean, and this is the point of minimum, as the second derivative of S(µ) at µ = Y¯ is

2

2n > 0. There is also a direct way to see that the sample mean is the point of minimum,

and, hence, the least square estimator of µ. Indeed,

S(µ) =

n∑

i=1

(Y 2i − 2Yiµ+ µ2) =

n∑

i=1

Y 2i − 2nµY¯ + nµ2

= −2nµY¯ + nµ2 + nY¯ 2 +

n∑

i=1

Y 2i − nY¯ 2

= n(µ− Y¯ )2 +

n∑

i=1

Y 2i − nY¯ 2 ≥

n∑

i=1

Y 2i − nY¯ 2,

where inequality becomes equality if and only if µ = Y¯ . Note finally that

S(Y¯ ) =

n∑

i=1

Y 2i − nY¯ 2 = (n− 1)s2,

where s2 is the sample variance, which is the point estimator of another model parameter

σ2 (see the next section).

3

1.3 Parametric statistical inference, brief revision of

Year 1 background

Reading

Krzanowski: Chapter 2

Kleinbaum et al: Chapter 3

Frees: Chapter 2

Mendenhall et al: Chapter 1

Newbold, Chapter 9

The process of making statements about population characteristics/parameters given

only information from samples is known as parametric statistical inference.

Example 1 A mechanical jar filler for filling jars with coffee does not fill every jar with

the same quantity. The weight of coffee Y filled in a jar is a random variable which can be

assumed to be normally distributed with mean value µ and variance σ2 (Y ∼ N (µ, σ2)).

Suppose that we have a sample of n independent measurements on Y and wish to

“identify” the parameters of the population (µ, σ2).

The sort of statements that we wish to make about parameters will often fall into one

of the following three categories:

• Point estimation;

• Interval estimation;

• Hypotheses testing.

1.3.1 Point estimation

Point estimation is the aspect of statistical inference in which we wish to find the “the

best guess” of the true value of a population parameter.

Suppose that Y1, Y2, · · · , Yn is a random sample from the population of interest.

Then an estimator of an unknown parameter θ is some function of the observations

Y1, Y2, · · · , Yn, that is

θˆ = θˆ(Y1, Y2, · · · , Yn)

(which is in some sense a “good approximation” to the unknown parameter θ).

Example 1 (continued) A point estimator of µ in a N (µ, σ2) population is provided by

the sample mean Y¯ , which is defined by

Y¯ =

1

n

n∑

i=1

Yi =

1

n

(Y1 + Y2 + · · ·+ Yn).

To estimate σ2 in aN(µ, σ2) population we generally use as its estimator the sample variance

s2 defined by

s2 =

1

n− 1

n∑

i=1

(Yi − Y¯ )2.

It is easy to see that s2 = 1

n−1

(∑n

i=1 Y

2

i − nY¯ 2

)

.

4

Properties of Estimators

Let θˆ = θˆ(Y1, Y2, · · · , Yn) be an estimator of an unknown parameter θ. To clarify in

what sense θˆ is a “good approximation” to θ we consider estimators which are (1) unbiased

and (2) mean square consistent.

(1) θˆ is said to be an unbiased estimator of θ if E(θˆ) = θ.

Example 1 (continued) Y¯ is an unbiased estimator of µ and s2 is an unbiased estimator

of σ2.

To check whether we have a sensible estimator we need to ensure that θˆ is increasingly

likely to yield the right answer θ as the sample size n gets bigger. The mean square error

(MSE) of θˆ is defined to be E(θˆ − θ)2. Since the MSE of θˆ is the average square distance

of θˆ from the true value θ, a good estimator is one with a small MSE.

(2) θˆ is said to be a mean square consistent estimator of θ if

MSE(θˆ)→ 0 as n→∞.

Note that if θˆ is unbiased then it is also mean square consistent if Var(θˆ) → 0 with

n→∞.

1.3.2 Interval estimation

Point estimation is often not sufficiently informative as it does not say anything about

the error of the estimation procedure. Naturally, if the error is large, then we are less

confident in our estimate. Replacing a point estimator by an interval estimator allows

us to quantify the uncertainty of estimation by specifying a desirable level of confidence,

which is the probability of the interval capturing the true value of the parameter. Such

interval estimators are known as confidence intervals (C.I.).

Example 1 (continued) To construct a confidence interval for µ we recall that Y¯ is a linear

combination of independent N (µ, σ2) random variables (Y¯ = ∑ni=1 Yi/n) and therefore is

normally distributed with mean µ (unbiased) and variance σ2/n, that is, Y¯ ∼ N (µ, σ2/n).

It therefore follows that if σ2 is known, then

Z =

Y¯ − µ

σ/

√

n

∼ N (0, 1)

and so

P

(

Y¯ − zα/2σ/

√

n ≤ µ ≤ Y¯ + zα/2σ/

√

n

)

= 1− α,

that is, the (1− α)100% confidence interval for µ is given by(

Y¯ − zα/2σ/

√

n, Y¯ + zα/2σ/

√

n

)

.

If σ2 is unknown, then we construct our CI based on the following T -variable

T =

Y¯ − µ

s/

√

n

∼ tn−1,

where tn−1 is the t-distribution with n− 1 degrees of freedom, and

P

(

Y¯ − tn−1,α/2s/

√

n ≤ µ ≤ Y¯ + tn−1,α/2s/

√

n

)

= 1− α,

5

so that the (1− α)100% confidence interval for µ is given by(

Y¯ − tn−1,α/2s/

√

n, Y¯ + tn−1,α/2s/

√

n

)

.

Note that while both the intervals are centered at Y¯ , the margin of error tn−1,α/2s/

√

n

is a random variable unlike the margin of error zα/2σ/

√

n in the normal CI, which is a

non-random quantity (i.e. does not depend on the sample).

Example 1 (continued) Jars of coffee are labeled as 484 grams in weight. A random sample of ten jars from a production line is opened and the jars are weighed accurately. The ten weights are found to be as follows:

483.7 485.6 486.2 486.0 488.1 480.3 485.4 485.2 483.7 483.3

It is assumed that the weights of coffee are normally distributed with mean $\mu$ grams and standard deviation $\sigma$.

Find the 95% CI for the true population mean of jar weights.

Using the information provided we find $\bar{y} = \frac{1}{10}(y_1 + \cdots + y_{10}) = 484.75$ and $s^2 = \frac{1}{9}\left(\sum_{i=1}^{10} y_i^2 - 10\,\bar{y}^2\right) = 3.24$ (grams squared). Therefore, the 95% CI for the population mean is

$$484.75 \pm t_{9,0.025}\,\frac{s}{\sqrt{n}} = (483.462,\ 486.038),$$

where the critical value $t_{9,0.025} = 2.262$ is found in the Tables (or using software).
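The arithmetic of this interval can be checked with a few lines of Python. This is a minimal sketch rather than part of the notes: the function name `t_interval` is hypothetical, the inputs are the summary statistics quoted in the example ($\bar{y} = 484.75$, $s^2 = 3.24$, $n = 10$), and the critical value 2.262 is entered by hand from the tables because the Python standard library provides no $t$-distribution quantile function.

```python
from math import sqrt


def t_interval(ybar, s, n, t_crit):
    """CI for the mean with unknown variance: ybar +/- t_crit * s / sqrt(n)."""
    margin = t_crit * s / sqrt(n)  # margin of error; depends on the sample through s
    return ybar - margin, ybar + margin


# Summary statistics and critical value as stated in the example.
ybar, s2, n = 484.75, 3.24, 10
t_crit = 2.262                     # t_{9, 0.025}, read from tables

lower, upper = t_interval(ybar, sqrt(s2), n, t_crit)
print(f"95% CI for mu: ({lower:.3f}, {upper:.3f})")
```

Run as written, this reproduces the interval (483.462, 486.038). The labeled weight of 484 grams lies inside it.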

1.3.3 Hypotheses testing

Often an investigator has a theory about the phenomenon under study, and wishes

to see whether this theory is confirmed by the data that have been collected. The

null hypothesis H0 is, usually, what we are prepared to “go along with” until we obtain

convincing evidence in favour of the alternative hypothesis H1. To conduct a hypothesis

test we need to complete the following steps.

(1) Specify the null and alternative hypotheses.

(2) Choose a test statistic T which is such that

◦ T behaves differently under the null and alternative hypotheses;

◦ the sampling distribution of T is fully specified when H0 is true.

(3) Formulate some decision rule based on the statistic T .

Whatever decision rule is adopted, there is some chance of reaching an erroneous conclusion about the population parameter of interest. One error that could be made, called a Type I error, is the rejection of a true null hypothesis. If the decision rule is such that the probability of rejecting a true null hypothesis is $\alpha$, then $\alpha$ is said to be the significance level of the test. The other possible error, called a Type II error, arises when a false null hypothesis is accepted. Suppose that for a particular decision rule the probability of making such an error is $\beta$. Then the probability of rejecting a false null hypothesis is $(1 - \beta)$, which is called the power of the test.

| Decision | Null hypothesis true | Null hypothesis false |
|---|---|---|
| Accept $H_0$ | Correct decision, probability $= 1 - \alpha$ | Type II error, probability $= \beta$ |
| Reject $H_0$ | Type I error, probability $= \alpha$ (significance level) | Correct decision, probability $= 1 - \beta$ (power) |

Ideally we would like to have the probabilities of both types of error as small as possible. However, in general, once a sample has been taken, any adjustment to the decision rule to reduce the probability $\alpha$ of a Type I error automatically increases the probability $\beta$ of a Type II error. The only way of simultaneously lowering both $\alpha$ and $\beta$ would be to obtain more information about the population, e.g., by taking a larger sample. In practice we usually fix the significance level $\alpha$ (the probability of a Type I error) at a small value such as 0.10, 0.05, 0.025, or 0.01. This then determines the probability $\beta$ of a Type II error (if there is a choice of tests, then we prefer the one with the smallest $\beta$, that is, with the highest power $(1 - \beta)$). For a given significance level, the larger the sample size, the higher the power of the test.
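The effect of sample size on power can be made concrete for one simple case. The sketch below is illustrative only and rests on stated assumptions: it takes the one-sided test of a normal mean with known $\sigma$ that rejects $H_0: \mu = \mu_0$ when $(\bar{y} - \mu_0)/(\sigma/\sqrt{n})$ exceeds the upper $\alpha$ point $c_\alpha$ of $N(0,1)$, for which the power at a true mean $\mu_1$ is $1 - \Phi\left(c_\alpha - (\mu_1 - \mu_0)\sqrt{n}/\sigma\right)$. The function name `power_one_sided_z` and the values $\mu_0 = 484$, $\mu_1 = 485$, $\sigma = 1.8$ are hypothetical choices made for the illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def power_one_sided_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the level-alpha z-test of H0: mu = mu0 against H1: mu > mu0
    when the true mean is mu1 and sigma is known."""
    c_alpha = STD_NORMAL.inv_cdf(1 - alpha)   # P(Z > c_alpha) = alpha
    shift = (mu1 - mu0) * sqrt(n) / sigma     # mean of the test statistic under mu1
    return 1 - STD_NORMAL.cdf(c_alpha - shift)


# Illustrative values (assumptions): true mean one gram above the null value.
powers = {n: power_one_sided_z(484.0, 485.0, 1.8, n) for n in (5, 10, 20, 40)}
for n, p in powers.items():
    print(f"n = {n:2d}: power = {p:.3f}, beta = {1 - p:.3f}")
```

With $\alpha$ held at 0.05, the printed power rises and $\beta$ falls as $n$ increases. When $\mu_1 = \mu_0$ the function returns $\alpha$ itself, the probability of a Type I error.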

Example 1 (continued) Would you say that the jars are labeled correctly?

The statistical model is already specified: we have a random sample of $n$ observations $Y_1, Y_2, \ldots, Y_n$ with $Y_i \sim N(\mu, \sigma^2)$. The objective is to test hypotheses about the unknown population mean.

Consider the problem of testing the simple null hypothesis that the population mean is equal to some specified value $\mu_0$,

$$H_0 : \mu = \mu_0,$$

against one of the following three alternative hypotheses:

(i) $H_1 : \mu > \mu_0$, (ii) $H_1 : \mu < \mu_0$, (iii) $H_1 : \mu \ne \mu_0$.

Test of the mean of a normal distribution:

Population variance known

Assume first that the population variance is known. For all three cases, when the null hypothesis is true we have

$$Z = \frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1).$$

If $H_1$ is true, then in case (i) the r.v. $Z$ will tend to be larger (for (ii) $Z$ will tend to be smaller, and for (iii) the absolute value of $Z$ will tend to be larger) than would be expected for a standard normal random variable. Let us denote by $c_\alpha$ the number for which

$$P\{Z > c_\alpha\} = \alpha$$

where $Z \sim N(0, 1)$. Then a test with significance level $\alpha$ (type I error) is obtained from the decision rule:

(i) For $H_1 : \mu > \mu_0$,

Reject $H_0$ if $\dfrac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} > c_\alpha$

(ii) For $H_1 : \mu < \mu_0$,

Reject $H_0$ if $\dfrac{\bar{y} - \mu_0}{\sigma/\sqrt{n}} < -c_\alpha$

(iii) For $H_1 : \mu \ne \mu_0$,

Reject $H_0$ if $\left|\dfrac{\bar{y} - \mu_0}{\sigma/\sqrt{n}}\right| > c_{\alpha/2}$.

Example 1 (continued) Assume that the standard deviation is given as $\sigma = 1.8$ grams.
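One way to carry out this exercise is sketched below. It is not the notes' own solution: the function name `z_test` and its `alternative` argument are hypothetical, and the two-sided alternative $H_1: \mu \ne 484$ with $\alpha = 0.05$ is assumed here because that is the choice made when the example is worked with unknown variance. The sample mean $\bar{y} = 484.75$ and $n = 10$ are the values from the example.

```python
from math import sqrt
from statistics import NormalDist


def z_test(ybar, mu0, sigma, n, alpha=0.05, alternative="two-sided"):
    """Return (z, reject) for the test of H0: mu = mu0 with known sigma."""
    z = (ybar - mu0) / (sigma / sqrt(n))

    def upper_point(a):
        # c_a such that P(Z > c_a) = a for Z ~ N(0, 1)
        return NormalDist().inv_cdf(1 - a)

    if alternative == "greater":       # case (i):   H1: mu > mu0
        reject = z > upper_point(alpha)
    elif alternative == "less":        # case (ii):  H1: mu < mu0
        reject = z < -upper_point(alpha)
    elif alternative == "two-sided":   # case (iii): H1: mu != mu0
        reject = abs(z) > upper_point(alpha / 2)
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, reject


# Values from Example 1, with sigma = 1.8 treated as known.
z, reject = z_test(ybar=484.75, mu0=484.0, sigma=1.8, n=10)
print(f"z = {z:.3f}, reject H0 at the 5% level: {reject}")
```

Under these assumptions the observed statistic is about 1.318, below $c_{0.025} \approx 1.96$, so the sketch does not reject $H_0$ at the 5% level.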

Test of the mean of a normal distribution:

Population variance unknown

Suppose now that the population variance is no longer assumed known. If the sample size is not large, the procedures discussed above are no longer appropriate.

To perform a testing procedure we replace $\sigma^2$ by its estimator, the sample variance $s^2$:

$$T = \frac{\bar{Y} - \mu_0}{s/\sqrt{n}}.$$

If the null hypothesis is true, then the r.v. $T$ follows a Student's $t$ distribution with $(n - 1)$ degrees of freedom ($t_{n-1}$). We can therefore use precisely the same arguments as above, with the Student's $t$ distribution playing the role of the standard normal distribution.

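Let us denote by $c_\alpha$ the number for which

$$P\{T > c_\alpha\} = \alpha \quad \text{where } T \sim t_{n-1}$$

($c_\alpha$ is the $(1-\alpha)$th quantile, $t_{n-1}(1-\alpha)$, of the $t_{n-1}$ distribution.) Then a test with significance level $\alpha$ (type I error) is obtained from the decision rule:

(i) For $H_1 : \mu > \mu_0$,

Reject $H_0$ if $\dfrac{\bar{y} - \mu_0}{s/\sqrt{n}} > c_\alpha$

(ii) For $H_1 : \mu < \mu_0$,

Reject $H_0$ if $\dfrac{\bar{y} - \mu_0}{s/\sqrt{n}} < -c_\alpha$

(iii) For $H_1 : \mu \ne \mu_0$,

Reject $H_0$ if $\left|\dfrac{\bar{y} - \mu_0}{s/\sqrt{n}}\right| > c_{\alpha/2}$.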

Example 1 (continued) Assume that the standard deviation is unknown.

Test of the mean of a normal distribution:

Large sample sizes

Suppose that we have a random sample of $n$ observations from a population with mean $\mu$ and variance $\sigma^2$. If the sample size $n$ is large ($n \ge 30$), the test procedures developed for the case where the population variance is known can be employed when it is unknown, replacing $\sigma^2$ by the observed sample variance $s^2$. Moreover, these procedures remain approximately valid even if the population distribution is not normal.

P-value

The smallest significance level at which a null hypothesis can be rejected is called the probability value or p-value of the test on the given sample.

The p-value gives the probability, when the null hypothesis is true, of observing a value as extreme as the one we have obtained or even more extreme. Suppose that the data produce $T_{\text{obs}}$ as the value of the test statistic $T$. Then we assume that $H_0$ is true and calculate the probability $p$ of observing a value of $T$ that is as extreme as $T_{\text{obs}}$ or more extreme, where 'extreme' is determined by the direction of departure of $H_1$ from $H_0$. For example, in the above procedures, if we test $H_0 : \mu = \mu_0$ against $H_1 : \mu > \mu_0$, then the values of $T$ more extreme than $T_{\text{obs}}$ in the direction of departure of $H_1$ are those with $T > T_{\text{obs}}$. On the other hand, if $H_1 : \mu \ne \mu_0$, then "more extreme" means either $T > |T_{\text{obs}}|$ or $T < -|T_{\text{obs}}|$.

Example 1 (continued) Find the p-value of the test if $\sigma^2$ is assumed to be known.
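A possible way to compute this p-value is sketched below, under assumptions that are not stated at this point in the notes: $\sigma = 1.8$ is treated as the known standard deviation, and the alternative is taken to be two-sided, $H_1: \mu \ne 484$. The function name `z_p_value` is hypothetical. Each branch is the probability, under $H_0$, of a value of $Z$ at least as extreme as the observed one in the direction of $H_1$.

```python
from math import sqrt
from statistics import NormalDist


def z_p_value(z_obs, alternative="two-sided"):
    """p-value of a z-test given the observed statistic z_obs."""
    phi = NormalDist().cdf               # standard normal CDF
    if alternative == "greater":         # extreme means Z > z_obs
        return 1 - phi(z_obs)
    if alternative == "less":            # extreme means Z < z_obs
        return phi(z_obs)
    if alternative == "two-sided":       # extreme means |Z| > |z_obs|
        return 2 * (1 - phi(abs(z_obs)))
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")


# Example 1 with sigma = 1.8 assumed known.
z_obs = (484.75 - 484.0) / (1.8 / sqrt(10))
p_value = z_p_value(z_obs)
print(f"z_obs = {z_obs:.3f}, two-sided p-value = {p_value:.3f}")
```

Under these assumptions the two-sided p-value comes out a little below 0.19.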

In general, to draw conclusions about a test on the basis of the p-value, the following

guidelines are recommended:

1. If p is small (less than 0.01), reject H0.

2. If p is large (greater than 0.10), do not reject H0.

3. If 0.01 < p < 0.10, the significance is borderline: that is, we reject H0 for α = 0.10

but do not reject H0 for α = 0.01.

Note that if we actually do specify α a priori, we reject H0 if p < α.

In this example, the obvious choices of the null and alternative hypotheses are $H_0 : \mu = 484$, $H_1 : \mu \ne 484$. Significance level 0.05 (that is, 5%) is specified. The standard deviation $\sigma$ is unknown, but from the information provided we can estimate it by the sample standard deviation $s = \sqrt{s^2} = 1.8$ grams (and $s^2$ could be calculated from the statistics $y_1 + \cdots + y_{10} = 4847.5$ and $y_1^2 + \cdots + y_{10}^2 = 2349850$).

The test statistic is the $t$-statistic $T = \dfrac{\bar{Y} - \mu_0}{s/\sqrt{n}} \sim t_9$, which has the $t$-distribution with $9 = 10 - 1$ degrees of freedom. Recall from the above that $\bar{y} = 484.75$ and $s^2 = 3.24$. Therefore, for the sample we obtain

$$T_{\text{obs}} = \frac{484.75 - 484}{1.8/\sqrt{10}} = 1.318.$$

Decision with the critical values (acceptance/rejection regions). From the tables of the $t$-distribution we find that $t_{9,0.025} = 2.262$. So the corresponding rejection region (or critical region) is $(-\infty, -2.262) \cup (2.262, \infty)$. Then, since $-2.262 < 1.318 < 2.262$, we say that at the 5% significance level the data do not provide enough evidence for rejection of the null hypothesis.

Decision with the p-value. $H_1$ is two-sided, so that the p-value is $2(1 - P(T \le |T_{\text{obs}}|)) = 2(1 - P(T \le 1.318))$.

$P(T \le 1.318)$ is not explicitly given in the Tables but can be bounded using the closest available values, $P(T \le 1.3) \le P(T \le 1.318) \le P(T \le 1.4)$, that is, $0.887 \le P(T \le 1.318) \le 0.9025$. This gives a p-value of at least 0.195, which is higher than 0.05; hence we do not reject $H_0$.
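Instead of bracketing the probability between two table entries, the $t$ distribution function can be evaluated numerically. The sketch below is one hedged way to do so with the Python standard library only: it integrates the Student's $t$ density with the composite Simpson rule. The function names `t_pdf` and `t_cdf` and the number of integration steps are assumptions made for this illustration; statistical software would normally supply the distribution function directly.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(t, df, steps=2000):
    """P(T <= t), using Simpson's rule on [0, |t|] and symmetry about 0.
    steps must be even."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3                 # integral of the density over [0, |t|]
    return 0.5 + area if t >= 0 else 0.5 - area


# Values from Example 1: ybar = 484.75, mu0 = 484, s = 1.8, n = 10.
t_obs = (484.75 - 484.0) / (1.8 / sqrt(10))
p_value = 2 * (1 - t_cdf(abs(t_obs), df=9))
print(f"T_obs = {t_obs:.3f}, two-sided p-value = {p_value:.3f}")
```

The computed p-value falls between the bounds implied by the table entries, 0.195 and 0.226, so the conclusion at the 5% level is unchanged.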

Test for the difference between two means: Matched pairs

Consider a different testing situation in which there are n experimental units, each of

which generates a pair of observations as a result of some treatment. Thus there is a set

of n values before the application of the treatment and then a second set of n values after

the application of the treatment, i.e.

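| Experimental unit | 1 | 2 | 3 | ... | n |
|---|---|---|---|---|---|
| Before treatment | $y_{11}$ | $y_{12}$ | $y_{13}$ | ... | $y_{1n}$ |
| After treatment | $y_{21}$ | $y_{22}$ | $y_{23}$ | ... | $y_{2n}$ |
| Differences | $d_1$ | $d_2$ | $d_3$ | ... | $d_n$ |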

The single sample $d_1, d_2, \ldots, d_n$ is formed from the differences of the samples, i.e. $d_i = y_{1i} - y_{2i}$. The objective is to test whether the 'before' and 'after' populations are the same. Assume that $d_1, d_2, \ldots, d_n$ come from a $N(\mu, \sigma^2)$ population; then the procedures developed for the one-sample test can be employed to investigate the null hypothesis $H_0 : \mu = 0$, where $\mu = E[D_1 - D_2]$ is the population mean difference of scores before and after the treatment. The three alternative hypotheses are $\mu \ne 0$, $\mu < 0$ or $\mu > 0$.

Example 2 A group of 12 subjects were given a series of tests to assess their memory,

concentration and capacity to undertake simple arithmetic and logic computations. Their

scores were recorded as Score 1. The same subjects again completed an equivalent series

of tests when they were in the fifth week of a slimming diet and their scores were recorded

as Score 2. The results are given in the table below.

Subject 1 2 3 4 5 6 7 8 9 10 11 12

Score 1 60.2 70.7 39.5 40.3 22.5 53.8 62.5 57.1 54 63.9 59.1 67

Score 2 51.6 63.9 43.3 41.2 20.9 47.3 53.6 60.2 44.3 56.7 47.2 72.3

Do the data support the suggestion that dieting reduces mental effectiveness (during the

period of dieting)?
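One possible analysis of these data is sketched below; it is an illustration under stated assumptions, not the notes' own solution. The differences are taken as Score 1 minus Score 2, so a reduction in effectiveness corresponds to the one-sided alternative $H_1: \mu > 0$. The 5% significance level and the critical value $t_{11,0.05} = 1.796$ (a standard table value, not quoted in the notes) are assumptions, as are the variable names.

```python
from math import sqrt
from statistics import mean, stdev

# Scores from the table in Example 2.
score1 = [60.2, 70.7, 39.5, 40.3, 22.5, 53.8, 62.5, 57.1, 54.0, 63.9, 59.1, 67.0]
score2 = [51.6, 63.9, 43.3, 41.2, 20.9, 47.3, 53.6, 60.2, 44.3, 56.7, 47.2, 72.3]

# Matched pairs: reduce to a single sample of differences d_i = y_1i - y_2i.
diffs = [before - during for before, during in zip(score1, score2)]
n = len(diffs)
d_bar = mean(diffs)        # sample mean of the differences
s_d = stdev(diffs)         # sample standard deviation (divisor n - 1)

# One-sample t statistic for H0: mu = 0, with n - 1 = 11 degrees of freedom.
t_obs = d_bar / (s_d / sqrt(n))

t_crit = 1.796             # assumed: upper 5% point of t_11 from standard tables
reject = t_obs > t_crit    # decision rule (i) for H1: mu > 0

print(f"n = {n}, mean difference = {d_bar:.3f}, s_d = {s_d:.3f}")
print(f"T_obs = {t_obs:.3f}, reject H0 at the 5% level: {reject}")
```

Under these assumptions the statistic exceeds the critical value, which would count as evidence at the 5% level that scores are lower during dieting. A two-sided alternative or a different significance level would need a different critical value.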
