School of Mathematics and Statistics
MATH2801 – Theory of Statistics
Dr Eka Shinjikashvili
eka@unsw.edu.au
2022, Term 2
These lecture notes were written by Matt Wand then modified and edited by David Warton, Diana Combe, Zdravko Botev,
Libo Lee, Jakub Stoklosa and others.
1
A brief introduction
Statistics is “learning from data” – the science of designing studies and analysing their
results.
Statistics uses a lot of mathematics, and some like statistics because of its challenging
mathematics.
It also involves many other skills arising from the application of statistical thinking in
practice, and some like statistics because of its usefulness in a range of interesting
applications.
Statistics is pervasive – everyone has data they need to analyse!
Statistical reasoning can give you a completely new perspective on everyday things you
hear about that involve data – news stories, ads, political campaigns, surveys, . . .
In fact it has been argued that an understanding of statistics is essential to truly
understanding the world around you.
2
Example
Shortly before the last New South Wales (NSW) state election, a poll was held asking a
selection of NSW voters who they would vote for.
Out of 365 respondents, 65% said they would vote for the Liberal party ahead of Labor.
Some important questions where statistics can help are:
• How accurate is this estimate of the proportion of people voting for Liberal ahead of
Labor?
• Is this sample, of just 365 NSW voters, sufficient to predict that the Liberal party
would win the election?
3
Example
Does smoking while pregnant affect the cognitive development of the foetus?
Johns et al. (1993) conducted a study to look at this question using guinea pigs as a
model. They planned to inject nicotine tartrate in a saline solution into some pregnant
guinea pigs, inject no nicotine into others, and compare the cognitive development of
offspring by getting them to complete a simple maze where they look for food.
Some important questions where statistics can help are:
• How many guinea pigs should they include in the experiment?
• How should the “no nicotine” treatment have been applied?
• How should the data be analysed to assess the effect of nicotine on cognitive
development?
4
Some other examples of research questions where we can use statistics are:
• Is climate change affecting species distributions?
• What’s your chance of winning the lottery? Is it worth playing?
• Is there gender bias in promotion?
• Does praying for a patient improve their rate of post-operative recovery (Benson et
al., 2005)?
• Is Sydney getting less rainfall than it used to?
5
Statistics is of fundamental importance in a wide range of disciplines, and as such,
statistical skills are highly valued in the job market.
Some major examples are in research – studying whether a new treatment works, or
understanding the effects of a new pest on wildlife – and in business/industry – studying
and predicting sales patterns, or trialling new products.
There is a serious shortage of statisticians in the job market, which is good news for you
if you choose to do a stats major!
Below are some key ideas in statistics – some of these themes will come up repeatedly in
this course and indeed in the statistics major.
6
Sample vs population
Definition
The population is the total set of subjects we are interested in.
A sample is the subset of the population for which we have data.
A census is a study which obtains data on the whole population.
The vast majority of studies use a sample rather than a census for logistical reasons – it
can be cheaper and it is easier to get data on just a sample, and you can often get
“better data” by spending a lot of time with a few subjects rather than spending a little
time with a lot.
Example
Consider the guinea pigs study of the effects of smoking during pregnancy on offspring.
The study used a sample of guinea pigs – the alternative would be to enroll every guinea
pig on the planet in the study! That’s just not going to happen. . .
7
Example
The Australian Bureau of Statistics (ABS) coordinates a census of all Australians every
five years. This is designed to find out demographic information such as population size,
age of Australians, education, etc. However, they supplement this information with
face-to-face interviews of a sample of people, e.g. for monthly estimates of the
unemployment rate.
When using a sample, a challenging question we will consider is: what can we say about
the population, based on our sample?
8
Description vs inference
Definition
Descriptive statistics refers to methods for summarising data.
Inferential statistics are methods of making statements, decisions or predictions about
a population, based on data obtained from a sample of that population.
Most statistical analyses that you have met in previous studies are descriptive statistics –
calculating a sample mean, histogram, etc.
Calculating descriptive statistics is a key step in analysis, and it is useful for looking for
patterns in a sample.
But often we want to say something general about a population based on the sample,
that is, we want to make inferences about the population.
9
Example
Consider the NSW poll, which included 365 registered NSW voters.
When we say 65% of respondents would vote for the Liberal party ahead of Labor, we are
reporting a descriptive statistic.
When we use the data to answer the question “How much evidence does this study
provide that the Liberal Party will win the next election?” we are making an inference
about the population of all 7.544 million (!) NSW voters, based on a sample of just 365.
Inference is where things get challenging in statistics – both mathematically and
conceptually – and later in this course we will meet some core tools for making inferences
about populations based on samples, in one-and two-variable situations.
10
Sampling introduces variation
A fundamental idea in statistical inference is that sampling induces variation – different
samples will give you different data, depending on which subjects end up getting included
in the sample.
Example
Consider again the NSW election example: recall that 365 NSW voters were sampled and,
of these, 65% said they would vote for the Liberal party ahead of the Labor party.
If a different 365 NSW voters were sampled, would you expect to get exactly 65% voting
for the Liberal party again?
11
When we want to make inferences about a population based on a sample, we need to
take into account the “sample variation” – the extent to which we would expect results to
vary from one sample to the next.
If we sample randomly, the responses in our sample are random, which motivates the
use of probability theory in data analysis.
Probability has a key role to play in statistical inference, so a focus in this course is to
develop key probabilistic ideas.
A chapter on basic probability concepts and terminology is available on Moodle. Please
revise this chapter in your own time.
12
What is the research question?
How you collect data, and how you analyse it, depend on the primary research question
you are trying to answer.
So the first thing to understand when thinking about the design and analysis of a study
is: what is the primary purpose of the study? All else depends on it.
Example
How you collect data depends on the question
Consider the NSW election example where we want to answer the question “Who will win
the next election?”
To answer this question we need to start with a representative sample of NSW voters.
13
In obtaining a representative sample of NSW voters we need to make sure we don’t
include people who are not eligible to vote in the election, for example:
• People under the age of 18.
• People who have not registered to vote.
• People not registered to vote in NSW.
We also need to sample in a manner that gives all NSW voters the opportunity of being
in the sample, to make sure all types of voter are represented.
For example, "random digit dialing" – dialing random (landline) phone numbers – would
exclude from the sample any voters who do not have a landline (a voter might only own a
mobile phone, or have no phone at all).
14
Example
How you analyse data depends on the question
Consider the following data:
Sample A: 8 30 29 27 26 33 0 42 21 18
Sample B: 37 65 34 78 45 43 21 25 20 75
How should it be analysed?
Below are three graphs that would all be appropriate for their own research questions.
[Figure: (a) scatterplot of Sample B against Sample A; (b) side-by-side boxplots of the
number of errors for Samples A and B; (c) boxplot of the paired differences.]
15
These data are actually from the guinea pig experiment, where Sample A is the number
of errors in the maze made by control guinea pigs (no nicotine treatment) and Sample B
is the number of errors by treatment guinea pigs (with nicotine treatment).
Which plot is most suitable for visualising the effect of nicotine on cognitive development
of guinea pigs?
The main lesson here is that whenever collecting or analysing data, or answering
questions on how to do it, you need to keep in mind the primary purpose of the study!
16
Statistics packages
Graphs (and indeed most statistical procedures) are most easily implemented using a
computer, and a statistics package specially developed for data analysis.
Common programs used for statistics:
• R/RStudio (We’ll use RStudio in MATH2801 – used for most graphs in the lecture
notes).
• SAS
• SPSS (PASW)
• Excel
• Minitab
• S+ (S-PLUS)
17
Chapter 1: Descriptive statistics
18
Introduction
Before we start looking and thinking about the theory of statistics, we will begin with
some revision of summary statistics. This is a good place to start because we get a feel
for the data at hand and we can refresh some elementary concepts.
So, given a sample of data, {x1, x2, . . . , xn} of sample size n, how would you summarise
it: graphically or numerically?
In this chapter we will briefly review some key tools. Most of this material is considered
to be revision (that is, you would have used most of these concepts in high-school or
other courses), so we will move quickly.
You will not be expected to construct the following numerical and graphical summaries1
by hand, but you should understand how they are produced, know how to produce them
using the statistics package R/RStudio, and know how to interpret such summaries.
1For more details on methods for descriptive statistics see: W. S. Cleveland (1994). Elements of
graphing data, Hobart Press.
19
Two steps to data analysis
The first two things to think about in data analysis are:
1. What is the research question? Descriptive statistics should primarily focus on
providing insight into this question.
2. What are the properties of the variables of primary interest?
20
After Step 1. has been decided, the next important property to think about when
constructing descriptive statistics is whether each variable (in the data set) is
categorical or quantitative.
Definition
Any variable measured on subjects is either categorical or quantitative. More specifically:
Categorical – Responses can be sorted into a finite set of categories, e.g. gender or a
preference for a political party.
Quantitative – Responses take numerical values and are usually measured on some sort
of scale, e.g. height or temperature, or it could be a count of something, e.g. number of
stars in a galaxy.
If the sample {x1, x2, . . . , xn} comes from a quantitative variable, then the xi are real
numbers, xi ∈ ℜ.
If it comes from a categorical variable, then each xi comes from a finite set of
categories or “levels”, xi ∈ {C1, C2, . . . , CK}.
21
Examples
Consider the following questions:
1. Will more people vote for the Liberal party ahead of the Labor party at the next election?
2. Is the number of errors made in a maze by offspring of pregnant guinea pigs affected
by whether or not the mothers are given the nicotine treatment?
3. Is gender of a Titanic passenger related to whether or not they survived?
4. How does brain mass change in dinosaurs, as body mass increases?
What are the variables of interest in these questions? Are each of these variables
categorical or quantitative?
1.
2.
3.
4.
22
Summary of descriptive methods
Useful descriptive methods for when we wish to summarise one variable, or the association
between two variables, depend on whether these variables are categorical or quantitative.
Does the research question involve one variable or two?

One variable:
  Categorical – Numerics: table of frequencies. Graphs: bar chart.
  Quantitative – Numerics: mean/sd, median/quantiles. Graphs: dotplot, boxplot, histogram, etc.

Two variables:
  Both categorical – Numerics: two-way table. Graphs: clustered bar chart.
  One categorical, one quantitative – Numerics: mean/sd per group. Graphs: clustered dotplot, clustered boxplot, clustered histogram, etc.
  Both quantitative – Numerics: correlation. Graphs: scatterplot.
We will work through each of the methods mentioned in the above table.
23
Example
Consider again the research questions of the previous example.
What method(s) would you use to construct a graph to answer each research question?
1.
2.
3.
4.
24
Categorical data
We will simultaneously treat the problems of summarising one categorical variable and
studying the association between two categorical variables, because similar methods are
used for these problems.
25
Numerical summaries of categorical data
The main tool for summarising categorical data is a table of frequencies (or percentages).
Definition
A table of frequencies consists of the counts of how many subjects fall into each level
of a categorical variable.
A two-way table (of frequencies) counts how many subjects fall into each combination
of levels from a pair of categorical variables.
26
Example
We can summarise the NSW election poll as follows:
Party Liberal Labor
Frequency 237 128
Example
Consider the question of whether there is an association between gender and whether or
not a passenger on the Titanic survived.
We can summarise the results from passenger records as follows:
                    Survival outcome
                    Survived   Died
Gender   Male       142        709
         Female     308        154
If studying the association between two categorical variables, a two-way table
cross-classifies subjects according to how many fall in each combination of categories
across the two variables.
27
Whenever one of the variables of interest has only two possible outcomes, a list (or
table) of percentages is a useful alternative way to summarise the data.
In the Titanic example (previous slide), an alternative summary is to use the percentage
survival conditional on gender.
We see that a much higher percentage of females survived compared to males: their
survival rate was 100× {308/(308 + 154)} ≈ 67% vs 100× {142/(142 + 709)} ≈ 17%!
If you are interested in an association between more than two categorical variables, it’s
possible to extend the above ideas, e.g. construct a three-way table. . .
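For example, a minimal R sketch of a two-way table and the conditional survival
percentages, entered from the Titanic counts above:

  # Two-way table of survival by gender
  titanic <- matrix(c(142, 709, 308, 154), nrow = 2, byrow = TRUE,
                    dimnames = list(Gender = c("Male", "Female"),
                                    Outcome = c("Survived", "Died")))
  titanic
  # Percentage in each outcome, conditional on gender (row percentages)
  round(100 * prop.table(titanic, margin = 1), 1)   # about 17% vs 67% survival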
28
Graphical summaries of categorical data
A bar chart is a graph of a table of frequencies.
A clustered bar chart graphs a two-way table, spacing the “bars” out as clusters to
indicate the two-variable structure:
[Figure: left, a bar chart of the party-preference frequencies (Liberal, Labor); right, a
clustered bar chart of Titanic survival (Died, Survived) by gender (male, female), with
frequency on the vertical axis.]
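A rough R sketch of both plots, using the frequencies given earlier (the exact plots in
the notes may have been drawn differently):

  # Bar chart of the party-preference frequencies
  party <- c(Liberal = 237, Labor = 128)
  barplot(party, ylab = "Frequency", main = "Bar chart")

  # Clustered bar chart of survival outcome by gender (counts from the Titanic table)
  titanic <- matrix(c(142, 709, 308, 154), nrow = 2, byrow = TRUE,
                    dimnames = list(c("male", "female"), c("Survived", "Died")))
  barplot(t(titanic), beside = TRUE, legend.text = TRUE,
          ylab = "Frequency", main = "Clustered bar chart")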
29
Pie charts are often used to graph categorical variables; however, they are not generally
recommended.
It has been shown that readers of pie charts find it more difficult to extract the
information contained in them – e.g. to compare the relative sizes of frequencies across
categories – than readers of the equivalent bar chart.
[Figure: the same five frequencies (categories a–e) drawn as a pie chart and as a bar chart.]
(For details, see the Wikipedia entry on pie charts and references therein
http://en.wikipedia.org/wiki/Pie_chart)
30
Quantitative data
When summarising a quantitative variable, we are usually interested in three things:
• Location or “centre” or “central location” – a value around which most of the data
lie.
• Spread – how variable the values are around their centre.
• Shape – other information about a variable apart from location and spread. Skewness
is an important example, which may or may not be a result of suspected
outliers/unusual observations.
31
Numerical summaries of quantitative data
The most commonly used numerical summaries of a quantitative variable are the
(observed) sample mean, variance and standard deviation:
Definition
The sample mean
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
is a natural measure of location of a quantitative variable.
The sample variance
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]
is a common measure of spread.
The sample standard deviation is defined as $s = \sqrt{s^2}$.
32
The variance is a useful quantity for theoretical purposes, as we will see in the coming
chapters.
The standard deviation however is of more practical interest because it is on the same
scale as the original variable and hence is more readily interpreted.
The sample mean and variance are very widely used and we will derive a range of useful
results about these estimators in this course.
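For example, a minimal R sketch using Sample A from the earlier example:

  # Sample A: number of maze errors for the control (no nicotine) guinea pigs
  x <- c(8, 30, 29, 27, 26, 33, 0, 42, 21, 18)
  mean(x)   # sample mean, 23.4
  var(x)    # sample variance (note the n - 1 divisor), about 150.3
  sd(x)     # sample standard deviation, about 12.3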
33
Let’s say we order the n values in the dataset and write them in increasing order as
{x(1), x(2), . . . , x(n)}. For example, x(3) is the third smallest observation in the dataset.
Definition
The sample median is
\[ \tilde{x}_{0.5} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[4pt] \frac{1}{2}\left( x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n+2}{2}\right)} \right) & \text{if } n \text{ is even.} \end{cases} \]
More generally, the pth sample quantile of the data x is
\[ \tilde{x}_p = x_{(k)} \quad \text{where} \quad p = \frac{k - 0.5}{n} \]
for $k \in \{1, 2, \ldots, n\}$. We can estimate the sample quantile for other values of p by linear
interpolation.
34
The median is sometimes suggested as a measure of location, instead of x¯, because it is
much less sensitive to unusual observations (outliers). However, it is much less widely
used in practice.
There are a number of alternative (but very similar) ways of defining sample quantiles;
a slightly different method again is used as the default in the statistics package
R/RStudio.
35
Example
The following (ordered) dataset is the number of mistakes made when ten subjects are
each asked to do a repetitive task 500 times.
2 4 5 7 8 10 14 17 27 35
Find the 5th and 15th sample percentiles of the data. Hence, find the 10th percentile.
There are ten observations (n = 10) in the dataset. For the 5th sample percentile we
have p = 0.05, so
\[ p = \frac{k - 0.5}{n} \;\Rightarrow\; 0.05 = \frac{k - 0.5}{10} \;\Rightarrow\; k = 1 \;\Rightarrow\; \tilde{x}_{0.05} = x_{(1)} = 2. \]
Similarly, we can show that the 15th sample percentile is 4.
For the 10th sample percentile, we get k = 1.5 which is in the middle of x(1) and x(2). To
estimate this value we can take the average (that is, we interpolate),
\[ \tilde{x}_{0.1} = \tfrac{1}{2}\left( x_{(1)} + x_{(2)} \right) = \tfrac{1}{2}(2 + 4) = 3. \]
36
Apart from $\tilde{x}_{0.5}$, the two other important quantiles are the first and third quartiles, $\tilde{x}_{0.25}$
and $\tilde{x}_{0.75}$ respectively.
Definition
These terms are used to define the interquartile range
\[ \mathrm{IQR} = \tilde{x}_{0.75} - \tilde{x}_{0.25} \]
which is sometimes suggested as an alternative measure of spread to the sample standard
deviation, because it is much less sensitive to unusual observations (outliers).
It is however rarely used in practice.
Definition
Another useful numerical summary of sample data is the five number summary, which
consists of the median and quartiles of the sample, together with the minimum and
maximum observed values:
\[ \left\{ x_{(1)},\ \tilde{x}_{0.25},\ \tilde{x}_{0.5},\ \tilde{x}_{0.75},\ x_{(n)} \right\} \]
This ordered selection of numbers can tell you a lot of useful information about a variable
at a glance.
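A quick way to obtain such summaries in R (a sketch; note that fivenum() uses Tukey's
hinges, which can differ slightly from the quartile definition above, and summary() uses
R's default quantile convention):

  x <- c(2, 4, 5, 7, 8, 10, 14, 17, 27, 35)
  fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum
  summary(x)   # minimum, 1st quartile, median, mean, 3rd quartile, maximum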
37
Graphical summaries of quantitative data
There are many ways to summarise a variable, and a key thing to consider when choosing
a graphical method is the sample size (n).
Some common plots:
[Figure: examples of a dotchart (small n), a boxplot (moderate n) and a histogram (large n).]
A dotchart is a plot of each observed value (x-axis) against its observation number, with data
labels (if available). This is useful for small samples (e.g. n < 20).
38
A boxplot concisely describes location, spread and shape via the median, quartiles and
extremes:
• The line in the middle of the box is the median, the measure of centre.
• The box is bounded by the upper and lower quartiles, so box width is a measure of
spread (the interquartile range, IQR).
• The whiskers extend to the most extreme value within one and a half interquartile
ranges (1.5 × IQR) of the nearest quartile.
• Any value farther than 1.5× IQR from its nearest quartile is classified as an extreme
value (or “outlier”), and labelled as a dot or open circle.
Boxplots are most useful for moderate-sized samples (e.g. 10 < n < 50).
39
Definition
A histogram is a plot of the frequencies or relative frequencies of values within different
intervals or bins that cover the range of all observed values in the sample.
This involves breaking the data up into smaller subsamples, and as such it will only find
meaningful structure if the sample is large enough (e.g. n > 30) for the subsamples to
contain non-trivial counts.
An issue in histogram construction is choice of number of bins.
A useful rough rule-of-thumb is to use
\[ \text{number of bins} = \sqrt{n}. \]
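For example, a rough R sketch (the data here are made up for illustration; note that
'breaks' is only a suggestion, which R may adjust to give tidy bin boundaries):

  x <- rnorm(100, mean = 30, sd = 10)            # illustrative data, n = 100
  hist(x, breaks = ceiling(sqrt(length(x))),     # roughly sqrt(n) bins
       xlab = "x", main = "Histogram")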
40
Kernel density estimator*
A histogram is a step-wise rather than smooth function. A quantitative variable that is
continuous (i.e. a variable that can take any value within some interval) might be better
summarised by a smooth function.
Definition
An alternative estimator that often has better properties for continuous data is a kernel
density estimator:
\[ \hat{f}_h(x) = \frac{1}{n} \sum_{i=1}^{n} w_h(x - x_i) \]
for some choice of weighting function $w_h(x)$ which includes a "bandwidth parameter" $h$.
Usually, $w_h$ is chosen to be the normal density (defined in Chapter 3) with mean 0 and
standard deviation $h$.
A lot of research has studied the issue of how to choose a bandwidth h, and most
statistics packages are now able to automatically choose an estimate of h that usually
performs well.
The larger the bandwidth $h$ is, the larger the range of observed values $x_i$ that influence the
estimate $\hat{f}_h(x)$ at any given point $x$.
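A minimal R sketch (with made-up data) – density() uses a normal kernel by default and
chooses a bandwidth automatically, but the bandwidth can also be set by hand via 'bw':

  x <- rnorm(200, mean = 30, sd = 10)                   # illustrative data
  plot(density(x), main = "Kernel density estimate")    # automatic bandwidth
  lines(density(x, bw = 10), lty = 2)                   # larger bandwidth, smoother estimate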
41
Shape of a distribution
Something we can see from a graph that is hard to see from numerical summaries is the
shape of a distribution. Shape properties, broadly, are characteristics of the distribution
apart from location and spread.
An example of an important shape property is skew – if the data tend to be asymmetric
about their centre, they are skewed. We say data are "left-skewed" if the left tail is longer than
the right; conversely, data are "right-skewed" if the right tail is longer.
43
Coefficient of skewness*
There are some numerical measures of shape, e.g. the coefficient of skewness $\kappa_1$:
\[ \hat{\kappa}_1 = \frac{1}{(n-1)s^3} \sum_{i=1}^{n} (x_i - \bar{x})^3 \]
but they are rarely used – perhaps because of extreme sensitivity to outliers, and perhaps
because shape properties can be easily visualised as above.
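For completeness, a short R sketch computing $\hat{\kappa}_1$ directly from the definition above,
using Sample A:

  x <- c(8, 30, 29, 27, 26, 33, 0, 42, 21, 18)
  n <- length(x)
  kappa1 <- sum((x - mean(x))^3) / ((n - 1) * sd(x)^3)
  kappa1   # positive values suggest right skew, negative values left skew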
45
Outliers
Definition
Another important thing to look for in graphs is outliers – unusual observations that
might carry large weight in analysis.
Such values need to be investigated – are they errors, are they "special cases" that offer
interesting insights, and how dependent are the results on these outliers?
46
Summarising associations between variables
We have already considered the situation of summarising the association between
categorical variables, which leaves two possibilities to consider. . .
47
Associations between quantitative variables
Consider a pair of samples from two quantitative variables
{(x1, y1), (x2, y2), . . . , (xn, yn)}.
We would like to understand how the x and y variables are related. Analysis of two
quantitative variables is commonly referred to as (linear) regression2.
Definition
An effective graphical display of the relationship between two quantitative variables is a
scatterplot – a plot of the yi against the xi.
2We won’t spend too much time on regression in MATH2801, in fact, we only cover a few concepts.
More on regression and linear modelling will be covered in MATH2831/2931.
48
Example
How did brain mass change as a function of body size in dinosaurs?
[Figure: "Brain size & body mass relationship in dinosaurs" – scatterplot of brain mass
(ml) against body mass (kg), both axes on log scales.]
49
Example
How does annual electricity usage change as temperature changes?
[Figure: "Temperature vs Electricity usage" – scatterplot of electricity usage (kWh per
month) against temperature (°C).]
50
Definition
An effective numerical summary of the linear relationship between two quantitative
variables is the correlation coefficient (r):
\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]
where $\bar{x}$ and $s_x$ are the sample mean and standard deviation of $x$, and similarly for $y$.
Here, r measures the strength and direction of the association between x and y:
Result
1. $|r| \le 1$.
2. $r = -1$ if and only if $y_i = a + b x_i$ for each $i$, for some constants $a, b$ such that $b < 0$.
3. $r = 1$ if and only if $y_i = a + b x_i$ for each $i$, for some constants $a, b$ such that $b > 0$.
*Can you prove these results? (Hint: consider using the square of $\frac{x_i - \bar{x}}{s_x} + \frac{y_i - \bar{y}}{s_y}$.)
51
These results imply that r measures the strength and direction of associations between x
and y:
• Strength of (linear) association – values closer to 1 or -1 suggest that the relationship
is closer to a straight line.
• Direction of association – values less than zero suggest a decreasing relationship,
values greater than zero suggest an increasing relationship
56
Some examples:
[Figure: four example scatterplots with correlations r = 0.88, r = −0.03, r = 0.2 and r = −0.88.]
57
Associations between categorical and quantitative variables
When studying whether a categorical and a quantitative variable are associated, an effective
strategy is to summarise the quantitative variable(s) separately for each level of the
categorical variable(s).
Example
Recall the guinea pig experiment – we want to explore whether there is an association
between a nicotine treatment (categorical) and number of errors made by offspring
(quantitative).
To summarise number of errors, we might typically use mean/sd and a boxplot.
To look at the association between number of errors and nicotine treatment, we
calculate the mean/sd of the number of errors separately for each of the two levels of
treatment (nicotine and no nicotine), and construct a boxplot for each level of treatment:
58
           $\bar{x}$     $s$
Sample A   23.4     12.3
Sample B   44.3     21.5

[Figure: side-by-side boxplots of the number of errors for Samples A and B.]
59
In the above example, the boxplots are presented on a common axis – sometimes this is
referred to as comparative boxplots or “side-by-side boxplots”.
An advantage of boxplots over histograms is that they can be quite narrow and hence
readily compared across many samples by stacking them side-by-side.
Some interesting extensions are reviewed in the article “40 years of boxplots” by Hadley
Wickham and Lisa Stryjewski at Rice University.
This idea can be naturally extended to when there are more than two variables.
60
Transforming data
Transforming data is typically done for one of two reasons – to change the scale data
were measured on (linear transformation), or to improve data properties (non-linear
transformation).
We will treat each of these in turn.
61
Linear transformation
Definition
A linear transformation of a sample from a quantitative variable, from
{x1, x2, . . . , xn} to {y1, y2, . . . , yn}, satisfies:
$y_i = a + b x_i$ for each $i$, with $b \neq 0$.
Linear transformation does not affect the shape of a distribution – only its location and
spread.
62
Effects of linear transformation on statistics
Result
Consider a linear transformation $y_i = a + b x_i$, $i = 1, \ldots, n$, and its effects on some
statistic calculated from the $x_i$ (call it $m_x$) and from the $y_i$ ($m_y$).
If
\[ m_y = a + b\, m_x \]
then we say that $m$ is a measure of location.
If $m_x$ is a measure of spread in the same units as $x$, then
\[ m_y = |b|\, m_x. \]
If $m_x$ is a measure of shape, then
\[ m_y = \begin{cases} m_x & \text{if } b > 0 \\ -m_x & \text{if } b < 0. \end{cases} \]
63
These results are necessary by definition – to be a measure of location of x, mx has to
“move with the data” under changes of scale.
For mx to be a measure of spread, it needs to be invariant under translation but it needs
to vary with resizing.
A measure of shape on the other hand should be invariant under any change of scale.
64
Examples
Show that, under linear transformation:
1. The sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ behaves as a measure of location.
2. The standard deviation $s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ behaves as a measure of spread.
3. The correlation coefficient
\[ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) \]
behaves as a measure of shape (consider linear transformations of the $x_i$ and of the $y_i$).
A numerical check in R is sketched below.
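These results can be proved algebraically; the following R sketch simply checks them
numerically for one arbitrary choice of a and b (here a = 2, b = −3):

  x <- c(8, 30, 29, 27, 26, 33, 0, 42, 21, 18)
  z <- c(37, 65, 34, 78, 45, 43, 21, 25, 20, 75)
  a <- 2; b <- -3
  y <- a + b * x
  c(mean(y), a + b * mean(x))    # equal: the mean behaves as a measure of location
  c(sd(y), abs(b) * sd(x))       # equal: the sd behaves as a measure of spread
  c(cor(y, z), -cor(x, z))       # equal (since b < 0): r behaves as a measure of shape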
65
Example
Dinosaur body mass (x) was measured (well, in this case it was estimated!) in kilograms.
If we transform the body mass data into grams instead (denoted y), how will the
following values calculated from y relate to their counterparts calculated from x?
1. y¯, mean body mass in grams.
2. sy, standard deviation of body mass in grams.
3. ry, the correlation between body mass (in grams) and brain mass.
What about when the above three statistics are calculated on log-transformed data,
rather than the raw data?
68
A particularly important example of a linear transformation is “standardisation” of data to
z-scores, as below.
Definition
The z-score, or standardised score of a quantitative variable is defined as
\[ z = \frac{x - \bar{x}}{s_x} \]
The z-score is a measure of unusualness – it measures how many standard deviations
above/below the mean a value is (extreme values being unusual ones, far from zero).
We will attach probabilities to precisely how unusual a given z-score is in the coming
chapters.
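In R, z-scores can be computed directly from the definition, or with scale() (a sketch
using Sample A):

  x <- c(8, 30, 29, 27, 26, 33, 0, 42, 21, 18)
  z <- (x - mean(x)) / sd(x)
  round(z, 2)
  round(as.vector(scale(x)), 2)   # scale() gives the same standardised values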
70
Examples
Sydney’s daily maximum temperature in March has a mean of about 25 degrees Celsius,
and a standard deviation of 2.2. Hence the following z-scores:
A March maximum temperature of 20 degrees in Sydney: z = −2.3.
A maximum temperature of 35 degrees in Sydney: z = 4.5.
Some other unusually large z-scores:
Sachin Tendulkar’s cricket batting average: z = 1.5, and
Don Bradman’s cricket batting average: z = 5.5.
Your winnings if you win the jackpot in the Powerball lotto: z = 7367!!!
71
Nonlinear transformations
If you have a quantitative variable that is strongly skewed, then the patterns you see (in
scatterplots or elsewhere) can be dominated by a few outlying values.
In such cases transforming data can be a good idea – applying a non-linear
transformation to a dataset will change its shape, often changing it for the better!
The most common transformation is a log-transformation.
You can use any base (bases 2 and 10 are the most common, for interpretability) – it doesn't
really matter which. . .
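A minimal R sketch of the effect, with made-up right-skewed data:

  set.seed(1)
  x <- exp(rnorm(200, mean = 2, sd = 1))   # strongly right-skewed data
  par(mfrow = c(1, 2))
  hist(x, main = "Untransformed")          # dominated by a few large values
  hist(log(x), main = "log-transformed")   # much more symmetric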
72
Example
Consider brain-mass – body-mass data for dinosaurs, reptiles and birds.
Compare scatterplots of log-transformed data and untransformed data below:
[Figure: scatterplots of brain mass (ml) against body mass for dinosaurs, reptiles and birds –
left panel with untransformed variables, right panel with both variables log-transformed
(log scales).]
In the untransformed plot, little can be seen except for three outlying values
(Tyrannosaurus, Carcharodontosaurus and Allosaurus). On transformation, a lot of
interesting structure becomes apparent.
73
One reason why the log-transformation often works so well in revealing structure is the
special property: log(ab) = log(a) + log(b).
This can be understood as taking multiplicative processes and making them additive –
that is, a variable that “grows” in a multiplicative way (e.g. virus transmission, account
balance, size, profit, population size, etc.) can be understood as growing in an additive
way once log-transformed.
This is useful because in graphs (and in most analyses) additive patterns are the easiest
to perceive.
74
Result
Let $y = h(x)$ be some non-linear transformation of real-numbered values $x$. In most
cases, we have
\[ \bar{y} \neq h(\bar{x}). \]
This point should be kept in mind when analysing transformed data – the mean of
transformed data is a different quantity to the mean of the originally observed variable,
and they do not even have a one-to-one correspondence in most cases!
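A quick numerical illustration in R, using Sample A with the zero removed (since log(0)
is undefined):

  x <- c(8, 30, 29, 27, 26, 33, 42, 21, 18)
  mean(log(x))   # about 3.17 -- the mean of the log-transformed data
  log(mean(x))   # about 3.26 -- the log of the mean: a different quantity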
75