STAD37H3 Multivariate
Statistical Analysis
Fall 2023
Instructor: Mahinda Samarakoon
Multivariate Data
Chap 1
Multivariate data and data matrix
Descriptive Statistics (Summary statistics)
• Sample means: $\bar{x}_k = \frac{1}{n}\sum_{j=1}^{n} x_{jk}$, $k = 1, 2, \dots, p$ (sample mean of the $k$th variable)
• Sample covariances: $s_{ik} = \frac{1}{n}\sum_{j=1}^{n}(x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)$, $i = 1, 2, \dots, p$, $k = 1, 2, \dots, p$, is the sample covariance between the $i$th and $k$th variables
• Note: Sometimes we (and our textbook) divide by $n - 1$ instead of $n$. I will highlight when we do that.
• Note: $s_{ik} = \frac{1}{n}\sum_{j=1}^{n} x_{ji}x_{jk} - \bar{x}_i\bar{x}_k$
• Sample correlations:
$$r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}} = \frac{\sum_{j=1}^{n}(x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}{\sqrt{\sum_{j=1}^{n}(x_{ji} - \bar{x}_i)^2}\,\sqrt{\sum_{j=1}^{n}(x_{jk} - \bar{x}_k)^2}}$$
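• A quick R sketch (using the bookstore data that appears later in these slides) checking these definitions against R's built-ins; note that cov() and cor() divide by n - 1, while the formulas above divide by n:
• x <- c(42, 52, 48, 58)
• y <- c(4, 5, 4, 3)
• n <- length(x)
• s_xy <- sum((x - mean(x)) * (y - mean(y))) / n   # divide-by-n covariance
• all.equal(s_xy, cov(x, y) * (n - 1) / n)          # TRUE: they differ only by the divisor
• cor(x, y)                                         # the divisor cancels in r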
Properties of Correlation (we have seen these
in STAC67H3)
• Correlation requires both variables to be quantitative.
• $r_{kk} = 1$: the correlation of a variable with itself is 1.
• Because $r$ uses standardized values of the observations, its value is unchanged under linear transformations $x^* = ax + b$ and $y^* = cy + d$, provided that the constants $a$ and $c$ have the same sign.
• Positive r indicates positive association between the variables and negative r indicates negative association.
• $r$ is always a number between $-1$ and $1$.
• Values of $r$ near 0 indicate a very weak linear relationship.
• The strength of the linear relationship increases as $r$ moves away from 0.
• Values of $r$ close to $-1$ or $1$ indicate that the points lie close to a straight line.
• $r$ is not resistant: it is strongly affected by a few outliers.
• When you calculate a correlation, it doesn't matter which variable is $x$ and which is $y$ (see the R check below).
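• A minimal R sketch (made-up numbers) illustrating the invariance and symmetry properties:
• x <- c(1, 3, 4, 6, 8)
• y <- c(2, 2, 5, 7, 9)
• cor(x, y)
• cor(3*x + 10, 0.5*y - 2)   # unchanged: a = 3 and c = 0.5 have the same sign
• cor(-2*x, y)               # sign flips when a and c have opposite signs
• cor(y, x)                  # same as cor(x, y)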
Descriptive Statistics (Summary statistics)
• Sample mean vector
$$\bar{\mathbf{x}} = \begin{pmatrix} \bar{x}_1 \\ \bar{x}_2 \\ \vdots \\ \bar{x}_p \end{pmatrix}$$
• Note: $\bar{\mathbf{x}} = \frac{1}{n}\sum_{j=1}^{n}\mathbf{x}_j = \frac{1}{n}X'\mathbf{1}$
Descriptive Statistics (Summary statistics)
• Sample covariance matrix
$$S_n = (s_{ik})_{p \times p} = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix}$$
• Note:
$$\begin{aligned}
S_n &= \frac{1}{n}\sum_{j=1}^{n}(\mathbf{x}_j - \bar{\mathbf{x}})(\mathbf{x}_j - \bar{\mathbf{x}})' \\
&= \frac{1}{n}\sum_{j=1}^{n}\mathbf{x}_j\mathbf{x}_j' - \bar{\mathbf{x}}\bar{\mathbf{x}}' \\
&= \frac{1}{n}X'X - \bar{\mathbf{x}}\bar{\mathbf{x}}' \\
&= \frac{1}{n}X'X - \left(\frac{1}{n}X'\mathbf{1}\right)\left(\frac{1}{n}X'\mathbf{1}\right)' \quad \left(\because\ \bar{\mathbf{x}} = \frac{1}{n}\sum_{j=1}^{n}\mathbf{x}_j = \frac{1}{n}X'\mathbf{1}\right) \\
&= \frac{1}{n}X'X - \frac{1}{n^2}X'\mathbf{1}\mathbf{1}'X \\
&= \frac{1}{n}X'\left(I - \frac{1}{n}\mathbf{1}\mathbf{1}'\right)X \\
&= \frac{1}{n}X'C_nX,
\end{aligned}$$
where $C_n = I - \frac{1}{n}\mathbf{1}\mathbf{1}' = I - \frac{1}{n}J$ is the centering matrix.
Descriptive Statistics (Summary statistics)
• Note: the centering matrix $C_n = I - \frac{1}{n}\mathbf{1}\mathbf{1}' = I - \frac{1}{n}J$ is a symmetric matrix (i.e. $C_n' = C_n$)
• Note: $S_n = \frac{1}{n}X_c'X_c$, where $X_c = C_nX$ is the centered data matrix
• Exercises
• 1) Prove that the centering matrix $C_n$ is an idempotent matrix (i.e. $C_nC_n = C_n$).
• 2) Prove that $C_n\mathbf{1} = \mathbf{0}$, where $\mathbf{0}$ is the $n \times 1$ vector of zeros.
• 3) If $\mathbf{y} = (y_1, y_2, \dots, y_n)' \in \mathbb{R}^n$, show that
  a) $C_n\mathbf{y} = \mathbf{y} - \bar{y}\mathbf{1}$, where $\bar{y} = \frac{1}{n}\sum_{j=1}^{n}y_j$
  b) $\mathbf{y}'C_n\mathbf{y} = \sum_{j=1}^{n}(y_j - \bar{y})^2$
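• These identities are easy to sanity-check numerically; a minimal R sketch (n = 4 and y chosen arbitrarily):
• n <- 4
• one <- matrix(1, n, 1)
• C <- diag(n) - (1/n) * one %*% t(one)    # centering matrix
• all.equal(C %*% C, C)                    # idempotent (exercise 1)
• C %*% one                                # the zero vector (exercise 2)
• y <- matrix(c(3, 1, 4, 1), n, 1)
• all.equal(C %*% y, y - mean(y) * one)    # exercise 3a
• all.equal(drop(t(y) %*% C %*% y), sum((y - mean(y))^2))   # exercise 3b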
Descriptive Statistics (Summary statistics)
• Note: The diagonal elements of $S_n$ are the sample variances of the variables and so are non-negative
• $\Rightarrow \operatorname{tr}(S_n) \ge 0$
• $\Rightarrow \sum_{i=1}^{p}\lambda_i \ge 0$, where $\lambda_1, \lambda_2, \dots, \lambda_p$ are the eigenvalues of $S_n$
• Note: For any $p \times p$ matrix $A$ with eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_p$, $\operatorname{tr}(A) = \sum_{i=1}^{p}\lambda_i$ and $\det(A) = \prod_{i=1}^{p}\lambda_i$
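• A minimal R sketch of the trace/determinant facts, using the 3 × 3 covariance matrix from the example below:
• A <- matrix(c(4, 1, 2, 1, 9, -3, 2, -3, 25), 3, 3)
• lam <- eigen(A)$values
• all.equal(sum(diag(A)), sum(lam))   # tr(A) equals the sum of the eigenvalues
• all.equal(det(A), prod(lam))        # det(A) equals the product of the eigenvalues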
Sample Correlation matrix
• $R = (r_{ik})_{p \times p}$, where
$$r_{ik} = \frac{s_{ik}}{\sqrt{s_{ii}}\sqrt{s_{kk}}} = \frac{\sum_{j=1}^{n}(x_{ji} - \bar{x}_i)(x_{jk} - \bar{x}_k)}{\sqrt{\sum_{j=1}^{n}(x_{ji} - \bar{x}_i)^2}\,\sqrt{\sum_{j=1}^{n}(x_{jk} - \bar{x}_k)^2}}$$
• Note: If $R = I$, then the variables are uncorrelated
Sample Correlation matrix
• We can calculate the correlation matrix using
$$R = \frac{1}{n}X_s'X_s$$
• where $X_s = C_nXD^{-1}$
• $C_n = I - \frac{1}{n}\mathbf{1}\mathbf{1}' = I - \frac{1}{n}J$ is the centering matrix
• $D = \operatorname{diag}(s_1, s_2, \dots, s_p)$, where $s_i = \sqrt{s_{ii}}$ is the sample standard deviation of the $i$th variable.
Sample Correlation Matrix
• Note: $R = D^{-1}S_nD^{-1}$ and $S_n = DRD$
• E.g.
$$S_n = \begin{pmatrix} 4 & 1 & 2 \\ 1 & 9 & -3 \\ 2 & -3 & 25 \end{pmatrix}, \qquad D = \begin{pmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 5 \end{pmatrix}$$
$$R = D^{-1}S_nD^{-1} = \begin{pmatrix} 1/2 & 0 & 0 \\ 0 & 1/3 & 0 \\ 0 & 0 & 1/5 \end{pmatrix}\begin{pmatrix} 4 & 1 & 2 \\ 1 & 9 & -3 \\ 2 & -3 & 25 \end{pmatrix}\begin{pmatrix} 1/2 & 0 & 0 \\ 0 & 1/3 & 0 \\ 0 & 0 & 1/5 \end{pmatrix} = \begin{pmatrix} 1 & 1/6 & 1/5 \\ 1/6 & 1 & -1/5 \\ 1/5 & -1/5 & 1 \end{pmatrix}$$
• E.g. $x_2$ and $x_3$ are negatively correlated ($r_{23} = -1/5$).
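• This calculation is quick to verify in R (a sketch using base R only):
• Sn <- matrix(c(4, 1, 2, 1, 9, -3, 2, -3, 25), 3, 3)
• D <- diag(sqrt(diag(Sn)))        # diag(2, 3, 5)
• R <- solve(D) %*% Sn %*% solve(D)
• R                                # matches the matrix above
• all.equal(D %*% R %*% D, Sn)     # and S_n = D R D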
Descriptive Statistics (Summary statistics)
• Example (Bookstore example): the data are the four receipts entered as x1 and x2 in the R code that follows.
Calculations using R (the same data as in the previous example)
• # Some calculations with R
• x1 <- c(42, 52, 48, 58)
• x2 <- c(4, 5, 4, 3)
• X <- cbind(x1,x2)
• X
• x1 x2
• [1,] 42 4
• [2,] 52 5
• [3,] 48 4
• [4,] 58 3
• dim(X)
• [1] 4 2
Calculations using R (the same data as in the previous example)
• #Calculating sample mean vector
• #Method 1
• rowMeans(X) # if you want the row means, but that is not what we want now
• xbar <- colMeans(X)
• xbar <- as.matrix(xbar)
• xbar
• [,1]
• x1 50
• x2 4
• #Method 2
• xbar <- as.matrix(apply(X,2,mean)) # 2 for column means and 1 for row means
• xbar
• [,1]
• x1 50
• x2 4
• #Method 3
• xbar <- as.matrix(c(mean(X[,1]), mean(X[,2])))
• xbar
• [,1]
• x1 50
• x2 4
Calculations using R (the same data as in the previous example)
• # Method 4, use the formula
• n <- nrow(X)
• one <- as.matrix(c(rep(1,n)))
• one
• [,1]
• [1,] 1
• [2,] 1
• [3,] 1
• [4,] 1
• xbar <- (1/n)*t(X)%*%one
• xbar
• [,1]
• x1 50
• x2 4
• # Covariance matrix
• n <- nrow(X)
• one <- as.matrix(c(rep(1,n)))
• I <- diag(n)
• I
• [,1] [,2] [,3] [,4]
• [1,] 1 0 0 0
• [2,] 0 1 0 0
• [3,] 0 0 1 0
• [4,] 0 0 0 1
• J <- one%*%t(one)
• J
• [,1] [,2] [,3] [,4]
• [1,] 1 1 1 1
• [2,] 1 1 1 1
• [3,] 1 1 1 1
• [4,] 1 1 1 1
• C <- I-(1/n)*J
• Xc <- C%*%X
• Sn <- (1/n)*t(Xc)%*%Xc
• Sn
• x1 x2
• x1 34.0 -1.5
• x2 -1.5 0.5
• # Method 2 (easier)
• cov(X)
• x1 x2
• x1 45.33333 -2.0000000
• x2 -2.00000 0.6666667
• # R divides by n-1
• Sn <- ((n-1)/n)*cov(X)
• Sn
• x1 x2
• x1 34.0 -1.5
• x2 -1.5 0.5
• # Correlation matrix
• si <- sqrt(diag(Sn))
• D <- diag(si)
• D
• [,1] [,2]
• [1,] 5.830952 0.0000000
• [2,] 0.000000 0.7071068
• Dinv <- solve(D)
• Xs <- C%*%X%*%Dinv
• R <- (1/n)*t(Xs)%*%Xs
• R
• [,1] [,2]
• [1,] 1.0000000 -0.3638034
• [2,] -0.3638034 1.0000000
• # Method 2 (easier)
• R <- cor(X)
• R
• [,1] [,2]
• [1,] 1.0000000 -0.3638034
• [2,] -0.3638034 1.0000000
Graphical methods
• Data quality checking
• Graphical description of the data should be done before formal data analysis
• Tools to perform data quality checks consist of univariate and multivariate tools
• Univariate tools: histograms, stemplots, boxplots, etc.
• Bivariate: pairwise scatterplots
• Identify (by eye): outliers, relationships, groupings, etc., as we discussed in STAC67H3
• Normal quantile plots: check for normality of the data
Graphical methods
• Example: the paper-quality data in t1_2.txt, read and summarized with R below.
Graphical methods
• paper <- read.table("t1_2.txt", header = FALSE)
• head(paper)
• V1 V2 V3
• 1 0.801 121.41 70.42
• 2 0.824 127.70 72.47
• 3 0.841 129.20 78.20
• 4 0.816 131.80 74.89
• 5 0.840 135.10 71.21
• 6 0.842 131.50 78.39
• cols <- c("Density","Strength(MD)","Strength(CD)")
• colnames(paper) <- cols
• head(paper)
• Density Strength(MD) Strength(CD)
• 1 0.801 121.41 70.42
• 2 0.824 127.70 72.47
• 3 0.841 129.20 78.20
• 4 0.816 131.80 74.89
• 5 0.840 135.10 71.21
• 6 0.842 131.50 78.39
• # Scatterplot matrix
• plot(paper)
• par(mfrow=c(3,3))
• boxplot(paper$Density, main = "Density")
• boxplot(paper$`Strength(MD)`, main = "Strength(MD)")
• boxplot(paper$`Strength(CD)`, main = "Strength(CD)")
• #-----------------------------------------------
• hist(paper$Density, main = "Density")
• hist(paper$`Strength(MD)`, main = "Strength(MD)")
• hist(paper$`Strength(CD)`, main = "Strength(CD)")
• #------------------------------------------------
• qqnorm(paper$Density, main = "Density")
• qqnorm(paper$`Strength(MD)`, main = "Strength(MD)")
• qqnorm(paper$`Strength(CD)`, main = "Strength(CD)")
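• One possible addition (not in the original code): base R's qqline() draws a reference line through the quartiles of a normal quantile plot, which makes departures from normality easier to judge, e.g.
• qqnorm(paper$Density, main = "Density")
• qqline(paper$Density)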
Andrews Curves (D. Andrews, 1972)
• The idea of coding and representing multivariate data by curves was suggested by
Andrews (1972).
• Each multivariate observation $\mathbf{x}_j = (x_{j1}, x_{j2}, \dots, x_{jp})'$ is transformed into a curve as follows: the observation represents the coefficients of a so-called Fourier series,
$$f_j(t) = \frac{x_{j1}}{\sqrt{2}} + x_{j2}\sin t + x_{j3}\cos t + x_{j4}\sin 2t + x_{j5}\cos 2t + \cdots, \qquad t \in [-\pi, \pi]$$
Andrews Curves (D. Andrews, 1972)
Swiss bank notes (identify counterfeit note?)
R library Andrews has this data set called “banknote”
The data set has data on 200 notes
Variables
-conterfeit: Wether a banknote is conterfeit (1) or genuine (0)
-Length: Length of bill (mm)
-Left: Width of left edge (mm)
-Right: Width of right edge (mm)
-Bottom: Bottom margin width (mm)
-Top: Top margin width (mm)
-Diagonal: Length of diagonal (mm)
33
Andrews Curves (D. Andrews, 1972)
• Let us take the 96th observation of the Swiss bank note data set,
$$\mathbf{x}_{96} = (215.6, 129.9, 129.9, 9.0, 9.5, 141.7)'$$
• The Andrews curve is given by
$$f_{96}(t) = \frac{215.6}{\sqrt{2}} + 129.9\sin t + 129.9\cos t + 9.0\sin 2t + 9.5\cos 2t + 141.7\sin 3t, \qquad t \in [-\pi, \pi]$$
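• A minimal sketch in base R that plots this single curve (f96 is just the formula above written as a function):
• f96 <- function(t) 215.6/sqrt(2) + 129.9*sin(t) + 129.9*cos(t) +
•   9.0*sin(2*t) + 9.5*cos(2*t) + 141.7*sin(3*t)
• curve(f96, from = -pi, to = pi, xlab = "t", ylab = "f96(t)")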
Andrews Curves (R code, using R library andrews)
• library(andrews)
• > x <- banknote # data set
• > head(x)
• Status Length Left Right Bottom Top Diagonal
• 1 genuine 214.8 131.0 131.1 9.0 9.7 141.0
• 2 genuine 214.6 129.7 129.7 8.1 9.5 141.7
• 3 genuine 214.8 129.7 129.7 8.7 9.6 142.2
• 4 genuine 214.8 129.7 129.6 7.5 10.4 142.0
• 5 genuine 215.0 129.6 129.7 10.4 7.7 141.8
• 6 genuine 215.7 130.8 130.5 9.0 10.1 141.4
• > notesdata <- x[90:101,] # I am selecting only observations 90 to 101
• > andrews(notesdata, clr=1, ymax=4)
[Resulting plot: Andrews curves for observations 90 to 101 of the banknote data]
Andrews Curves, Some important notes
• R and other software scale the data (variables) to have the same range before calculating the Andrews function.
• E.g. to scale to the range [0, 1], replace any column x by (x - min(x))/(max(x) - min(x)).
• The shape of these curves depends on the order of the variables.
• If $\mathbf{x}$ is high-dimensional (i.e. for large $p$), the last variables in $\mathbf{x}$ have only a small visible contribution to the shape of the curve.
• To overcome this problem, Andrews suggested using an order suggested by Principal Component Analysis (we will discuss this topic later).
• When $n$ is large, there may be too many curves on the graph, making it difficult to interpret.
You can write your own code
• > library(andrews)
• > #x <- read.csv("swiss.csv", header = TRUE)
• > x <- banknote
• > notesdata <- x[90:101,2:7] # observations 90 to 101, numeric columns only
• > # Now draw Andrews curves for the selected data set, i.e. notesdata
• > x <- notesdata
• > xs <- scale(x, center=apply(x, 2, min), scale=apply(x, 2, max)-apply(x, 2, min))
• > # Andrews curve f(t) for an observation vector v in R^6
• > f <- function(t,v){
• v[1]*(1/sqrt(2))+v[2]*sin(t)+v[3]*cos(t)+v[4]*sin(2*t)+v[5]*cos(2*t)+v[6]*sin(3*t)
• }
• # set up plot window, but no plot yet
• > plot(0, 0, xlim = c(-pi, pi), ylim = c(-3, 3), xlab = "t", ylab = "Andrews Curves", main = "", type
= "n")
• > # type="n" means no plot yet
• > # now add the Andrews curves for each observation
• n <- nrow(x)
• > t <- seq(-pi, pi, len = 100)
• > dim(t) <- length(t)
• > for (i in 1:n) {
• y <- apply(t, MARGIN = 1, FUN = f, v = xs[i, ])
• lines(t, y)
• }
Three-Dimensional Scatterplot in R
• R has a built in data set called mtcars
• A data frame with 32 observations on 11 (numeric) variables.
• [, 1] mpg Miles/(US) gallon
• [, 2] cyl Number of cylinders
• [, 3] disp Displacement (cu.in.)
• [, 4] hp Gross horsepower
• [, 5] drat Rear axle ratio
• [, 6] wt Weight (1000 lbs)
• [, 7] qsec 1/4 mile time
• [, 8] vs Engine (0 = V-shaped, 1 = straight)
• [, 9] am Transmission (0 = automatic, 1 = manual)
• [,10] gear Number of forward gears
• [,11] carb Number of carburetors
Three-Dimensional Scatterplot in R
• > head(mtcars)
• > class(mtcars)
• [1] "data.frame"
• > dim(mtcars)
• [1] 32 11
Three-Dimensional Scatterplot in R
• > library(scatterplot3d)
• > scatterplot3d(mtcars$hp, mtcars$wt,
mtcars$mpg,xlab="HP", ylab="WT", zlab="MPG")
Three-Dimensional Scatterplot in R
• > library(scatterplot3d)
• > scatterplot3d(mtcars$hp, mtcars$wt,
mtcars$mpg,xlab="HP", ylab="WT", zlab="MPG", type="h")
Three-Dimensional Scatterplot in R
• > library(scatterplot3d)
• > scatterplot3d(mtcars$hp, mtcars$wt, mtcars$mpg,xlab="HP",
ylab="WT", zlab="MPG", type="h", color="green")
Three-Dimensional Scatterplot in R
• > mtcars$pcol[mtcars$cyl==4] <- "red"
• > mtcars$pcol[mtcars$cyl==6] <- "blue"
• > mtcars$pcol[mtcars$cyl==8] <- "green"
• > scatterplot3d(mtcars$hp, mtcars$wt, mtcars$mpg, xlab="HP",
ylab="WT", zlab="MPG", type="h", color=mtcars$pcol)
Three-Dimensional Scatterplot in R
• > mtcars$pcol[mtcars$cyl==4] <- "red"
• > mtcars$pcol[mtcars$cyl==6] <- "blue"
• > mtcars$pcol[mtcars$cyl==8] <- "green"
• > sp3d <- scatterplot3d(mtcars$hp, mtcars$wt, mtcars$mpg,
xlab="HP", ylab="WT", zlab="MPG", type="h", color=mtcars$pcol)
• > fit <- lm(mpg ~ hp + wt, data=mtcars)
• > sp3d$plane3d(fit)
Distance (Section 1.5, p. 30)
• Most multivariate techniques are based on the concept of distance.
• The Euclidean distance between the points $O = (0, 0, \dots, 0)$ and $P = (x_1, x_2, \dots, x_p)$ is
$$d(O, P) = \sqrt{x_1^2 + x_2^2 + \cdots + x_p^2} = \|\mathbf{x}\| = \sqrt{\mathbf{x}'\mathbf{x}}, \quad \text{where } \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_p \end{pmatrix}$$
• The Euclidean distance between the points $P = (x_1, x_2, \dots, x_p)$ and $Q = (y_1, y_2, \dots, y_p)$ is
$$d(P, Q) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2} = \|\mathbf{x} - \mathbf{y}\| = \sqrt{(\mathbf{x} - \mathbf{y})'(\mathbf{x} - \mathbf{y})}$$
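• A minimal R sketch (points chosen for easy arithmetic) computing the Euclidean distance from the matrix formula and with the built-in dist():
• x <- c(1, 2, 3)
• y <- c(4, 6, 3)
• sqrt(t(x - y) %*% (x - y))   # matrix formula: 5
• dist(rbind(x, y))            # built-in: also 5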
Distance
• When the variables have different variances, the Euclidean distance is not appropriate.
• Statistical distance: sometimes (if the coordinate variables vary independently) statistical distance is a more appropriate measure of distance.
• The statistical distance between the points $O = (0, 0, \dots, 0)$ and $P = (x_1, x_2, \dots, x_p)$ is
$$d(O, P) = \sqrt{\frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} + \cdots + \frac{x_p^2}{s_{pp}}}$$
• The statistical distance between the points $P = (x_1, x_2, \dots, x_p)$ and $Q = (y_1, y_2, \dots, y_p)$ is
$$d(P, Q) = \sqrt{\frac{(x_1 - y_1)^2}{s_{11}} + \frac{(x_2 - y_2)^2}{s_{22}} + \cdots + \frac{(x_p - y_p)^2}{s_{pp}}}$$
• In general, the statistical distance between the points $P$ and $Q$ for situations in which the variables are correlated has the general form
$$d(P, Q) = \sqrt{(\mathbf{x} - \mathbf{y})'A(\mathbf{x} - \mathbf{y})},$$
where $A$ is a symmetric positive definite matrix (i.e. $\mathbf{v}'A\mathbf{v} > 0$ for all $\mathbf{v} \neq \mathbf{0}$).
• Euclidean distance is the special case of this with $A = I$.
• If $s_{11} = s_{22} = \cdots = s_{pp}$, then Euclidean distance is appropriate.
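• A sketch in R (toy numbers; my own note, not from the slides): taking $A = S^{-1}$ gives the Mahalanobis distance, which base R computes with mahalanobis() (it returns the squared distance):
• S <- matrix(c(4, 1, 1, 9), 2, 2)           # a 2x2 covariance matrix
• x <- c(1, 2); y <- c(3, 5)
• A <- solve(S)                               # A = S^{-1}
• sqrt(t(x - y) %*% A %*% (x - y))            # general form above
• sqrt(mahalanobis(x, center = y, cov = S))   # same value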
Distance
• Note: These distance measures satisfy the following properties:
• $d(P, Q) = d(Q, P)$
• $d(P, Q) > 0$ if $P \neq Q$
• $d(P, Q) = 0$ if $P = Q$
• $d(P, Q) \le d(P, R) + d(R, Q)$ for any three points $P, Q, R$ (the triangle inequality)
Distance
• Note: For $p = 2$, the set of all points $(x_1, x_2)$ at a constant Euclidean distance from the origin is a circle.
• E.g. all points with $d^2(O, P) = x_1^2 + x_2^2 = c^2$ form a circle of radius $c$.
• For $p = 2$, the set of all points $(x_1, x_2)$ at a constant statistical distance from the origin is an ellipse.
• E.g. points with
$$d^2(O, P) = \frac{x_1^2}{s_{11}} + \frac{x_2^2}{s_{22}} = c^2$$
form an ellipse with semi-axes $c\sqrt{s_{11}}$ and $c\sqrt{s_{22}}$ (see the sketch below).
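• A minimal R sketch tracing this contour parametrically (my own choice of s11 = 4, s22 = 1, c = 1):
• theta <- seq(0, 2*pi, length.out = 200)
• s11 <- 4; s22 <- 1; c0 <- 1      # c0, to avoid masking R's c()
• x1 <- c0*sqrt(s11)*cos(theta)    # then x1^2/s11 + x2^2/s22 = c0^2
• x2 <- c0*sqrt(s22)*sin(theta)
• plot(x1, x2, type = "l", asp = 1, xlab = "x1", ylab = "x2")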