Python assignment help: STAT 3010 | 学霸联盟

Date: 2022-04-22

Project Example: Classifying Flowers
STAT 3010
3/21/2022
Contents
Introduction
Data Structure and Visualisation
    Summary by Species
    Boxplots
    Scatter plots
    Overall Summary
If we only had one? ANOVA
    Test for Normality
    Test for Equal Variance
    ANOVA
    Multiple Comparison
Conclusion
library(tidyverse)
library(psych)
library(knitr)
library(cowplot)
library(GGally)
library(rstatix)
Introduction
Figure 1: Setosa
In this project we try to determine the physical differences between three species of Iris
using measurements of their flowers. Both the length and the width of the sepal and the petal were
measured and recorded for 50 flowers of each species.
Data Structure and Visualisation
General summary
kable(summary(iris)) #overall summary
Sepal.Length    Sepal.Width     Petal.Length    Petal.Width     Species
Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
db <- describeBy(iris,group=iris$Species) #Summaries by Group
Summary by Species
Setosa
kable(db$setosa)
              vars   n   mean         sd  median  trimmed      mad  min  max  range       skew    kurtosis         se
Sepal.Length     1  50  5.006  0.3524897     5.0   5.0025  0.29652  4.3  5.8    1.5  0.1129778  -0.4508724  0.0498496
Sepal.Width      2  50  3.428  0.3790644     3.4   3.4150  0.37065  2.3  4.4    2.1  0.0387295   0.5959507  0.0536078
Petal.Length     3  50  1.462  0.1736640     1.5   1.4600  0.14826  1.0  1.9    0.9  0.1000954   0.6539303  0.0245598
Petal.Width      4  50  0.246  0.1053856     0.2   0.2375  0.00000  0.1  0.6    0.5  1.1796328   1.2587179  0.0149038
Species*         5  50  1.000  0.0000000     1.0   1.0000  0.00000  1.0  1.0    0.0        NaN         NaN  0.0000000
Versicolor
kable(db$versicolor)
              vars   n   mean         sd  median  trimmed      mad  min  max  range        skew    kurtosis         se
Sepal.Length     1  50  5.936  0.5161711    5.90   5.9375  0.51891  4.9  7.0    2.1   0.0991393  -0.6939138  0.0729976
Sepal.Width      2  50  2.770  0.3137983    2.80   2.7800  0.29652  2.0  3.4    1.4  -0.3413644  -0.5493203  0.0443778
Petal.Length     3  50  4.260  0.4699110    4.35   4.2925  0.51891  3.0  5.1    2.1  -0.5706024  -0.1902555  0.0664554
Petal.Width      4  50  1.326  0.1977527    1.30   1.3250  0.22239  1.0  1.8    0.8  -0.0293338  -0.5873144  0.0279665
Species*         5  50  2.000  0.0000000    2.00   2.0000  0.00000  2.0  2.0    0.0         NaN         NaN  0.0000000
Virginica
kable(db$virginica)
              vars   n   mean         sd  median  trimmed      mad  min  max  range        skew    kurtosis         se
Sepal.Length     1  50  6.588  0.6358796    6.50   6.5725  0.59304  4.9  7.9    3.0   0.1110286  -0.2032597  0.0899270
Sepal.Width      2  50  2.974  0.3224966    3.00   2.9625  0.29652  2.2  3.8    1.6   0.3442849   0.3803832  0.0456079
Petal.Length     3  50  5.552  0.5518947    5.55   5.5100  0.66717  4.5  6.9    2.4   0.5169175  -0.3651161  0.0780497
Petal.Width      4  50  2.026  0.2746501    2.00   2.0325  0.29652  1.4  2.5    1.1  -0.1218119  -0.7539586  0.0388414
Species*         5  50  3.000  0.0000000    3.00   3.0000  0.00000  3.0  3.0    0.0         NaN         NaN  0.0000000
Boxplots
p <- ggplot(iris,aes(x=Species,fill=Species))
p1 <- p+geom_boxplot(aes(y=Sepal.Length))+ theme(legend.position = "none")
p2 <- p+geom_boxplot(aes(y=Sepal.Width))+ theme(legend.position = "none")
p3 <- p+geom_boxplot(aes(y=Petal.Length))+ theme(legend.position = "none")
p4 <- p+geom_boxplot(aes(y=Petal.Width))+ theme(legend.position = "none")
plot_grid(p1,p2,p3,p4)
[Figure: boxplots of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width by Species (setosa, versicolor, virginica), arranged in a grid]
From the boxplots we can see clear differences between the species on several variables. This
is strong evidence that we will be able to find significant differences between the groups.
Scatter plots
q <- ggplot(iris,aes(colour=Species))
# legend.position takes the lowercase value "none", as in the boxplot code above
q1 <- q+geom_point(aes(x=Sepal.Length,y=Sepal.Width))+labs(title="Sepal Length vs Sepal Width")+theme(legend.position = "none")
q2 <- q+geom_point(aes(x=Petal.Length,y=Petal.Width))+labs(title="Petal Length vs Petal Width")+theme(legend.position = "none")
plot_grid(q1,q2)
[Figure: two scatter plots coloured by Species. Left: "Sepal Length vs Sepal Width" (x = Sepal.Length, y = Sepal.Width). Right: "Petal Length vs Petal Width" (x = Petal.Length, y = Petal.Width)]
Overall Summary
ggpairs(iris,aes(colour=Species))
[Figure: ggpairs plot matrix of Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species, coloured by Species]
Correlations reported in the upper panels of the figure:
Pair                            Corr (overall)  setosa    versicolor  virginica
Sepal.Length vs Sepal.Width     -0.118          0.743***  0.526***    0.457***
Sepal.Length vs Petal.Length    0.872***        0.267.    0.754***    0.864***
Sepal.Width vs Petal.Length     -0.428***       0.178     0.561***    0.401**
Sepal.Length vs Petal.Width     0.818***        0.278.    0.546***    0.281*
Sepal.Width vs Petal.Width      -0.366***       0.233     0.664***    0.538***
Petal.Length vs Petal.Width     0.963***        0.332*    0.787***    0.322*
If we only had one? ANOVA
We wish to find the measurement that differs most among our given species. We will use ANOVA to
determine whether each measurement differs between species. But first we check some of the assumptions
of ANOVA.
Test for Normality
iris %>% group_by(Species) %>% shapiro_test(Sepal.Width) # test for normality in each group
## # A tibble: 3 x 4
## Species variable statistic p
##
## 1 setosa Sepal.Width 0.972 0.272
## 2 versicolor Sepal.Width 0.974 0.338
## 3 virginica Sepal.Width 0.967 0.181
Since our p-values are all greater than α = 0.05, we fail to reject the null hypothesis and proceed with the
assumption of normality.
g_iris<- iris %>% group_by(Species)
g_iris %>% shapiro_test(Sepal.Length)# test for normality in each group sepal length
## # A tibble: 3 x 4
## Species variable statistic p
##
## 1 setosa Sepal.Length 0.978 0.460
## 2 versicolor Sepal.Length 0.978 0.465
## 3 virginica Sepal.Length 0.971 0.258
g_iris %>% shapiro_test(Petal.Length) # test for normality in each group, petal length
## # A tibble: 3 x 4
## Species variable statistic p
##
## 1 setosa Petal.Length 0.955 0.0548
## 2 versicolor Petal.Length 0.966 0.158
## 3 virginica Petal.Length 0.962 0.110
g_iris %>% shapiro_test(Petal.Width) # test for normality in each group, petal width
## # A tibble: 3 x 4
## Species variable statistic p
##
## 1 setosa Petal.Width 0.800 0.000000866
## 2 versicolor Petal.Width 0.948 0.0273
## 3 virginica Petal.Width 0.960 0.0870
Here we find that the normality assumption is violated only for Petal.Width, in the setosa (p < 0.001) and
versicolor (p = 0.027) groups. Since the violation is confined to one variable, we will proceed.
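The per-variable tests above can also be collected into a single table of p-values. The following is a minimal sketch using base R's `shapiro.test` instead of `rstatix`; the names `measurements` and `shapiro_p` are introduced here for illustration only.

```r
# Shapiro-Wilk p-values for every measurement within every species (base R only)
measurements <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

shapiro_p <- sapply(measurements, function(v) {
  # one p-value per species for the measurement v
  tapply(iris[[v]], iris$Species, function(x) shapiro.test(x)$p.value)
})

round(shapiro_p, 4)   # rows: species, columns: measurements
shapiro_p < 0.05      # TRUE marks a group where normality is rejected at the 5% level
```

Laying the results out as a species-by-measurement matrix makes it easy to see at a glance which groups fail the test.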
Test for Equal Variance
bartlett.test(Sepal.Width~Species,iris)
##
## Bartlett test of homogeneity of variances
##
## data: Sepal.Width by Species
## Bartlett's K-squared = 2.0911, df = 2, p-value = 0.3515
Since our p-value (0.3515) is greater than α = 0.05, we fail to reject the null hypothesis and conclude that
the variances of Sepal.Width are not significantly different across species.
bartlett.test(Sepal.Length~Species,iris)
##
## Bartlett test of homogeneity of variances
##
## data: Sepal.Length by Species
## Bartlett's K-squared = 16.006, df = 2, p-value = 0.0003345
bartlett.test(Petal.Width~Species,iris)
##
## Bartlett test of homogeneity of variances
##
## data: Petal.Width by Species
## Bartlett's K-squared = 39.213, df = 2, p-value = 3.055e-09
bartlett.test(Petal.Length~Species,iris)
##
## Bartlett test of homogeneity of variances
##
## data: Petal.Length by Species
## Bartlett's K-squared = 55.423, df = 2, p-value = 9.229e-13
Here we see strong violations of the equal-variance assumption for Sepal.Length, Petal.Width and Petal.Length. Since our group sizes n_i are large, we can still proceed with our ANOVA.
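As a cross-check on the decision to proceed despite unequal variances, one option (not part of the original analysis) is Welch's one-way test, which does not assume equal variances. The sketch below uses `oneway.test` from base R with `var.equal = FALSE`; `measurements` and `welch_p` are illustrative names.

```r
# Welch's one-way ANOVA (no equal-variance assumption) for each measurement
measurements <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

welch_p <- sapply(measurements, function(v) {
  f <- reformulate("Species", response = v)   # builds e.g. Sepal.Length ~ Species
  oneway.test(f, data = iris, var.equal = FALSE)$p.value
})

signif(welch_p, 3)   # p-values of the Welch test, one per measurement
```

If these p-values lead to the same conclusions as the classical ANOVA, the unequal variances are unlikely to be driving the results.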
ANOVA
summary(aov(Sepal.Width~Species,data=iris))
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 11.35 5.672 49.16 <2e-16 ***
## Residuals 147 16.96 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our ANOVA shows that there are statistically significant differences in Sepal.Width between our groups.
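To make the ANOVA table for Sepal.Width concrete, the sketch below recomputes its sums of squares and F statistic by hand in base R. The variable names (`ss_between`, `ss_within`, `f_stat` and so on) are introduced here for illustration.

```r
# One-way ANOVA for Sepal.Width computed from first principles
y <- iris$Sepal.Width
g <- iris$Species

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
group_n     <- tapply(y, g, length)

# Between-group sum of squares: spread of the group means around the grand mean
ss_between <- sum(group_n * (group_means - grand_mean)^2)
# Within-group sum of squares: spread of observations around their own group mean
ss_within  <- sum((y - ave(y, g))^2)

df_between <- nlevels(g) - 1          # 3 species - 1 = 2
df_within  <- length(y) - nlevels(g)  # 150 flowers - 3 species = 147

f_stat  <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

c(ss_between = ss_between, ss_within = ss_within, F = f_stat, p = p_value)
```

The sums of squares and F value should agree with the `Sum Sq` and `F value` columns printed by `summary(aov(...))` above.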
summary(aov(Sepal.Length~Species,data=iris))
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 63.21 31.606 119.3 <2e-16 ***
## Residuals 147 38.96 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(Petal.Length~Species,data=iris))
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 437.1 218.55 1180 <2e-16 ***
## Residuals 147 27.2 0.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(aov(Petal.Width~Species,data=iris))
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 80.41 40.21 960 <2e-16 ***
## Residuals 147 6.16 0.04
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Just as we suspected when we observed the boxplots, all our variables differ between groups. We
therefore proceed to perform multiple comparison testing.
Multiple Comparison
TukeyHSD(aov(Sepal.Width~Species,data=iris),conf.level = .95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sepal.Width ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa -0.658 -0.81885528 -0.4971447 0.0000000
## virginica-setosa -0.454 -0.61485528 -0.2931447 0.0000000
## virginica-versicolor 0.204 0.04314472 0.3648553 0.0087802
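The `lwr` and `upr` columns can be reproduced by hand. With equal group sizes, the half-width of each Tukey interval is the studentized range quantile times sqrt(MSE / n). The sketch below checks this for Sepal.Width; `half_width` and the other names are illustrative.

```r
# Reconstructing a Tukey HSD interval for Sepal.Width by hand
fit <- aov(Sepal.Width ~ Species, data = iris)

mse         <- deviance(fit) / df.residual(fit)  # residual mean square
n_per_group <- 50                                # flowers per species
k           <- nlevels(iris$Species)             # number of groups

# Half-width of the 95% family-wise interval (equal group sizes)
half_width <- qtukey(0.95, nmeans = k, df = df.residual(fit)) * sqrt(mse / n_per_group)

group_means <- tapply(iris$Sepal.Width, iris$Species, mean)
diff_vs <- unname(group_means["versicolor"] - group_means["setosa"])

c(diff = diff_vs, lwr = diff_vs - half_width, upr = diff_vs + half_width)
```

The resulting interval should match the versicolor-setosa row of the table above.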
TukeyHSD(aov(Sepal.Length~Species,data=iris),conf.level = .95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Sepal.Length ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa 0.930 0.6862273 1.1737727 0
## virginica-setosa 1.582 1.3382273 1.8257727 0
## virginica-versicolor 0.652 0.4082273 0.8957727 0
TukeyHSD(aov(Petal.Length~Species,data=iris),conf.level = .95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Petal.Length ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa 2.798 2.59422 3.00178 0
## virginica-setosa 4.090 3.88622 4.29378 0
## virginica-versicolor 1.292 1.08822 1.49578 0
TukeyHSD(aov(Petal.Width~Species,data=iris),conf.level = .95)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Petal.Width ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa 1.08 0.9830903 1.1769097 0
## virginica-setosa 1.78 1.6830903 1.8769097 0
## virginica-versicolor 0.70 0.6030903 0.7969097 0
In fact we find that every pair of species differs significantly on every variable, so we should be able to tell
which species a flower belongs to by looking at any one of the variables. Since petal length has the largest mean
differences, it would be the easiest variable to use.
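Raw mean differences depend on the scale of each variable, so a scale-free comparison can help here. One option, not used in the original analysis, is eta squared: the share of total variation explained by species, SS_between / (SS_between + SS_within). From the ANOVA tables above this is roughly 437.1 / (437.1 + 27.2) ≈ 0.94 for Petal.Length. A minimal sketch, with `measurements` and `eta_squared` as illustrative names:

```r
# Eta squared (proportion of variance explained by Species) for each measurement
measurements <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

eta_squared <- sapply(measurements, function(v) {
  fit <- aov(reformulate("Species", response = v), data = iris)
  ss  <- summary(fit)[[1]][["Sum Sq"]]   # c(between-species SS, residual SS)
  ss[1] / sum(ss)
})

sort(round(eta_squared, 3), decreasing = TRUE)
```

Ranking the variables this way gives a second view on which measurement separates the species most clearly.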
Conclusion
In our investigation we sought to determine differences between species of iris by looking at measurements
taken from their flowers. We found differences with respect to every measurement that we looked at, and we found the
largest differences in Petal Length. Future work could look at building a model that predicts
the species based on the given measurements, since our investigation shows there are significant differences
between species on these measurements.
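As a rough illustration of the future work mentioned above, the sketch below classifies each flower from Petal.Length alone, using the midpoints between adjacent species means as cut-offs. This is a hypothetical toy rule evaluated on the same data it was built from, not a validated model; all names are introduced here for illustration.

```r
# Toy single-variable classifier: cut Petal.Length at midpoints between species means
means <- tapply(iris$Petal.Length, iris$Species, mean)
m     <- as.numeric(means)   # assumes the means increase in factor-level order (true here)

cuts <- (m[-length(m)] + m[-1]) / 2   # midpoints between neighbouring species means

predicted <- cut(iris$Petal.Length,
                 breaks = c(-Inf, cuts, Inf),
                 labels = names(means))

confusion <- table(actual = iris$Species, predicted = predicted)
accuracy  <- mean(predicted == iris$Species)

confusion
accuracy
```

A proper model would be fitted on a training subset and assessed on held-out flowers, possibly using all four measurements.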