PART 2
You will need to use R/RStudio for this part of the exam.
In this part, you will analyze the dataset seeds.csv, which contains measurements of
geometrical properties of kernels belonging to three different varieties of wheat: Kama, Rosa, and Canadian.
The measurements were obtained using high-quality visualization via a soft X-ray technique. To download and load
the data into your RStudio environment, use:
seeds <- read.csv('https://raw.githubusercontent.com/kejzlarv/Teaching/main/seeds.csv')
This will create a data frame seeds in your RStudio environment that contains the following variables.
The geometrical measurements are all real-valued, continuous, and measured in millimeters:
• Area
• Perimeter
• Compactness: Computed as 4π × Area / Perimeter^2
• Kernel.Length
• Kernel.Width
• Asymmetry.Coeff: Asymmetry coefficient
• Kernel.Groove: Length of kernel groove
• Type: Categorical variable taking values 1,2, or 3 based on the wheat variety (1 = Kama, 2 = Rosa,
3 = Canadian)
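The Compactness formula above can be sanity-checked with a few lines of R. This is only an illustrative sketch: the helper name compactness is an assumption, not something provided with the exam, and the check uses a circle because its area and perimeter are known exactly.

```r
# Compactness of a planar shape: 4 * pi * Area / Perimeter^2.
# `compactness` is a hypothetical helper name used only for illustration.
compactness <- function(area, perimeter) {
  4 * pi * area / perimeter^2
}

# A circle of radius r has area pi * r^2 and perimeter 2 * pi * r,
# so the formula should return 1 for any r > 0.
r <- 2
circle_value <- compactness(pi * r^2, 2 * pi * r)
print(circle_value)

# If `seeds` has been loaded with the read.csv() call above, compare the stored
# column with the formula (some discrepancy from rounding would not be surprising).
if (exists("seeds")) {
  print(summary(seeds$Compactness - compactness(seeds$Area, seeds$Perimeter)))
}
```

Because the ratio divides an area by a squared length, the result is dimensionless and equals 1 only for a perfect circle; less circular outlines give smaller values.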
Note: We will not make use of the wheat types in most of what follows. After
performing clustering, however, we will check the extent to which these wheat types agree with
the results of clustering.
(1) [1 point] Compute the covariance matrix of the 7 geometrical measurements (without Type!)
and comment on the need to standardize the data before clustering.
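One possible starting point for question (1) is sketched below. The function name measurement_cov and its arguments are assumptions made for illustration, and the sketch assumes the data frame holds only the seven measurements plus Type.

```r
# Hypothetical helper: covariance matrix, variances, and a standardized copy
# of every column except the label column named in `drop`.
measurement_cov <- function(df, drop = "Type") {
  x <- df[, setdiff(names(df), drop), drop = FALSE]
  list(
    cov       = cov(x),          # covariance matrix of the measurements
    variances = sapply(x, var),  # its diagonal, one variance per variable
    scaled    = scale(x)         # centered, unit-variance version of the data
  )
}

# Runs only if `seeds` has been loaded with the read.csv() call given earlier.
if (exists("seeds")) {
  out <- measurement_cov(seeds)
  print(round(out$cov, 3))
  # Variances on very different scales suggest that unstandardized
  # Euclidean distances would be dominated by a few variables.
  print(round(sort(out$variances, decreasing = TRUE), 3))
}
```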
(2) [1 point] Plot the scatterplot matrix of the 7 geometrical measurements with histograms on the
diagonals and scatterplots everywhere else.
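A scatterplot matrix with histograms on the diagonal can be drawn with base R's pairs() and a custom diagonal panel. The sketch below adapts the panel pattern shown in the pairs() help page; the names panel_hist and scatter_matrix are assumptions, and the plotting options are arbitrary choices.

```r
# Diagonal panel: draw a histogram rescaled to fit the panel.
# Extra graphical arguments passed by pairs() are accepted and ignored.
panel_hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))
  par(usr = c(usr[1:2], 0, 1.5))  # leave headroom for the variable name
  h <- hist(x, plot = FALSE)
  nb <- length(h$breaks)
  rect(h$breaks[-nb], 0, h$breaks[-1], h$counts / max(h$counts), col = "grey80")
}

# Hypothetical wrapper: scatterplot matrix of all columns except `label`.
scatter_matrix <- function(df, label = "Type") {
  x <- df[, setdiff(names(df), label), drop = FALSE]
  pairs(x, diag.panel = panel_hist, pch = 20, cex = 0.5)
  invisible(x)
}

# Runs only if `seeds` has been loaded with the read.csv() call given earlier.
if (exists("seeds")) {
  scatter_matrix(seeds)
}
```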
(3) [2 points] Plot the bivariate kernel density estimate of Perimeter versus Asymmetry.Coeff. What
does this density estimate suggest?
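For question (3), one option is kde2d() from the MASS package, which computes a Gaussian kernel density estimate on a grid. This is a sketch under assumptions: the wrapper name kde_2d, the grid size, the default bandwidths, and the choice to put Perimeter on the horizontal axis are all illustrative rather than required.

```r
library(MASS)  # provides kde2d(); MASS ships with standard R installations

# Hypothetical wrapper: estimate and display a bivariate kernel density.
kde_2d <- function(x, y, n = 50, xlab = "x", ylab = "y") {
  dens <- kde2d(x, y, n = n)  # default normal-reference bandwidths
  image(dens, xlab = xlab, ylab = ylab)
  contour(dens, add = TRUE)
  invisible(dens)
}

# Runs only if `seeds` has been loaded with the read.csv() call given earlier.
if (exists("seeds")) {
  kde_2d(seeds$Perimeter, seeds$Asymmetry.Coeff,
         xlab = "Perimeter", ylab = "Asymmetry.Coeff")
}
```

Several separated modes in such a plot would hint at more than one group in the data, which is the kind of feature the question asks you to interpret.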
(4) Principal Component Analysis.
(a) [2 points] Carry out the PCA of the 7 geometrical measurements and provide a biplot with
the shape of points corresponding to the Type.
(b) [2 points] Use the biplot from (a) to characterize the Kama, Rosa, and Canadian wheat types
based on the geometrical measurements of their kernels.
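Base R's biplot() labels observations with text, so drawing point shapes by Type takes a small amount of manual plotting. The sketch below is one way to do it: the function name pca_biplot is an assumption, the PCA is run on standardized variables (a choice you should justify from question (1)), and the arrow stretch factor is purely cosmetic.

```r
# Hypothetical helper: PCA on all columns except `label`, with a hand-drawn
# biplot whose point shapes follow the values of `label`.
pca_biplot <- function(df, label = "Type") {
  x <- df[, setdiff(names(df), label), drop = FALSE]
  pca <- prcomp(x, center = TRUE, scale. = TRUE)
  scores <- pca$x[, 1:2]
  loadings <- pca$rotation[, 1:2]
  # Stretch the loading arrows so they are visible on the score scale
  # (a plotting convenience, not part of PCA itself).
  stretch <- 0.8 * max(abs(scores)) / max(abs(loadings))
  grp <- factor(df[[label]])
  plot(scores, pch = as.integer(grp), xlab = "PC1", ylab = "PC2")
  arrows(0, 0, stretch * loadings[, 1], stretch * loadings[, 2],
         length = 0.08, col = "red")
  text(stretch * loadings[, 1], stretch * loadings[, 2],
       labels = rownames(loadings), col = "red", pos = 3)
  legend("topright", legend = levels(grp), pch = seq_along(levels(grp)),
         title = label)
  invisible(pca)
}

# Runs only if `seeds` has been loaded with the read.csv() call given earlier.
if (exists("seeds")) {
  pca <- pca_biplot(seeds)
  print(summary(pca))  # proportion of variance explained by each component
}
```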
(5) Agglomerative hierarchical clustering.
(a) [2 points] Construct dendrograms based on the complete and average linkage clustering of the
7 geometrical measurements.
(b) [1 point] At what heights do you need to cut these dendrograms in order to partition the data
into 3 clusters?
(c) [3 points] Which of these two clusterings, assuming that partitioning into 3 clusters is
appropriate, agrees more with the partitioning of the data according to the wheat type? Make sure to
provide a clear justification as a statistician/data analyst.
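The steps in question (5) can be sketched with hclust() and cutree(). In this illustration the name hc_compare is an assumption, the data are standardized and compared with Euclidean distance (both choices you would need to justify), and the cut heights are read off the merge heights stored in the hclust object.

```r
# Hypothetical helper: complete and average linkage on standardized data,
# with the range of cut heights that yields k clusters and a cross-table
# of cluster membership against `label`.
hc_compare <- function(df, label = "Type", k = 3) {
  x <- scale(df[, setdiff(names(df), label), drop = FALSE])
  d <- dist(x)  # Euclidean distances
  lapply(c(complete = "complete", average = "average"), function(m) {
    hc <- hclust(d, method = m)
    n_merge <- length(hc$height)
    # After merge number (n_merge - k + 1) exactly k clusters remain, so any
    # cut at or above that height and below the next one gives k clusters.
    cut_range <- hc$height[c(n_merge - k + 1, n_merge - k + 2)]
    cl <- cutree(hc, k = k)
    list(hc = hc, cut_range = cut_range, clusters = cl,
         table = table(cluster = cl, type = df[[label]]))
  })
}

# Runs only if `seeds` has been loaded with the read.csv() call given earlier.
if (exists("seeds")) {
  res <- hc_compare(seeds)
  plot(res$complete$hc, labels = FALSE, main = "Complete linkage")
  plot(res$average$hc, labels = FALSE, main = "Average linkage")
  print(res$complete$cut_range)
  print(res$average$cut_range)
  print(res$complete$table)
  print(res$average$table)
}
```

A cross-table in which each cluster is dominated by a single wheat type indicates good agreement; a numerical summary such as the adjusted Rand index can support the same argument.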
(6) K-means clustering.
(a) [3 points] Use K-means clustering on the 7 geometrical measurements. What is the
appropriate number of clusters? Justify your answer with an appropriate plot.
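A common plot for choosing K is the total within-cluster sum of squares against K (an "elbow" plot). The sketch below is one way to produce it; the function name kmeans_elbow, the range of K, the number of random starts, and the seed are all assumptions, and other diagnostics could be used instead.

```r
# Hypothetical helper: total within-cluster sum of squares for K = 1..k_max
# on standardized data, plotted against K.
kmeans_elbow <- function(df, label = "Type", k_max = 8, seed = 1) {
  x <- scale(df[, setdiff(names(df), label), drop = FALSE])
  set.seed(seed)  # K-means starts from random centers
  wss <- sapply(1:k_max, function(k) {
    kmeans(x, centers = k, nstart = 25)$tot.withinss
  })
  plot(1:k_max, wss, type = "b",
       xlab = "Number of clusters K",
       ylab = "Total within-cluster sum of squares")
  invisible(wss)
}

# Runs only if `seeds` has been loaded with the read.csv() call given earlier.
if (exists("seeds")) {
  wss <- kmeans_elbow(seeds)
  print(round(wss, 1))
}
```

The value of K after which the curve flattens is the usual candidate for the number of clusters.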
(7) Finite mixture density model.
(a) [1 point] Use the finite mixture density model to cluster the 7 geometrical measurements.
(b) [1 point] Which model (including the number of clusters) provides the best clustering? How
are you deciding?
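Gaussian finite mixture models can be fitted with the mclust package, which selects the covariance structure and number of components by BIC. The sketch below assumes mclust is installed (it is not part of base R) and that the data are used unstandardized; the wrapper name fit_mixture is hypothetical.

```r
library(mclust)  # assumed installed, e.g. via install.packages("mclust")

# Hypothetical wrapper: fit Gaussian mixtures to all columns except `label`
# and keep the model with the best BIC over mclust's default candidates.
fit_mixture <- function(df, label = "Type") {
  x <- df[, setdiff(names(df), label), drop = FALSE]
  Mclust(x, verbose = FALSE)
}

# Runs only if `seeds` has been loaded with the read.csv() call given earlier.
if (exists("seeds")) {
  fit <- fit_mixture(seeds)
  print(summary(fit))       # selected model name and number of components
  plot(fit, what = "BIC")   # BIC for each model and number of components
  print(table(cluster = fit$classification, type = seeds$Type))
}
```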
(8) Extra credit: [3 points] Compare the results of agglomerative hierarchical clustering, K-means
clustering, and model-based clustering. Which of these procedures provides the best agreement with
the partitioning of the data according to the wheat type? Make sure to provide a clear justification as
a statistician/data analyst.
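One way to put the three procedures on a common footing is the adjusted Rand index between each clustering and the wheat type, available as adjustedRandIndex() in mclust. The sketch below makes several assumptions that you may want to change: the helper name compare_clusterings is hypothetical, all methods use standardized data, and every method (including the mixture model) is forced to 3 clusters.

```r
library(mclust)  # assumed installed; provides Mclust() and adjustedRandIndex()

# Hypothetical helper: adjusted Rand index of several k-cluster solutions
# against the known labels in column `label`.
compare_clusterings <- function(df, label = "Type", k = 3, seed = 1) {
  truth <- df[[label]]
  x <- scale(df[, setdiff(names(df), label), drop = FALSE])
  d <- dist(x)
  set.seed(seed)  # for the random starts of K-means
  labels <- list(
    complete = cutree(hclust(d, method = "complete"), k = k),
    average  = cutree(hclust(d, method = "average"), k = k),
    kmeans   = kmeans(x, centers = k, nstart = 25)$cluster,
    mixture  = Mclust(x, G = k, verbose = FALSE)$classification
  )
  # 1 means perfect agreement; values near 0 mean chance-level agreement.
  sapply(labels, adjustedRandIndex, y = truth)
}

# Runs only if `seeds` has been loaded with the read.csv() call given earlier.
if (exists("seeds")) {
  print(round(compare_clusterings(seeds), 3))
}
```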