STA238 test review document
This document includes the code and questions for our class review of basics in tidyverse coding,
graphical/numerical analyses, CLT, estimation, bootstrapping, likelihoods, and testing.
Other information for preparing
On the Test Information Page there are: Previous tests.
Feeling distressed? Here are some great resources if you need help.
Learning checklist
Getting started with R; Describing distributions; Calculating and interpreting
numerical summaries (week 1)
• Understand how this course works, including how you will be assessed and what to do if you miss an
assessment
• Type and run R code in Jupyterhub and produce a pdf document by knitting an RMarkdown document
• Identify the type of a variable (continuous numerical / discrete numerical / nominal categorical /
ordinal categorical)
• Describe the distribution of categorical and numerical variables based on data visualizations
• Create visualizations using ggplot2 to explore the distributions of categorical and numerical variables
Central Limit Theorem (week 2)
• Should be familiar with the Law of Large Numbers.
• Use Chebyshev’s Inequality.
• Be able to identify the distribution of the sample mean (or sum or proportion).
• Be able to calculate probabilities and percentiles of the sample mean (or sum or proportion).
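As a rough illustration of the last two points, the sketch below uses the CLT approximation that the sample mean is roughly N(mu, sigma^2 / n). The values of mu, sigma and n are made-up assumptions, not taken from any course dataset.

```r
# Hypothetical population values (assumptions for illustration only)
mu    <- 50   # population mean
sigma <- 10   # population standard deviation
n     <- 36   # sample size

# By the CLT, the sample mean is approximately N(mu, sigma^2 / n)
se <- sigma / sqrt(n)

# P(sample mean > 52)
p_above <- pnorm(52, mean = mu, sd = se, lower.tail = FALSE)

# 90th percentile of the sample mean
q90 <- qnorm(0.90, mean = mu, sd = se)

p_above
q90
```

For a sum instead of a mean, the same idea applies with mean n * mu and standard deviation sigma * sqrt(n).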
Models (week 3)
• Understand what iid means.
• Should understand what a model distribution, model parameters and estimators are.
• Be able to determine if an estimator is unbiased or biased for θ.
• Familiar with plotting and interpreting scatterplots.
• Write out a simple linear regression model in mathematical symbols.
• Run a simple linear regression model.
• Familiar with response/dependent and explanatory/predictor/independent variables.
Comparing estimators (week 4)
• Estimator properties.
• Comparing estimators via unbiasedness, variance, and MSE.
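One way to see these comparisons in practice is a small simulation. The sketch below is only an illustration under assumed settings (normal data with variance 4, samples of size 10); it compares the variance estimator that divides by n - 1 with the one that divides by n.

```r
set.seed(1)                 # arbitrary seed, for reproducibility
true_var <- 4               # hypothetical: data are N(0, sd = 2)
n        <- 10
reps     <- 10000

# Simulate many samples and compute both estimators on each one
s2_unbiased <- numeric(reps)
s2_biased   <- numeric(reps)
for (i in seq_len(reps)) {
  x <- rnorm(n, mean = 0, sd = sqrt(true_var))
  s2_unbiased[i] <- var(x)                 # divides by n - 1
  s2_biased[i]   <- var(x) * (n - 1) / n   # divides by n
}

# Summarise bias, variance and MSE for an estimator of true_var
compare <- function(est) {
  c(bias     = mean(est) - true_var,
    variance = var(est),
    mse      = mean((est - true_var)^2))
}

rbind(unbiased = compare(s2_unbiased), biased = compare(s2_biased))
```

The divide-by-n estimator is biased downwards but has the smaller variance, so neither property alone settles which estimator is preferable; MSE combines the two.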
Bayesian (week 5)
• You should be familiar with the terminology of prior and posterior.
• Should be able to derive the posterior for some given prior and likelihood function.
• Should be able to comment on the relationship between the posterior and the prior/data.
• How does the posterior change if n increases? What about if the data is far from the prior?
• What does the prior mean/represent?
• What does the posterior mean/represent?
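To get a feel for how the posterior sits between the prior and the data, here is a minimal sketch assuming a Beta(a, b) prior for a proportion with a binomial likelihood. This conjugate pair, and the numbers used, are assumptions chosen for illustration; the course may use other examples.

```r
# Hypothetical prior: Beta(a, b) for a proportion p (prior mean a / (a + b))
a <- 2
b <- 2

# With y successes in n trials the posterior is Beta(a + y, b + n - y),
# so the posterior mean is (a + y) / (a + b + n)
posterior_mean <- function(y, n) (a + y) / (a + b + n)

prior_mean <- a / (a + b)

# Same sample proportion (0.8) with increasing sample size
small <- posterior_mean(y = 8,   n = 10)
large <- posterior_mean(y = 800, n = 1000)

c(prior = prior_mean, n_10 = small, n_1000 = large)
```

As n grows, the posterior mean moves away from the prior mean and towards the sample proportion.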
Bootstrapping (week 6)
• Distinguish between the distribution of a variable in the population or based on a sample and sampling
distributions of statistics.
• Predict the effect of sample size (n) and number of repetitions on the centre, shape, and spread of the
sampling distribution of sample means.
• Estimate sampling distributions of statistics by selecting many samples of the same size from the
population or by drawing many bootstrap samples from the original sample.
• Explain the purpose of the bootstrap method and recognize applications where this method might be
useful.
• Describe the steps required to obtain a bootstrap sampling distribution.
• Use R to compute bootstrap confidence intervals for parameters (e.g, µ, p, median, etc.).
• Recognize the connection between confidence levels and the widths of confidence interval estimates.
• Distinguish between correct and incorrect interpretations of confidence intervals.
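The link between confidence level and interval width can be checked with a short base R sketch. The sample here is simulated (an assumption, not course data) and the statistic is the median; the percentile method is used for the intervals.

```r
set.seed(238)   # arbitrary seed

# Hypothetical sample (simulated; not course data)
x <- rexp(30, rate = 1 / 5)

# Draw bootstrap samples of the same size, with replacement,
# and store the statistic of interest (here, the median) for each
reps <- 5000
boot_medians <- replicate(
  reps,
  median(sample(x, size = length(x), replace = TRUE))
)

# Percentile intervals at three confidence levels
ci_90 <- quantile(boot_medians, probs = c(0.05, 0.95))
ci_95 <- quantile(boot_medians, probs = c(0.025, 0.975))
ci_99 <- quantile(boot_medians, probs = c(0.005, 0.995))

# Widths cannot shrink as the confidence level rises
widths <- c(ci_90 = unname(diff(ci_90)),
            ci_95 = unname(diff(ci_95)),
            ci_99 = unname(diff(ci_99)))
widths
```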
Maximum Likelihood (week 7)
• Distinguish between the likelihood function and probability.
• Calculate the likelihood function for a given iid sample with known pdf.
• Calculate the loglikelihood function.
• Evaluate the likelihood, loglikelihood or ratio, based off some given inputs.
• Be able to conceptualize how to evaluate the MLE based off a simulation and the likelihood functions
evaluated at each estimator.
• Derive the Maximum Likelihood estimator.
• Perform the second derivative test in order to ensure that the MLE is a true maximum.
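As a worked sketch of the last few steps, take an iid sample x_1, ..., x_n from the exponential density f(x) = λe^(−λx), x > 0 (chosen here as an example). The likelihood, loglikelihood, MLE and second derivative check are:

```latex
\begin{align*}
L(\lambda) &= \prod_{i=1}^{n} \lambda e^{-\lambda x_i}
            = \lambda^{n} \exp\!\Big(-\lambda \sum_{i=1}^{n} x_i\Big), \\
\ell(\lambda) &= \ln L(\lambda) = n \ln \lambda - \lambda \sum_{i=1}^{n} x_i, \\
\ell'(\lambda) &= \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0
  \quad\Longrightarrow\quad
  \hat{\lambda} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}, \\
\ell''(\lambda) &= -\frac{n}{\lambda^{2}} < 0 \quad \text{for all } \lambda > 0 .
\end{align*}
```

Because the second derivative is negative everywhere on λ > 0, the stationary point is a maximum.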
Confidence Intervals (week 8)
• Identify the distribution of the sample mean if the population variance is known and the data is normal.
• Identify the distribution of the sample mean if the population variance is unknown and the data is
normal.
• Be able to find probability and quantiles of the Z and t distributions using qnorm() and qt().
• Be able to calculate the critical value if given a confidence level.
• Choose the appropriate CI formula for a given situation.
• Be able to calculate a CI for the mean.
• Identify assumptions/criteria needed to calculate a CI.
• Bootstrap CIs for statistics other than the mean, median, proportion, sum.
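The sketch below shows how a critical value follows from a confidence level, and how the Z and t intervals for a mean differ. All summary values are hypothetical assumptions chosen for illustration.

```r
# Hypothetical summary statistics (assumptions for illustration only)
n     <- 25
x_bar <- 10
s     <- 2      # sample standard deviation
sigma <- 2      # used only in the "variance known" case

conf_level <- 0.95
alpha      <- 1 - conf_level

# Critical values: Z when sigma is known, t with n - 1 df when it is not
z_crit <- qnorm(1 - alpha / 2)
t_crit <- qt(1 - alpha / 2, df = n - 1)

ci_z <- x_bar + c(-1, 1) * z_crit * sigma / sqrt(n)
ci_t <- x_bar + c(-1, 1) * t_crit * s / sqrt(n)

rbind(z_interval = ci_z, t_interval = ci_t)
```

With the same standard deviation, the t interval is wider because the t critical value exceeds the Z critical value for any finite degrees of freedom.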
Hypothesis and GoF Tests (week 9)
• Use R to conduct a statistical test (i.e., hypothesis test) for one mean.
• Assess evidence against an assumption (i.e., the null hypothesis) and write a conclusion for a hypothesis
test based on a p-value.
• Distinguish between correct and incorrect conclusions of hypothesis tests.
• Explain how test statistic values can be calculated under the null hypothesis.
• Interpret the results of a hypothesis test by making conclusions based on the p-value and a significance
level.
• Be able to find critical values of the chi-squared distribution.
• Recognize correct and incorrect descriptions of “p-value”.
• Distinguish between type 1 and type 2 errors in different contexts.
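A minimal sketch of a one-sample test in R follows. The data are simulated and the null value of 50 is arbitrary (both are assumptions for illustration); the point is that `t.test()` agrees with the statistic and p-value computed by hand under the null hypothesis.

```r
set.seed(9)   # arbitrary seed

# Hypothetical sample (simulated; not course data)
x <- rnorm(20, mean = 52, sd = 6)

# Two-sided one-sample t-test of H0: mu = 50
mu0 <- 50
result <- t.test(x, mu = mu0, alternative = "two.sided")

# The same test statistic and p-value computed by hand
t_stat  <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))
p_value <- 2 * pt(abs(t_stat), df = length(x) - 1, lower.tail = FALSE)

c(t_from_t.test = unname(result$statistic), t_by_hand = t_stat)
c(p_from_t.test = result$p.value, p_by_hand = p_value)

# Chi-squared critical value for a 5% level test with 3 degrees of freedom
chisq_crit <- qchisq(0.95, df = 3)
chisq_crit
```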
Practice questions
library(tidyverse)
## Warning: package ’tidyverse’ was built under R version 4.0.3
## Warning: package ’tibble’ was built under R version 4.0.3
## Warning: package ’readr’ was built under R version 4.0.3
knitr::opts_chunk$set(message = FALSE)
The dplyr package has some data about Star Wars characters. Let’s assume it is a representative sample
of all characters seen in Episodes 1 to 9.
starwars<-starwars
glimpse(starwars)
## Rows: 87
## Columns: 14
## $ name "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia...
## $ height 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180...
## $ mass 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, ...
## $ hair_color "blond", NA, NA, "none", "brown", "brown, grey", "brown"...
## $ skin_color "fair", "gold", "white, blue", "white", "light", "light"...
## $ eye_color "blue", "yellow", "red", "yellow", "brown", "blue", "blu...
## $ birth_year 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57....
## $ sex "male", "none", "none", "male", "female", "male", "femal...
## $ gender "masculine", "masculine", "masculine", "masculine", "fem...
## $ homeworld "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan",...
## $ species "Human", "Droid", "Droid", "Human", "Human", "Human", "H...
## $ films [<"The Empire Strikes Back", "Revenge of the Sith", "Re...
## $ vehicles [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, ...
## $ starships [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced ...
## # A tibble: 6 x 14
##   name  height  mass hair_color skin_color eye_color birth_year sex   gender
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
## 1 Luke~    172    77 blond      fair       blue            19   male  mascu~
## 2 C-3PO    167    75 <NA>       gold       yellow         112   none  mascu~
## 3 R2-D2     96    32 <NA>       white, bl~ red             33   none  mascu~
## 4 Dart~    202   136 none       white      yellow          41.9 male  mascu~
## 5 Leia~    150    49 brown      light      brown           19   fema~ femin~
## 6 Owen~    178   120 brown, gr~ light      blue            52   male  mascu~
## # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
Question 1
my_sw_data <- starwars %>%
  mutate(size = case_when(
    height > 200 ~ "Tall",
    height <= 200 & height > 160 ~ "Medium",
    TRUE ~ "Small")) %>%
  select(name, gender, height, size, mass, species)
What does this code do? (Just the part starting with my_sw_data.)
A. Takes the starwars dataset, filters out short characters and selects levels of the name variable that are
equal to “gender”, “height”, “size”, “mass” or “species”.
B. Takes the starwars dataset, makes a new variable called size with the levels “Small”, “Medium” and
“Tall” based on height, and then selects only the variables name, gender, height, size, mass and species
to be in a new tibble called “my_sw_data”.
C. Takes the starwars dataset, replaces any missing values for size with “Small” and selects only the variables
name, gender, height, size, mass and species to be in a new tibble called “my_sw_data”.
D. Takes the starwars dataset, makes a new variable called height with the levels “Small”, “Medium” and
“Tall” based on size and mass, and then selects only the variables name, gender, height, size, mass and
species to be in the original tibble called “starwars”.
Question 2
my_sw_data %>%
  filter(!is.na(height)) %>%
  filter(gender %in% c("feminine", "masculine")) %>%
  ggplot(aes(height)) +
  geom_histogram(bins = 30) +
  facet_wrap(~gender) ## You don’t need to know this line for the test
[Figure: histograms of height (x-axis, about 100 to 250) against count (y-axis), in two panels: feminine and masculine.]
Which features should you talk about to compare the above distributions?
A. Pattern, strength, direction
C. Strength, mean, range
D. Size, length, dimensions
I want to know a plausible range of values for the mean height of feminine characters in Star Wars Episodes
1 to 9.
my_sw_data %>%
  filter(!is.na(height)) %>%
  mutate(gender = case_when(
    gender == "masculine" ~ "masculine",
    gender == "feminine" ~ "feminine",
    TRUE ~ "none")) %>%
  select(name, gender, height, mass, species) %>%
  group_by(gender) %>%
  summarise(n = n(),
            mean_height = mean(height),
            median_height = median(height))
## # A tibble: 3 x 4
##   gender        n mean_height median_height
## * <chr>     <int>       <dbl>         <dbl>
## 1 feminine     16        165.          166.
## 2 masculine    62        177.          183
## 3 none          3        181.          183
fem_heights <- starwars %>%
  filter(gender == "feminine", !is.na(height))
glimpse(fem_heights)
## Rows: 16
## Columns: 14
## $ name "Leia Organa", "Beru Whitesun lars", "Mon Mothma", "Shmi...
## $ height 150, 165, 150, 163, 178, 184, 157, 170, 166, 165, 168, 2...
## $ mass 49.0, 75.0, NA, NA, 55.0, 50.0, NA, 56.2, 50.0, NA, 55.0...
## $ hair_color "brown", "brown", "auburn", "black", "none", "none", "br...
## $ skin_color "light", "light", "fair", "fair", "blue", "dark", "light...
## $ eye_color "brown", "blue", "blue", "brown", "hazel", "blue", "brow...
## $ birth_year 19, 47, 48, 72, 48, NA, NA, 58, 40, NA, NA, NA, NA, NA, ...
## $ sex "female", "female", "female", "female", "female", "femal...
## $ gender "feminine", "feminine", "feminine", "feminine", "feminin...
## $ homeworld "Alderaan", "Tatooine", "Chandrila", "Tatooine", "Ryloth...
## $ species "Human", "Human", "Human", "Human", "Twi’lek", "Tholothi...
## $ films [<"The Empire Strikes Back", "Revenge of the Sith", "Re...
## $ vehicles ["Imperial Speeder Bike", <>, <>, <>, <>, <>, <>, <>, <...
## $ starships [<>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>, <>...
Question 3
How many rows are in the fem_heights dataset?
Question 4
How many columns are in the fem_heights dataset?
Question 5
Which of the following are always true, if we do enough bootstrap resamples?
1. The mean of the bootstrap sampling distribution will be approximately the test statistic from our
sample.
2. The statistic of interest calculated for the bootstrap sampling distribution will be exactly the same as
the test statistic from our sample.
3. The mean of the bootstrap sampling distribution will be approximately the median of our sample.
A. Only 1.
B. Only 1 and 2.
C. Only 3.
D. None of these statements.
set.seed(42)
repetitions <- 5000
store_sims <- rep(NA, repetitions)
for (i in 1:repetitions) {
  # Resample the rows with replacement and store the bootstrap mean height
  store_sims[i] <- fem_heights %>%
    sample_n(size = nrow(fem_heights), replace = TRUE) %>%
    summarise(x = mean(height)) %>%
    as.numeric()
}
store_sims <- tibble(store_sims)
store_sims %>%
  ggplot(aes(x = store_sims)) +
  geom_histogram(bins = 20, color = "black", fill = "grey")
[Figure: histogram of store_sims (x-axis, about 140 to 180) against count (y-axis, 0 to 800).]
quantile(store_sims$store_sims,
         probs = c(0.005, 0.01, 0.025, 0.05, 0.2, 0.25, 0.5,
                   0.75, 0.8, 0.95, 0.975, 0.99, 0.995))
## 0.5% 1% 2.5% 5% 20% 25% 50% 75%
## 149.3103 150.5619 152.8750 154.7500 160.1250 161.1875 165.0000 168.6875
## 80% 95% 97.5% 99% 99.5%
## 169.5000 173.3125 174.8141 176.8756 178.3763
Question 6
Which of these is a correct interpretation of the CI (149.3, 178.4)?
A. We are 99% certain that each feminine character in Episodes 1 to 9 was between 149.3 and 178.4 cm tall.
B. We expect 99% of feminine characters in Episodes 1 to 9 to be between 149.3 and 178.4 cm tall.
C. We are 98% confident that the true median height of feminine characters is between 149.3 and 178.4 cm.
D. We are 99% confident that the true mean height of feminine characters in Episodes 1 to 9 is between
149.3 and 178.4 cm.
Question 7
True or False:
It is possible that the 80% CI is (150, 177).
my_sw_data %>%
  filter(!is.na(mass)) %>%
  mutate(gender = case_when(
    gender == "masculine" ~ "masculine",
    gender == "feminine" ~ "feminine",
    TRUE ~ "none")) %>%
  select(name, gender, height, mass, species) %>%
  group_by(gender) %>%
  summarise(n = n(),
            mean_mass = mean(mass),
            median_mass = median(mass),
            sd_mass = sd(mass))
## # A tibble: 3 x 5
##   gender        n mean_mass median_mass sd_mass
## * <chr>     <int>     <dbl>       <dbl>   <dbl>
## 1 feminine      9      54.7          55    8.59
## 2 masculine    49     106.           80  185.
## 3 none          1      48            48   NA
Question 8
I want to test if the mean mass of all feminine characters is different from 50. What is the test statistic of
this test?
A. t = 1.641
B. Z = 4.924
C. t = 4.811
D. Z = −1.641
E. t = 14.433
Question 9
I want to test if the mean mass of all feminine characters is different from 50. What is the p-value of this
test?
A. <0.001
B. 0.4652
C. 0.9303
D. 0.0697
E. 0.1393
Question 10
Let’s do something a bit more mathematical. Assume that the masses of feminine characters are iid and exponentially
distributed with pdf $f(x) = \lambda e^{-\lambda x}$, $x > 0$. Use the data to estimate $\theta = \ln(\lambda)$ by finding the MLE of $\theta$ and
then evaluating it based on the sample. What is the MLE evaluation of $\theta$?
A. 4.002
B. -4.002
C. 1.019
D. $5.7 \times 10^{23}$
E. We cannot answer this based on the given information.