STA304H1F/1003HF Winter 2021 Assignment # 2
Posted: Friday, February 26, 2021
Due: Online into Quercus Assignment 2 by 8pm on Monday, March 8, 2021
Note: E-mail submissions will NOT be accepted. Late assignments will be accepted but are subject to a
penalty of 1% of the total assignment marks per hour late. Submissions will not be accepted more than
48 hours after the due date.
Students who would like additional accommodations should email the instructional team at email@example.com
at least 48 hours before the assignment is due.
• Answer both (2) questions of this assignment.
• Each assignment should be written up independently. If you work with other students on Question 1,
indicate the names of the students on your solutions. Question 2 should contain unique answers.
• Presentation of solutions is important. Assignments should be word-processed and presented neatly.
• Use proper statistical terminology and proper English language.
• Supporting material, such as unrequested R code and extraneous output, is optional. However, if you
choose to include it, please place it in a separate appendix at the end of your assignment.
• Compile your entire solution, including your appendix, as a PDF or Word file (LaTeX or R Markdown
can be your base). Submit your assignment as a PDF or Word file into the Quercus Assignment
named ‘Assignment 2’.
Grading: The grand total is 33 marks which includes 3 marks for excellent presentation. A general
marking scheme for most parts is given below:
Per Question Part
• 3 points: Complete, correct and clearly written answers. Answers model individual preparation and academic honesty (where applicable).
• 2 points: Good answers that are unclear, contain a few mistakes or are missing components. Answers demonstrate some individual preparation and some academic honesty (where applicable).
• 1 point: Poor answers or many missing components. Most answers do not demonstrate individual preparation or academic honesty (where applicable).
• 0 points: Missing or incomprehensible answers. Answers lack academic integrity.
Presentation
• 3 points: Well presented, easy to read, proper English used, R code shown only where requested.
• 2 points: Good presentation, some unnecessary R code and unformatted output.
• 1 point: Poor presentation, handwritten, hand-drawn diagrams, unnecessary R code and unformatted output.
• 0 points: Illegible, missing, unclear presentation.
1. (10 marks) Consider the Mainstreet Research Survey report from December 8, 2020, found
at the following link: https://www.mainstreetresearch.ca/poll/ontario-survey-doug-fords-handling-
(a) (1 mark) Choose one of the survey questions and identify one parameter of interest.
(b) Based on the relevant cross-tabulation table below your selected survey question, choose one
stratification variable and show the following:
i. (3 marks) use weighted frequency to compute an estimate of your population parameter,
and place a bound on the error of estimation, and
ii. (3 marks) use unweighted frequency to compute an estimate of your population parameter,
and place a bound on the error of estimation.
(c) (3 marks) Compare the two estimates in part (b) above. Explain which is a post-stratified
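As a rough illustration of the two calculations in part (b), the R sketch below estimates a proportion from a cross-tabulation with three strata. All numbers (stratum sample sizes, sample proportions and stratum weights) are invented for illustration and are not taken from the Mainstreet report. It assumes the weighted frequencies supply the stratum weights W_h, uses the common bound B = 2*sqrt(V) and ignores the finite population correction.

```r
# Hypothetical cross-tab with three strata (e.g. age groups); all values invented.
n_h <- c(200, 300, 500)        # unweighted stratum sample sizes
p_h <- c(0.40, 0.50, 0.60)     # sample proportion answering "yes" in each stratum
W_h <- c(0.30, 0.35, 0.35)     # assumed stratum weights N_h / N (sum to 1)

# Weighted (stratified) estimate: p_st = sum(W_h * p_h)
p_st <- sum(W_h * p_h)
v_st <- sum(W_h^2 * p_h * (1 - p_h) / (n_h - 1))   # fpc ignored (assumption)
B_st <- 2 * sqrt(v_st)                             # bound on the error of estimation

# Unweighted estimate: pool all respondents as one simple random sample
n     <- sum(n_h)
p_srs <- sum(n_h * p_h) / n
v_srs <- p_srs * (1 - p_srs) / (n - 1)             # fpc ignored (assumption)
B_srs <- 2 * sqrt(v_srs)

round(c(p_st = p_st, B_st = B_st, p_srs = p_srs, B_srs = B_srs), 4)
```

The two estimates differ whenever the sample shares n_h / n do not match the weights W_h, which is the comparison part (c) asks about.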
2. (20 marks) Consider the baseball dataset describing the population of baseball players in the data
file baseball.csv. Once, at the beginning of your R code, set the random seed to the last 4 digits
of your student number.
The R package ‘sampling’, which includes the functions strata and getdata, is useful for this question.
The following R code shows how to install and load the package.
#install the sampling package (only needed once)
install.packages("sampling")
#load sampling package, to use the functions- strata and getdata
library(sampling)
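The following is a minimal sketch of how strata and getdata might be combined to draw a proportionally allocated stratified sample. It runs on a small toy population rather than baseball.csv; the column names (team, salary), the stratum sizes, the sample size of 10 and the seed 1234 are all placeholders chosen for illustration.

```r
library(sampling)  # assumes install.packages("sampling") has already been run

set.seed(1234)     # placeholder; use the last 4 digits of your student number

# Toy population standing in for baseball.csv (column names are hypothetical)
pop <- data.frame(
  team   = rep(c("A", "B", "C"), times = c(20, 30, 50)),
  salary = round(exp(rnorm(100, mean = 13, sd = 1)))
)
pop <- pop[order(pop$team), ]          # strata() expects data sorted by stratum

# Proportional allocation of n = 10: n_h = n * N_h / N
N_h <- table(pop$team)
n_h <- round(10 * N_h / sum(N_h))      # rounding may need adjusting to sum to n

# Simple random sampling without replacement inside each stratum
s    <- strata(pop, stratanames = "team", size = as.vector(n_h), method = "srswor")
samp <- getdata(pop, s)                # sampled rows plus ID_unit, Prob, Stratum
table(samp$team)
```

With real data, rounding n * N_h / N can leave the stratum sizes one or two short of (or over) the target total, so the allocation usually needs a manual adjustment.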
(a) (3 marks) Take a stratified random sample of 150 players, using proportional allocation with
the different teams as strata (teams are in column 1 of the data file). Describe how you selected
the sample. Show the R code used to obtain your stratified sample.
(b) (3 marks) Find the mean of the variable logsal = ln(salary), using your stratified sample, and
give a 95% CI.
(c) (3 marks) Estimate the proportion of players in the data set who are pitchers, using your
stratified sample, and give a 95% CI.
(d) (3 marks) Take a simple random sample of 150 players and repeat part (c). How does your
estimate compare with that of part (c)?
(e) (3 marks) Examine the sample variances of logsal in each stratum. Do you think optimal
allocation would be worthwhile for this problem?
(f) (5 marks) Using the sample variances from (e) to estimate the population stratum variances,
determine the optimal allocation for a sample in which the cost is the same in each stratum
and the total sample size is 150. How much does the optimal allocation differ from proportional
allocation for this scenario?
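The calculations behind parts (b), (e) and (f) can be sketched in base R as follows. This is an illustration on simulated data, not a solution: the stratum sizes, the sample and the seed are hypothetical, the interval uses a normal approximation, and the optimal allocation shown is Neyman allocation (equal cost per stratum) with sample variances standing in for the population stratum variances.

```r
set.seed(1234)  # placeholder seed

# Hypothetical stratum sizes and a toy stratified sample (not the baseball data)
N_h  <- c(A = 200, B = 300, C = 500)            # population stratum sizes
n_h  <- c(A = 20,  B = 30,  C = 50)             # proportional allocation, n = 100
samp <- data.frame(
  stratum = rep(names(n_h), times = n_h),
  y       = rnorm(sum(n_h), mean = rep(c(12, 13, 14), times = n_h),
                  sd = rep(c(0.5, 1, 2), times = n_h))
)

ybar_h <- tapply(samp$y, samp$stratum, mean)    # stratum sample means
s2_h   <- tapply(samp$y, samp$stratum, var)     # stratum sample variances
W_h    <- N_h / sum(N_h)                        # stratum weights

# Stratified mean and its estimated variance (with finite population correction)
ybar_st <- sum(W_h * ybar_h)
v_st    <- sum(W_h^2 * (1 - n_h / N_h) * s2_h / n_h)
ci      <- ybar_st + c(-1, 1) * qnorm(0.975) * sqrt(v_st)   # approximate 95% CI

# Neyman allocation with equal costs: n_h proportional to N_h * S_h
n_total <- sum(n_h)
n_opt   <- n_total * N_h * sqrt(s2_h) / sum(N_h * sqrt(s2_h))
rbind(proportional = n_h, optimal = round(n_opt, 1))
```

For a proportion, the same code applies with y coded as 0/1, since the sample variance of a 0/1 variable equals n_h / (n_h - 1) * p_h * (1 - p_h). When the stratum variances are similar, the two allocation rows are nearly identical, which is the judgement part (e) asks for.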