UNIVERSITY OF TORONTO
Faculty of Arts and Science
STA304H1F/1003HF FALL 2015 MIDTERM TEST #2 SOLUTIONS
November 25, 2015    Duration: 50 minutes
Aids: Two-sided handwritten notes (8 1/2 x 11) and a non-programmable calculator.
Instructions: This test consists of 4 questions on 7 pages. Please answer all questions on the question
paper, showing all your work and using proper English. The maximum mark for this test is 50.
1. (9 marks) An auditor is confronted with a long list of accounts receivable for a firm. She must verify the
amounts on 10% of these accounts and estimate the average difference between the audited and book
values.
(a) (3 marks) Suppose the accounts are arranged chronologically (according to their dates), with the
older accounts tending to have smaller values. Would systematic or random sampling be preferred?
Explain briefly.
In this case systematic sampling would be preferred, as the population is ordered. [2]
Thus, the variance of an estimate from a systematic sample would be expected to be
smaller. [1]
OR: A systematic sample would give a better representation of the population...
OR: (Any other sound reason)
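The variance claim can be checked on a toy population. The sketch below is an illustration only: it assumes a hypothetical list of 100 accounts whose values rise steadily with position (values 1 to 100) and a 1-in-10 systematic sample. It enumerates every possible systematic sample to get the exact variance of the sample mean, and compares it with the usual SRS variance for the same sample size. None of these numbers come from the test.

```r
# Hypothetical ordered population: account values increase along the list.
y <- 1:100
N <- length(y)
k <- 10          # sampling interval (1-in-10, i.e. 10% of the accounts)
n <- N / k       # sample size

# Enumerate all k possible systematic samples (one per random start).
sys_means <- sapply(1:k, function(start) mean(y[seq(start, N, by = k)]))

# Exact variance of the systematic sample mean over the k equally likely starts.
var_sys <- mean((sys_means - mean(y))^2)

# Variance of the sample mean under SRS: (1 - n/N) * S^2 / n.
var_srs <- (1 - n / N) * var(y) / n

print(c(systematic = var_sys, srs = var_srs))
```

Under this assumed linear trend the systematic sample spreads itself across the whole range of values, which is why its variance comes out much smaller than the SRS variance.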
(b) (3 marks) Suppose the accounts are grouped by department, and then listed chronologically within
departments. The older accounts again tend to have smaller values. Would systematic or random
sampling be preferred? Explain briefly.
In this case (simple) random sampling would be preferred. [1]
Because the accounts are ordered within departments, the population behaves more
like a periodic population. [2]
OR: The population will have a cycle (large to small to large) along the list, so systematic
sampling could be biased and collect all large or all small accounts.
OR: Use stratified random sampling, with departments as strata. Within each stratum,
we can use simple random sampling, or systematic sampling to take advantage
of the chronology. [3]
OR: Use repeated systematic sampling to overcome the periodicity. [3]
OR: (Any other sound reason)
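The effect of periodicity can be illustrated the same way. The sketch below assumes a hypothetical worst case: 10 departments of 10 accounts each, with values rising from 1 to 10 inside every department, and a sampling interval equal to the department length. These figures are invented for illustration and are not part of the test.

```r
# Hypothetical periodic population: 10 departments, each listed oldest to
# newest, with values rising from 1 to 10 within every department.
y <- rep(1:10, times = 10)
N <- length(y)
k <- 10          # sampling interval equals the period (worst case)
n <- N / k

# Each systematic sample picks the same position in every department.
sys_means <- sapply(1:k, function(start) mean(y[seq(start, N, by = k)]))

# Exact variance of the systematic sample mean over the k possible starts.
var_sys <- mean((sys_means - mean(y))^2)

# Variance of the sample mean under SRS: (1 - n/N) * S^2 / n.
var_srs <- (1 - n / N) * var(y) / n

print(c(systematic = var_sys, srs = var_srs))
```

Here every systematic sample contains only one account value, so the systematic sample mean varies far more than the SRS mean.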
(c) (3 marks) Which of the following three estimation methods do you think is most appropriate to
estimate the desired population mean: ratio estimation, regression estimation or difference
estimation? Explain.
In this case, difference estimation is most appropriate, [1]
since audited and book values are highly correlated and both are measured on the
same scale. [1]
It is easier than regression estimation since the regression coefficient is set to one. [1]
AND/OR: Compared to ratio estimation, we would not necessarily expect regression
through the origin, and the aim is to estimate a difference rather than a ratio. [1]
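The relationship between the three estimators can be written out explicitly. The notation below is an assumption that follows the usual textbook convention (x is the book value, y the audited value, and \mu_x the known mean book value); it does not appear in the test itself.

```latex
\begin{align*}
\hat{\mu}_{yL} &= \bar{y} + b\,(\mu_x - \bar{x})
  && \text{(regression estimator)} \\
\hat{\mu}_{yD} &= \bar{y} + (\mu_x - \bar{x}) = \mu_x + \bar{d},
  \quad \bar{d} = \bar{y} - \bar{x}
  && \text{(difference estimator, } b = 1\text{)} \\
\hat{\mu}_{yR} &= \frac{\bar{y}}{\bar{x}}\,\mu_x
  && \text{(ratio estimator)}
\end{align*}
```

Setting b = 1 in the regression estimator gives the difference estimator, so no slope has to be estimated from the sample.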
2. (16 marks) A forest resource manager is interested in estimating the number of dead fir trees in a
300-acre area of heavy infestation. Using an aerial photo, he divides the area into 200 plots, each of 1.5
acres. Let x denote the photo count of dead firs and y the actual ground count for a simple random
sample of n = 10 plots. The total number of dead fir trees obtained from the photo count is τx = 4200.
The sample data is shown in the table and plotted in the figure below.
Plot sampled    1   2   3   4   5   6   7   8   9  10
Photo count    12  30  24  24  18  30  12   6  36  42
Ground count   18  42  24  36  24  36  14  10  48  54
(Note: considerations were made for the typo corrected in the above table for the ground count of the
8th plot sampled.)
[Figure: scatterplot of ground count (vertical axis, 10 to 50) against photo count (horizontal axis, 5 to 40) for the 10 sampled plots.]
(a) (4 marks) Construct a ratio estimate of the total number of dead firs in the 300-acre area.
[Omitted: Place a bound on the error of estimation.]
$r = \bar{y}/\bar{x} = 30.6/23.4 = 1.307692$
$\hat{\tau}_y = r\tau_x = 1.307692(4200) = 5492.31$
Hence, a ratio estimate of the number of dead firs in the 300-acre area is 5492.31 trees.
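The ratio estimate can be reproduced in R from the table. The sketch below uses the sample data and the known photo total of 4200 from the question. It also computes a bound on the error of estimation with the standard textbook variance estimator for a ratio-estimated total; that bound is an addition for illustration and is not part of the marked solution.

```r
# Sample data from the table (n = 10 plots out of N = 200).
photo  <- c(12, 30, 24, 24, 18, 30, 12, 6, 36, 42)
ground <- c(18, 42, 24, 36, 24, 36, 14, 10, 48, 54)
N <- 200
n <- length(photo)
tau_x <- 4200    # known total photo count

# Ratio estimate of the total ground count.
r <- mean(ground) / mean(photo)
tau_hat_ratio <- r * tau_x

# Assumed (textbook) bound: 2 * sqrt(N^2 * (1 - n/N) * s_r^2 / n),
# where s_r^2 is the variance of the residuals y_i - r * x_i.
s2_r <- sum((ground - r * photo)^2) / (n - 1)
B_ratio <- 2 * sqrt(N^2 * (1 - n / N) * s2_r / n)

print(c(r = r, tau_hat = tau_hat_ratio, bound = B_ratio))
```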
(b) (4 marks) The model $y_i = \alpha + \beta(x_i - \bar{x})$ was fitted to these data, and some related R output appears
below. The estimates of $\alpha$ and $\beta$ were 30.60 and 1.26 respectively, to 2 decimal places. Construct
a regression estimate for the total number of dead firs. Place a bound on the error of estimation.
> reg_model <- lm(ground ~ I(photo - mean(photo)))
> summary(reg_model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.6000 1.1507 26.59 4.30e-09 ***
I(photo - mean(photo)) 1.2594 0.1057 11.91 2.27e-06 ***
Residual standard error: 3.639 on 8 degrees of freedom
F-statistic: 141.9 on 1 and 8 DF, p-value: 2.269e-06
> mean(ground)
[1] 30.6
> mean(photo)
[1] 23.4
> sum(residuals(reg_model)^2)/8
[1] 13.24012
$\hat{\mu}_{yL} = 30.6 + (1.2594)\left(\frac{4200}{200} - 23.4\right) = 27.57744$
$\hat{\tau}_{yL} = N\hat{\mu}_{yL} = 200(27.57744) \approx 5515.50$ (using the unrounded slope)
A regression estimate of the total number of dead fir trees is 5515.5 trees.
A bound on the error of estimation is found by
$B = 2(200)\sqrt{\left(1 - \frac{10}{200}\right)\frac{13.24012}{10}} = 448.6$
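The regression estimate and its bound can be reproduced from the table with `lm()`. The sketch below refits the centred model shown in the R output and applies the same formulas as the solution. The variable names other than `photo` and `ground` are chosen here for illustration.

```r
# Sample data from the table (n = 10 plots out of N = 200).
photo  <- c(12, 30, 24, 24, 18, 30, 12, 6, 36, 42)
ground <- c(18, 42, 24, 36, 24, 36, 14, 10, 48, 54)
N <- 200
n <- length(photo)
tau_x <- 4200    # known total photo count

# Fit the centred model y_i = alpha + beta * (x_i - xbar).
fit <- lm(ground ~ I(photo - mean(photo)))
a <- coef(fit)[[1]]   # equals mean(ground)
b <- coef(fit)[[2]]   # estimated slope

# Regression estimate of the mean and of the total.
mu_hat_L  <- a + b * (tau_x / N - mean(photo))
tau_hat_L <- N * mu_hat_L

# Bound on the error of estimation, using the residual mean square (n - 2 df).
mse <- sum(residuals(fit)^2) / (n - 2)
B_L <- 2 * N * sqrt((1 - n / N) * mse / n)

print(c(tau_hat = tau_hat_L, mse = mse, bound = B_L))
```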
(c) (3 marks) Do you think that regression estimation is better than ratio estimation for this problem?
Explain.
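Not necessarily. We expect the regression line to pass through the origin, in which
case ratio estimation is appropriate. (For reference, see the alternative R regression
output below, where the intercept is not significantly different from zero.)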
Call:
lm(formula = ground ~ photo)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1307 2.7286 0.414 0.689
photo 1.2594 0.1057 11.911 2.27e-06 ***
Residual standard error: 3.639 on 8 degrees of freedom
F-statistic: 141.9 on 1 and 8 DF, p-value: 2.269e-06
(d) (i) (1 mark) Compute an estimate of the total number of dead firs using the ground count
data only.
$\hat{\tau} = N\bar{y} = 200(30.6) = 6120$
(ii) (2 marks) Is your estimator in (i) unbiased or biased? Explain. (No calculations necessary.)
It is unbiased since $\bar{y}$ is an unbiased estimator of $\mu_y$ and $\hat{\tau}$ is a linear function of $\bar{y}$.
(iii) (2 marks) Do you expect that your estimator in (i) will be more efficient than the ratio
estimator in part (a)? Explain. (No further calculations necessary.)
No, since photo count and ground count are strongly positively correlated, we
expect that the ratio estimator will be more precise.
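Although no calculation is required, the comparison can be illustrated numerically. The sketch below computes the usual SRS bound for the estimator in (i) and, for comparison, a bound for the ratio estimator based on the standard textbook variance estimator. Both bound formulas are assumptions taken from standard survey sampling texts, not from the marked solution.

```r
# Sample data from the table (n = 10 plots out of N = 200).
photo  <- c(12, 30, 24, 24, 18, 30, 12, 6, 36, 42)
ground <- c(18, 42, 24, 36, 24, 36, 14, 10, 48, 54)
N <- 200
n <- length(ground)

# Estimator from (i): ground counts only.
tau_hat_srs <- N * mean(ground)
B_srs <- 2 * N * sqrt((1 - n / N) * var(ground) / n)

# Ratio estimator: residuals y_i - r * x_i replace the deviations y_i - ybar.
r <- mean(ground) / mean(photo)
s2_r <- sum((ground - r * photo)^2) / (n - 1)
B_ratio <- 2 * N * sqrt((1 - n / N) * s2_r / n)

print(c(tau_hat_srs = tau_hat_srs, bound_srs = B_srs, bound_ratio = B_ratio))
```

The residuals about the ratio line are much less variable than the ground counts themselves, which is the numerical counterpart of the strong positive correlation cited in the answer.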
3. (12 marks) Define any three of the following terms, and illustrate each with an example:
(a) observational study (b) margin of error (c) model-based estimation
(d) post-stratification (e) two-stage cluster sampling (f) probability sampling
(g) standard error (h) repeated systematic sampling (i) unbiased estimator
[Any three; 4 marks each (2 for definition and 2 for example). Examples will vary.]
(a) An observational study draws inferences about the effect of an “exposure” where the
assignment of subjects to groups is observed rather than manipulated by the investigator.
Eg.: A study comparing the risk of developing lung cancer between smokers and non-smokers.
(b) Commonly, the margin of error is half the length of a confidence interval or the same
as the bound on the error of estimation. It describes the precision of the estimator.
Eg.: Using $\bar{y}$ to estimate $\mu$, a margin of error is $\pm 2\,SE(\bar{y})$.
(c) In model-based estimation, a model motivates the form of the estimator and how
variability is estimated. This is in contrast to design-based estimation where sampling
variability is determined by the sampling design.
Eg.: In model-based estimation, the variance is the average squared deviation of the
estimate from its expected value over all possible samples that could be generated from
the population model.
(d) Post-stratification is a way of improving an estimator after a simple random sample is
collected, by stratifying on an important auxiliary variable.
Eg.: to estimate average weight of a human population, we might stratify our random
sample by sex after collecting our sample, so that we adjust for possible imbalance in
the observed sample.
(e) A two-stage cluster sample is obtained by first selecting a probability sample of
clusters (primary sampling units) and then selecting a probability sample of elements
(secondary sampling units) from each sampled cluster.
Eg.: In a city ward, suppose we are interested in grade 3 performance. Schools can
be considered as psu’s and students within selected schools as ssu’s. A 2-stage
cluster sample can be conducted by randomly selecting schools within the ward and then
randomly selecting grade 3 students in the selected schools.
(f) In a probability sample, each unit in the population has a known probability of
selection, and a random number table or other randomization mechanism is used to choose
the specific units to be included in the sample.
Eg.: SRS is the simplest form of probability sampling.
(g) Standard error is the standard deviation of (the sampling distribution of) a sample
statistic.
Eg.: For $\bar{y}$ from an SRS, the standard error is $\sigma/\sqrt{n}$ (ignoring the finite population correction).
(h) Repeated systematic sampling involves taking several systematic samples to make up
the entire sample.
Eg.
(i) An estimator $\hat{\theta}$ of a population parameter $\theta$ is unbiased if its expectation equals the
parameter, i.e., $E(\hat{\theta}) = \theta$.
Eg.: The sample mean $\bar{y}$ is unbiased for the population mean $\mu$.
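Definitions (g) and (i) can be checked by brute force on a tiny population. The sketch below assumes a hypothetical population of five values and enumerates every possible SRS of size 2. It confirms that the sample means average to the population mean, and that their standard deviation matches the SRS standard error formula with the finite population correction. The population values are invented for illustration.

```r
# Hypothetical population of N = 5 values; all SRS samples of size n = 2.
y <- c(2, 4, 6, 8, 10)
N <- length(y)
n <- 2

# combn() returns a matrix whose columns are the possible samples.
samples <- combn(y, n)
sample_means <- colMeans(samples)

# Unbiasedness: the average of the sample means over all equally likely samples.
expected_value <- mean(sample_means)

# Standard error: standard deviation of the sampling distribution of ybar,
# compared with the formula sqrt((1 - n/N) * S^2 / n).
se_enumerated <- sqrt(mean((sample_means - expected_value)^2))
se_formula <- sqrt((1 - n / N) * var(y) / n)

print(c(expected = expected_value, mu = mean(y),
        se_enumerated = se_enumerated, se_formula = se_formula))
```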
4. (a) (9 marks) A language school owner takes an SRS of 10 of the 72 Introductory Spanish classes
offered by the school. Each student in each of the sampled classes is asked whether he or she is
planning a trip to a Spanish-speaking country in the next year. (Note: total marks for this part (a)
corrected to 9 marks.)
i. (3 marks) Describe why this is a one-stage cluster sampling design. What is the primary
sampling unit? What is the secondary sampling unit?
The primary sampling units are classes and the secondary sampling units are
students within the classes. [2]
This is a one-stage cluster sample since classes are randomly selected and each
student within the selected class is surveyed. [1]
ii. A. (5 marks) Suppose the owner wanted to estimate the total number of students planning a
trip to a Spanish speaking country in the next year, of the students in the 72 Introductory
Spanish classes. Using data from the 10 randomly selected classes, describe formulas for a
ratio estimator and an unbiased estimator, that the owner can use to estimate the total.
Use the notation:
• M=total number of students in the school
• N= total number of Introductory Spanish classes offered by the school
• n=the number of classes selected
• mi=number of students in the ith class, i = 1, . . . , N
• yi=total number of students in ith class who are planning a trip to a Spanish-speaking
country;
however, specify their values where possible.
B. (1 mark) Under what conditions would the two estimators (the ratio estimator
and the unbiased estimator in ii. A.) be equivalent?
A. (2 marks for each formula)
Ratio estimator: $\hat{\tau}_y = M \, \dfrac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} m_i}$
Unbiased estimator: $\hat{\tau}_y = N \, \dfrac{\sum_{i=1}^{n} y_i}{n}$
where N = 72, n = 10 and M is unknown. [1]
B. The two estimators will be equivalent when cluster sizes are the same, i.e.,
$m_1 = m_2 = \ldots = m_N = m$ (so that $M = Nm$).
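The two formulas and the equivalence condition can be sketched in R. Only N = 72 and n = 10 come from the question. The class sizes, the counts `y`, and the value of M below are hypothetical, chosen purely to illustrate the arithmetic (the test states that M is unknown), and the function names are made up for this example.

```r
# Ratio and unbiased estimators of a cluster total (one-stage cluster sample).
ratio_total    <- function(y, m, M) M * sum(y) / sum(m)
unbiased_total <- function(y, N) N * sum(y) / length(y)

N <- 72                                  # classes offered (from the question)
y <- c(4, 6, 2, 5, 3, 7, 4, 5, 6, 3)     # hypothetical counts in n = 10 classes

# Case 1: every class has the same (hypothetical) size m = 20, so M = N * m.
m_equal <- rep(20, length(y))
M <- N * 20
tau_ratio_equal <- ratio_total(y, m_equal, M)
tau_unbiased    <- unbiased_total(y, N)

# Case 2: hypothetical unequal class sizes, with the same assumed M.
m_unequal <- c(15, 25, 18, 22, 20, 30, 16, 24, 19, 21)
tau_ratio_unequal <- ratio_total(y, m_unequal, M)

print(c(ratio_equal = tau_ratio_equal, unbiased = tau_unbiased,
        ratio_unequal = tau_ratio_unequal))
```

With equal class sizes the two estimates coincide. Once the sampled class sizes differ, the ratio estimate moves away from the unbiased one.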
(b) (4 marks) Under what conditions does cluster sampling produce a smaller bound on the error of
estimation for the population mean than simple random sampling? Explain.
Cluster sampling produces estimates of better precision than simple random sampling
when the clusters are heterogeneous within [2], with respect to the measurement of
interest, and have similar cluster means as we move from one cluster to another [2].
END OF TEST
Q1 Q2 Q3 Q4 Total
9 16 12 13 50