QM2 All Tutorial Answers (Organized by Week)
Quantitative Methods 2 (University of Melbourne)
L. Kónya, 2020, Semester 2
ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 1

Solutions

Exercises for Assessment


Exercise 2

One of the major measures of the quality of service provided by any organisation is the
speed with which the organisation responds to customer complaints. Last year the flooring
department of a large family-owned department store received 50 complaints about carpet
installation. The following data represent the number of days between the receipt and
resolution of these complaints.


Days 
54  35  29  2  1 
11  126  4  35  26 
12  165  27  26  74 
13  5  29  22  26 
33  137  28  123  14 
5  110  52  94  20 
19  32  152  25  27 
4  27  61  36  5 
10  31  29  81  13 
68  110  30  31  23 


a) Is the variable Days qualitative or quantitative? If it is quantitative, is it discrete or
continuous? In addition, determine its level of measurement. Explain your answers.

The observations are numbers of days resulting from a counting process, and the possible
values are non-negative integers. Therefore, Days is a quantitative variable and it is discrete
(countably infinite). The measurement scale is ratio since there is a unit of measurement
(day) and a genuine zero point (0 days).


b) Launch RStudio and close the Script tab, if it is open at all. Create a new RStudio project
and script, and name both t1e2.

Follow similar steps to those in Exercise 1.


c) Enter the observations from your keyboard to an RStudio spreadsheet and save them
in an RData file. Quit RStudio. When prompted, save only the t1e2.R file.

Follow similar steps to those in Exercise 1.
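
For instance, the entry and save steps might look like this in the script (a minimal sketch; the vector name Days and the file name t1e2.RData follow the exercise, but Exercise 1's exact spreadsheet-based steps may differ):

Days = c(54, 35, 29, 2, 1, 11, 126, 4, 35, 26, 12, 165, 27, 26, 74,
         13, 5, 29, 22, 26, 33, 137, 28, 123, 14, 5, 110, 52, 94, 20,
         19, 32, 152, 25, 27, 4, 27, 61, 36, 5, 10, 31, 29, 81, 13,
         68, 110, 30, 31, 23)    # the 50 complaint-resolution times
save(Days, file = "t1e2.RData")  # save the data in an RData file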


d) Open your working directory. Capture your screen by taking a screenshot (Alt + Print
Screen) and paste it with your answers for part (a) in a Word document.






ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 2

Solutions


Exercises for Assessment


Exercise 4

In this exercise you are going to work on the data you saved in Exercise 2 last week.

a) Launch RStudio and close the Script tab, if it is open. Create a new RStudio project and
script, and name both t2e4. Retrieve the t1e2 data set and save it as t2e4.RData.

You can complete these tasks by following similar steps to those in Exercise 2 of Tutorial 2.
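
A minimal sketch of the retrieve-and-save step (assuming last week's data were saved as t1e2.RData and that the file sits in, or is copied into, the new project folder):

load("t1e2.RData")               # retrieve the Days vector saved in Tutorial 1
save(Days, file = "t2e4.RData")  # re-save it under the new name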

The variable of interest, Days, is a discrete quantitative variable. The data set is cross-
sectional and it can be displayed graphically with, for example, a histogram or a boxplot.

b) Use RStudio to illustrate the data on Days with a histogram. Customize your plot as you
did in Exercise 3. Briefly describe what the graph tells you.

A basic histogram is generated by the following command:

hist(Days)

In return, RStudio displays the first plot on the next page. It is black and white and looks a
bit strange because the axes are too short. However, it can be easily improved by adding a
few arguments:

hist(Days,
xlim = c(0,200), ylim = c(0, 25),
col = "yellow")

The new histogram is second on the next page.

These histograms show that the sample data of Days is heavily skewed to the right and that
the second class interval, from 20 to 40, has the highest frequency, 21.







c) Use RStudio to illustrate the data on Days with a boxplot and customize your plot. Briefly
describe what the graph tells you.

Use the boxplot(Days) command to develop a basic boxplot and then add a main title to it,
add the Days label to the vertical axis, and colour the rectangle on the boxplot red.

A basic boxplot is generated by the

boxplot(Days)

command:




















To add the required customization, execute

boxplot(Days,
main = "Boxplot for Days",
ylab = "Days",
col = "red")

The new boxplot is on the next page.

It shows that in the sample of Days, (i) the median (Q2) is a bit below 30, (ii) the first quartile
(Q1) is about 15, (iii) the third quartile (Q3) is a bit above 50, (iv) Q1 – 1.5 (Q3 – Q1) is below
zero, so the lower whisker ends at the minimum, (v) Q3 + 1.5 (Q3 – Q1) is about 110, and
(vi) there are a few outliers at the upper end of the range.1



1 Observations that differ greatly from the majority of the data set in the sense that they are either smaller than
Q1 – 1.5 (Q3 – Q1) or larger than Q3 + 1.5 (Q3 – Q1) are considered to be outliers.
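
These landmarks can be verified directly (a sketch; quantile() uses R's default quantile definition, so the values may differ slightly from hand calculations):

quantile(Days)                          # minimum, quartiles and maximum of Days
IQR(Days)                               # interquartile range, Q3 - Q1
quantile(Days, 0.75) + 1.5 * IQR(Days)  # upper fence, beyond which points are outliers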
Exercise 5

The table below details the number of international visitors (aged 15 years and over) to
Australia from its top 10 markets during the 2018/19 financial year by country of residence
(COR).2


Overseas arrivals ('000) by country of residence (COR)
COR  Visitors 
China  1331 
Hong Kong  284 
India  364 
Japan  455 
Korea  250 
Malaysia  344 
New Zealand  1276 
Singapore  417 
UK  670 
US  771 





2 Source: Estimates for the year ending June 2019 from the International Visitor Survey, Data, Table 1a,
https://www.tra.gov.au/International/International-tourism-results/overview.
a) There are two variables: Market and Visitors. Are they qualitative or quantitative,
discrete or continuous? Explain your answers.

COR, i.e. the market, is a qualitative variable as its possible values are names / labels. Visitors,
i.e. the number of international visitors aged 15 years and over to Australia, is a quantitative
variable because the possible values are numbers resulting from a counting process. This
variable is inherently discrete, with non-negative integer possible values, but the actual
observations have been rounded to the nearest thousand.

b) Launch RStudio, create a new RStudio project and script (t2e5), enter the observations
from your keyboard to an RStudio spreadsheet and save it as an RData file.

Follow similar steps to those in Exercises 1 and 2 of Tutorial 1.
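
A minimal sketch of the data entry (the variable names follow the exercise):

COR = c("China", "Hong Kong", "India", "Japan", "Korea",
        "Malaysia", "New Zealand", "Singapore", "UK", "US")
Visitors = c(1331, 284, 364, 455, 250, 344, 1276, 417, 670, 771)
save(COR, Visitors, file = "t2e5.RData")  # save both variables in an RData file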

c) Depict the number of visitors by country of residence (market) with a bar graph.3

Use the barplot(Visitors) command to develop a basic bar graph.

It returns the following plot:
d) Annotate your bar graph with axes labels Country of Residence (x-axis), Visitors to
Australia (y-axis) and with the Bar graph for Visitors to Australia title.

Review the application of the main, ylab and xlab arguments in Exercise 3.

The following command



3 Notice that this time a histogram would be inappropriate because the observations are classified by
categories (countries of origin) rather than adjacent class intervals.
[Figure: basic bar graph of Visitors, bars unlabelled, vertical axis 0 to 1000]
barplot(Visitors,
main = "Bar graph of Visitors to Australia",
xlab = "Country of residence",
ylab = "Number of visitors")

returns


e) Increase the scale on the vertical axis to (0,1400) and colour the bars orange.

Review the application of the ylim and col arguments in Exercise 3.

The following command

barplot(Visitors,
main = "Bar graph of Visitors to Australia",
xlab = "Country of residence",
ylab = "Number of visitors",
ylim = c(0,1400),
col = "orange")

returns the bar graph shown on the next page.



[Figure: "Bar graph of Visitors to Australia" – x-axis: Country of residence, y-axis: Number of visitors]

f) To make the bar graph more informative, expand the barplot command with the
names.arg = COR and cex.names = 0.5 arguments.

The expanded command is

barplot(Visitors,
main = "Bar graph of Visitors to Australia",
xlab = "Country of residence",
ylab = "Number of visitors",
ylim = c(0,1400),
col = "orange",
names.arg = COR, cex.names = 0.5)

It returns the bar graph shown on the next page.

g) Briefly describe what the bar graph in part (f) tells you.

This bar graph shows that in 2018/19 the largest numbers of visitors to Australia arrived from
China, followed by New Zealand, the US and the UK.
[Figure: "Bar graph of Visitors to Australia" – x-axis: Country of residence, y-axis: Number of visitors, scale 0–1400]


Although it was not part of Exercise 5, one more thing is worth mentioning. To make
this bar graph even more informative, it is a good idea to display the bars in descending
order of their heights. Let's do this in three steps.

First, we set up a data frame called original that consists of COR and Visitors by executing
the

original = data.frame(COR, Visitors)

command.

Second, we rearrange original in the descending order of Visitors and call the new data
frame ordered. The relevant command is

ordered = original[order(-original$Visitors),]

Third, we run the barplot command as in part (f), but on the ordered data frame, i.e.

barplot(ordered$Visitors,
main = "Bar graph of Visitors to Australia",
xlab = "Country of residence",
ylab = "Number of visitors",
ylim = c(0,1400),
col = "orange",
names.arg = ordered$COR, cex.names = 0.5)
[Figure: "Bar graph of Visitors to Australia" with country-name labels under the bars – x-axis: Country of residence, y-axis: Number of visitors, scale 0–1400]
The new bar graph looks like this:







[Figure: "Bar graph of Visitors to Australia" with the bars in descending order: China, New Zealand, US, UK, Japan, Singapore, India, Malaysia, Hong Kong, Korea]
ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 3

Solutions

Exercises for Assessment


Exercise 6 (Selvanathan, p. 397, ex. 10.43)

A parking officer is conducting an analysis of the amount of time left on parking meters. A
quick survey of 15 cars that have just left their metered parking spaces produced the times
(T, in minutes) saved in the t3e6 Excel file. Assuming that the population of T is normally
distributed, estimate with 95% confidence the mean amount of time left for all the vacant
meters. Do the calculations first manually and then with R.

Since the population of T is said to be normally distributed and the population standard
deviation is unknown, the appropriate confidence interval estimator for the mean is

x̄ ± t_{α/2} · s_x̄

Using your hand calculator you can obtain the sample mean and the sample standard
deviation:

x̄ = 18.133 ,  s = 9.753

From the sample standard deviation and the sample size, the estimate of the standard error
of the sample mean is

s_x̄ = s / √n = 9.753 / √15 = 2.518

From the t-table the 97.5th percentile of the t distribution with df = n - 1 = 14 is 2.145.

Putting all these together,

x̄ ± t_{α/2} · s_x̄ = 18.133 ± 2.145 × 2.518 = (12.732 , 23.534)

Therefore, with 95% confidence, the mean amount of time left for all the vacant meters is
somewhere between 12.732 and 23.534 minutes.
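
The manual steps can be reproduced in R (a sketch, assuming the 15 times have been imported into a vector called T, as in the t.test command below):

xbar = mean(T); s = sd(T); n = length(T)  # sample statistics
se = s / sqrt(n)                          # standard error of the sample mean
tcrit = qt(0.975, df = n - 1)             # 97.5th percentile of t with df = 14
c(xbar - tcrit * se, xbar + tcrit * se)   # 95% confidence limits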

To obtain this confidence interval in R, import the data to RStudio and execute the

t.test(T, mu = 0, conf.level = 0.95)

command, which returns:


The 95% confidence interval on this printout confirms our manual calculations.



Exercise 7 (Selvanathan, p. 499, ex. 12.41)

In this exercise do all calculations manually.

a) A random sample of eight observations was taken from a normal population. The sample
mean and standard deviation are 75 and 50, respectively. Can we infer at the 10%
significance level that the population mean is less than 100?

Just like in the previous exercise, we are interested in the mean of an allegedly normally
distributed population whose standard deviation is unknown. This time, however, instead of
developing a confidence interval to estimate the population mean, we need to perform a
hypothesis test. Let’s follow the six-step test procedure.

The hypotheses are1

H0: μ = 100 ,  HA: μ < 100

The sample mean is normally distributed, but since its standard error must be estimated
from the sample, the test statistic is

T = (X̄ − μ0) / s_X̄

Under the null hypothesis this test statistic has a t distribution with df = n – 1.

The significance level is 10% and the critical value for this left-tail test is

−t_{α,df} = −t_{0.10,7} = −1.415

and we reject the null hypothesis if the calculated test statistic happens to be smaller than
this critical value.

The calculated or observed value of the test statistic is

1 It is easier to start with the alternative hypothesis because it is implied by the question. This is usually the
case, except when the implied statement takes the form of an equality that must be in the null hypothesis.

t_obs = (x̄ − μ0) / s_x̄ = (75 − 100) / (50 / √8) = −1.414

Since the observed value of the test statistic is (slightly) larger than the critical value (-1.415),
we maintain H0 and conclude that at the 10% level there is not enough evidence to infer that
the population mean is smaller than 100.
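
As a quick sketch in R from the summary statistics (the raw data are not given):

tobs = (75 - 100) / (50 / sqrt(8))  # observed test statistic, about -1.414
qt(0.10, df = 7)                    # left-tail critical value, about -1.415
pt(tobs, df = 7)                    # p-value of the left-tail test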

b) Repeat part (a) assuming that you know that the population standard deviation is 50.

If the population standard deviation is known and it is σ = 50, then the test statistic is

Z = (X̄ − μ0) / σ_X̄

The critical value is

−z_α = −z_{0.10} = −1.282

Although the test statistic is different from that in part (a), its calculated value is the same:

z_obs = (x̄ − μ0) / σ_x̄ = (75 − 100) / (50 / √8) = −1.414

Since the observed value of the test statistic is smaller than the critical value (-1.282), we
reject H0 and conclude that at the 10% level there is enough evidence to infer that the
population mean is less than 100.
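
Again as a sketch in R:

(75 - 100) / (50 / sqrt(8))  # observed z statistic, about -1.414
qnorm(0.10)                  # left-tail critical value, about -1.282
pnorm(-1.414)                # p-value of the left-tail test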

c) Review parts (a) and (b). Explain why the conclusions differ.

The tests in parts (a) and (b) led to different conclusions. This is due to the fact that in part
(a) we had to use the t distribution, while in part (b) we could use the standard normal
distribution. Both distributions are symmetric around zero, but the t distribution is more
dispersed than the standard normal distribution and hence the critical value in part (a) is
further from zero than in part (b).



Exercise 8

Environmental engineers have found that the percentages of active bacteria in sewage
specimens collected at a sewage treatment plant have a non-normal distribution with a
median of 40% when the plant is running properly. If the median is larger than 40%, some
adjustments must be made. The percentages of active bacteria (PAB) in a random sample
of 10 specimens are saved in the t3e8 Excel file. Do the data provide enough evidence (at
α = 0.05) to indicate that adjustments are needed?




a) What are the null and alternative hypotheses?

Unlike the previous exercise, this one is about a population median. The hypotheses are

H0: η = 40 ,  HA: η > 40
(where η denotes the population median)

b) Which test(s) can be used to answer this question? What are the required conditions?
Do you think that these conditions are likely satisfied this time? Explain your answer.

We learnt about two nonparametric tests that can be used this time, the one-sample sign
test for the median and the one-sample Wilcoxon signed ranks test for the median.

The sign test assumes that (i) the data is a random sample, (ii) the variable of interest is
qualitative or quantitative, and (iii) the measurement scale is at least ordinal. In this case we
are told that the sample at hand is a random sample. The variable of interest, PAB, is a
quantitative variable measured on a ratio scale. Hence, all three requirements are met.

The Wilcoxon signed ranks test assumes that (i) the data is a random sample, (ii) the
variable of interest is quantitative and continuous, (iii) the measurement scale is interval or
ratio, and (iv) the distribution of the sampled population is symmetric. The first three
requirements are clearly met. As for the fourth one, due to the small sample size it is difficult
to verify. Let's just assume at this stage that it is satisfied and see whether the Wilcoxon
signed ranks test leads to the same conclusion as the sign test. If it does, then the issue of
symmetry is irrelevant.

c) Perform the test(s) first manually and then with R. Explain your decision and conclusion.

Sign test:

PAB             41   33   43   52   46   37   44   49   53   30
DEV = PAB − 40   1   −7    3   12    6   −3    4    9   13  −10

There are three negative deviations and seven positive deviations, so

S− = 3 ,  S+ = 7 ,  n = S− + S+ = 10

The test is a right-tail test and the test statistic is S = S+ = 7. From the binomial table of
Selvanathan (Table 1, Appendix B, pp. 1068-1071, n = 10, k = 7, p = 0.5), the probability of
observing at least 7 'successes' in 10 trials is P(S ≥ 7) = 1 − P(S ≤ 6) = 1 − 0.8281 = 0.1719.
Since this p-value is above the selected significance level (α = 0.05), we maintain the null
hypothesis at the 5% significance level and conclude that there is no need for adjustment.
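
The binomial probability can also be obtained directly in R:

1 - pbinom(6, size = 10, prob = 0.5)  # P(S >= 7) = 0.1719, the p-value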

To repeat this test with R, launch RStudio, create a new project and script (t3e8), import the
t3e8 data from Excel to RStudio, and execute the following commands:

attach(t3e8)
library(DescTools)
SignTest(PAB, mu = 40, alternative = "greater")

You should get the following printout:


The p-value is 0.1719, larger than 0.05, so at the 5% significance level there is not enough
evidence against the null hypothesis.


Wilcoxon signed ranks test:

PAB             41   33   43   52   46   37   44   49   53   30
DEV = PAB − 40   1   −7    3   12    6   −3    4    9   13  −10
ABSDEV           1    7    3   12    6    3    4    9   13   10
RANK             1    6   2.5    9    5   2.5    4    7   10    8

T− = 16.5 and T+ = 38.5. Their sum is 55 = (10)(11)/2. The test is a right-tail test and the test
statistic is T = T+ = 38.5. From the Wilcoxon Signed Rank Sum Test table of Selvanathan
(Table 9, Appendix B, p. 1089, Part (b), n = 10) the 5% one-tail critical values are TL = 11
and TU = 44. Since T+ = 38.5 < TU = 44, we maintain the null hypothesis at the 5%
significance level and conclude that there is no need for adjustment.

To repeat this test with R, execute the following commands:

library(exactRankTests)
wilcox.exact(PAB, mu = 40, alternative = "greater")

You should get the following printout:


The p-value is 0.1436, larger than 0.05, so at the 5% significance level there is not enough
evidence against the null hypothesis.

This time neither the sign test nor the Wilcoxon signed ranks test rejects the null hypothesis.
Since they lead to the same conclusion, we do not need to worry about whether the sampled
population is symmetric.

Quit RStudio and save your RData and R files.

ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 4

Solutions


Exercises for Assessment


Exercise 4 (Selvanathan et al., p. 887, ex. 20.9)

In recent years, insurance companies offering medical coverage have given discounts to
companies that are committed to improving the health of their employees. To help determine
whether this policy is reasonable, the general manager of one large insurance company in
the US organised a study of a random sample of 30 workers who regularly participate in
their company’s lunchtime exercise program and 30 workers who do not. Over a two-year
period, he observed the total dollar amount of medical expenses for each individual. The
data are stored in the t4e4 (column 1: Expenses; column 2: Exercise, yes or no) Excel file.
Do all calculations with R.

a) Can the manager conclude at the 5% significance level that companies that provide
exercise programs should be given discounts? Perform the independent-samples t-test
to answer the question.1 Do not forget to specify the null and alternative hypotheses.

Let XY and XN denote the medical expenses of those who regularly participate in their
company’s lunchtime exercise program and who do not, respectively.

Companies that provide exercise programs should be given discounts if their employees
have smaller average medical expenses than the employees of those companies which do
not provide this facility. Therefore, the question implies the following null and alternative
hypotheses

H0: μY − μN = 0 ,  HA: μY − μN < 0

You might recall that this test can be performed under three different scenarios, depending
on what we know or assume about the population variances. Hence, as in Exercise 2 of
Tutorial 4, you should first consider the sample variances. The

by(Expenses, Exercise, sd)

command returns the sample standard deviations:

Exercise: no
[1] 271.6985

1 We use the independent-samples t-test because we have two unrelated (hence, independent) random
samples of workers. However, if the sample consisted of pairs of workers where in each pair the two workers
are employed by the same company and one of them regularly participates in the lunchtime exercise program
while the other one does not, then we should perform the matched-pairs t-test.

Exercise: yes
[1] 266.278

They are quite similar, so we can assume that the corresponding population variances are
equal.2

The

t.test(Expenses ~ Exercise, alternative = "less",
var.equal = TRUE, conf.level = 0.95)

command returns


The t-test statistic is 0.85656, i.e. positive. This seems to contradict the alternative
hypothesis. However, as the sample estimates part of the printout shows, R considers
the No exercise group as group 1 and the Yes exercise group as group 2. Hence, we need
to re-write the hypotheses as

H0: μN − μY = 0 ,  HA: μN − μY > 0

This is a right-tail test, so execute

t.test(Expenses ~ Exercise, alternative = "greater",
var.equal =TRUE, conf.level = 0.95)

to obtain



2 Recall that we should compare the sample variances to each other, but since this time the sample standard
deviations are indeed close, the difference between their squares is not too large either.
The test statistic did not change3, but the p-value did. It is 0.1976, far too large to reject the
null hypothesis of equal population means. Hence, we conclude that the average medical
expenses of those employees who regularly participate in their company’s lunchtime
exercise program is not significantly smaller than the average medical expenses of those
employees who do not exercise.

If we are not willing to assume equal population variances, then the appropriate R command
is

t.test(Expenses ~ Exercise, alternative = "greater",
conf.level = 0.95)

and it returns


As you can see, this time the two t-tests have practically the same degrees of freedom and
identical observed test statistics and p-values, so it makes no difference whether we assume
equal or unequal population variances.

b) What assumptions must hold to ensure the validity of the hypothesis test in part (a)
above? Does it appear that these conditions are satisfied?

The samples must be random and independent. Given that this is a textbook example, we
have no reason to question these requirements. The variable of interest should be
quantitative and continuous. The total dollar amount of medical expenses is a quantitative
variable. As it is given in dollars, the actual observations are discrete, but there are so many
different possible values that we can treat this variable as continuous. The sample sizes are
30, so CLT holds. However, the population standard deviations are unknown, so the
sampled populations should be normally distributed (at least not extremely non-normal).

The histograms and the usual descriptive statistics can be obtained like in Exercise 2 of
Tutorial 2, combined with the subset command you used in Exercise 2 of tutorial 4.

hist(subset(Expenses, Exercise == "yes"), col = "blue")
hist(subset(Expenses, Exercise == "no"), col = "green")

return the histograms shown on the next page. Both histograms are skewed to the right,
indicating that the sampled populations are unlikely to be normally distributed.


3 The test statistic value, in general, depends on the hypothesized parameter value, but is the same for two-
tail, left-tail and right-tail tests.




qqnorm(subset(Expenses, Exercise == "yes"),
main = "Normal Q-Q Plot for Exercise = yes",
xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", col = "blue")
qqline(subset(Expenses, Exercise == "yes"), col = "red")
[Figures: histograms of Expenses for Exercise == "yes" and Exercise == "no"; both have Frequency on the vertical axis and are skewed to the right]

qqnorm(subset(Expenses, Exercise == "no"),
main = "Normal Q-Q Plot for Exercise = no",
xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", col = "green")
qqline(subset(Expenses, Exercise == "no"), col = "red")

produce the following normal QQ plots.



[Figures: "Normal Q-Q Plot for Exercise = yes" and "Normal Q-Q Plot for Exercise = no" – x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]
Only a few dots are close to the reference lines, so it seems unreasonable to assume that
the sub-populations of Expenses are normally distributed.

The

library(pastecs)
round(stat.desc(subset(Expenses, Exercise == "yes"),
basic = FALSE, desc = TRUE, norm = TRUE), 3)

commands provide the following statistics for the Exercise = yes group


while

round(stat.desc(subset(Expenses, Exercise == "no"),
basic = FALSE, desc = TRUE, norm = TRUE), 3)

returns


for the Exercise = no group.

As you can see, for both samples, the mean is far above the median and skewness appears
to be significantly positive4, confirming that the samples are skewed to the right. In addition,
the excess kurtosis statistics are also significantly positive5, so the distributions of the
samples are more peaked and have fatter tails than the corresponding normal distributions.
Finally, the reported p-values of the Shapiro-Wilk test are zero, rejecting the
null hypothesis of normality. Consequently, it is very unlikely that these samples have been
drawn from normally distributed populations.

c) Assuming that some of the assumption(s) mentioned above is (are) not satisfied, which
nonparametric hypothesis-testing procedure could be used? Conduct this test and give
the appropriate conclusion in the context of the problem. Compare your conclusions in
parts (a) and (c).

Since the sampled populations are most likely non-normal, the independent samples t-test
in part (a) is inappropriate. However, the samples are independent, the variable of interest
is continuous and is measured on a ratio scale, and the histograms suggest that the two
distributions have similar shapes. Therefore, we can rely on the Wilcoxon rank-sum test.



4 Look at the skew.2SE statistics, they are both well above 1.
5 Look at the kurt.2SE statistics, they are both well above 1.
The hypotheses are

H0: ηN = ηY ,  HA: ηN > ηY

The

library(exactRankTests)
wilcox.exact(Expenses ~ Exercise, alternative = "greater")

commands generate the following printout:



The p-value is 0.03681, so at the 5% level we reject H0 and conclude that the median
medical expenses of exercisers is significantly lower than the median medical expenses of
non-exercisers, so the insurance company should give discounts to companies that provide
exercise programs for their employees.

d) Compare your conclusions in parts (a) and (c).

In part (a) the parametric t-tests failed to reject the null hypothesis, but in part (c) the non-
parametric Wilcoxon rank-sum test detected ample evidence against the null hypothesis.
This illustrates that although parametric tests are more powerful than their non-parametric
counterparts when their assumptions are satisfied, they can perform poorly when some of
their assumptions are violated.

Quit RStudio and save your RData and R files.



Exercise 5 (Selvanathan et al., p. 886, ex. 20.5)

In a taste test of a new beer, 25 people rated the new beer and another 25 rated the leading
brand on the market. The possible ratings were Poor, Fair, Good, Very Good, and Excellent.

a) Suppose the responses for the new beer and the leading beer were stored using a 1-2-
3-4-5 coding system (1 = Poor, …, 5 = Excellent). Based on the data saved in the t4e5a
file, can we infer that the new beer is rated less highly than the leading brand?

The variable of interest is rating, and there are two populations: ratings of the new beer and
ratings of the leading brand on the market. Notice, that although the possible values of rating
are numbers (1, 2, 3, 4 and 5), these numbers are just labels used to identify the categories
(poor, fair, good, very good and excellent), which have a natural order. Therefore, rating is a
qualitative variable measured on an ordinal scale, and as such, it does not have a mean; its
central location is best captured by the median. It is also important to recognize that the data
set comprises two independent random samples on rating, one related to the new beer and
another one related to the leading beer on the market.

For these reasons, we need to perform a nonparametric test, namely the Wilcoxon rank-
sum test, for the comparison of the population medians. If we consider the population of
ratings of the new beer as population 1 and the population of ratings of the leading beer on
the market as population 2, then the relevant hypotheses are

H0: η1 = η2 ,  HA: η1 < η2

Launch RStudio, create a new project and script (t4e5), import the t4e5a data from Excel to
RStudio, load it to your project, and execute the following commands:

library(exactRankTests)
wilcox.exact(New_a, Leading_a, alternative = "less")

You should get the following output:


The p-value is 0.3929, far too big to reject the null hypothesis. Thus, the conclusion is that
the sample does not provide sufficient evidence to infer that the new beer is rated less highly
than the leading brand.

b) Suppose the responses were recoded so that 3 = Poor, 8 = Fair, 22 = Good, 37 = Very
Good, and 55 = Excellent. Based on the recoded data, saved in the t4e5b file, can we
infer that the new beer is rated less highly than the leading brand?

Import the t4e5b data from Excel to RStudio, load it to your existing project, and execute the
following commands:

library(exactRankTests)
wilcox.exact(New_b, Leading_b, alternative = "less")

You should get the following output:


Apart from the variable names, this printout is the same as in part (a).




c) What does this exercise tell you about ordinal data?

In the case of ordinal data, the actual numbers used to recode the original names or labels
of the categories are arbitrary. One could use any set of numbers, provided that
there are as many different numbers as categories and that these numbers preserve the
ranking of the categories.
ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 5

Solutions

Exercises for Assessment


Exercise 5

In Exercise 2 of Tutorial 4, first we developed a confidence interval for the difference
between the mean ages of purchasers and non-purchasers of a particular brand of
toothpaste (part a), and then performed a t-test to see whether there was sufficient evidence
to conclude that there was a difference in the mean age of purchasers and non-purchasers
(part b). Based on the sample variances, in both cases we assumed that the two unknown
population variances are different. Let’s check now whether this assumption is supported by
the data.

Namely, using the same data,

a) Estimate the ratio of the two population variances with 95% confidence.

Last week we already checked the normality assumption and found no sign of extreme non-
normality. Thus, we can develop the required confidence interval the same way as we did
in Exercise 2.

The sample variances are s1² = 13.62119² = 185.54 and s2² = 10.03992² = 100.80.

Both sample sizes are 20 and the confidence level is (1 − α)100% = 95%. The required F
percentiles from Table 6(b) of Selvanathan (Appendix B, pp. 1080-81) are

F_{α/2, n1−1, n2−1} = F_{0.025,19,19} = 2.53

and

F_{1−α/2, n1−1, n2−1} = F_{0.975,19,19} = 1 / F_{0.025,19,19} = 1 / 2.53 = 0.395

The confidence interval estimate is

( (s1²/s2²) / F_{α/2, n1−1, n2−1} ,  (s1²/s2²) / F_{1−α/2, n1−1, n2−1} )
= ( (185.54/100.80) / 2.53 ,  (185.54/100.80) / 0.395 ) = (0.728 , 4.660)

Therefore, with 95% confidence, the ratio of the variances of the populations of purchasers
and non-purchasers is somewhere between 0.728 and 4.660.
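
A sketch of the same interval in R, from the summary statistics:

ratio = 185.54 / 100.80       # ratio of the sample variances
c(ratio / qf(0.975, 19, 19),  # lower limit: divide by the upper F percentile
  ratio / qf(0.025, 19, 19))  # upper limit: divide by the lower F percentile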

b) Can we conclude at the 5% significance level that the population variances differ? What
do you conclude if the significance level is increased to 10%?

The hypotheses are

H0: σ1² / σ2² = 1 ,  HA: σ1² / σ2² ≠ 1

This is a two-tail test and the significance level is 5%, so we can rely on the 95% confidence
interval developed in part (a). Since that interval includes 1, the hypothesized ratio of the
two population variances, the null hypothesis of equal variances cannot be rejected at the
5% significance level.

The formal hypothesis test is as follows. The upper and lower critical values are the same F
values as in part (a), i.e.

F_{α/2, n1−1, n2−1} = F_{0.025,19,19} = 2.53  and  F_{1−α/2, n1−1, n2−1} = F_{0.975,19,19} = 0.395

The observed test statistic value is

F_obs = s1² / s2² = 185.54 / 100.80 = 1.841

Since it is between the lower and upper critical values, at the 5% significance level we cannot
reject the null hypothesis and hence cannot conclude that the population variances differ.

At the 10% significance level the critical values are

F_{α/2, n1−1, n2−1} = F_{0.05,19,19} = 2.17  and  F_{1−α/2, n1−1, n2−1} = F_{0.95,19,19} = 1 / 2.17 = 0.46

Since the observed test statistic value is still between the lower and upper critical values,
our decision and conclusion do not change.


To complete parts (a) and (b) in R, create a new project and script (t5e5), import the data
from the t4e2 Excel file and execute

attach(t4e2)
library(DescTools)
VarTest(Age ~ Householder)

The printout is shown below. The 95% confidence interval is about (0.729,
4.650), almost the same as we got in part (a). The test statistic is about 1.841, the same as
above. The p-value is 0.1927, so we would fail to reject the null hypothesis even at the 19%
significance level.

F test to compare two variances

data:  Age by Householder
F = 1.8406, num df = 19, denom df = 19, p-value = 0.1927
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.728549 4.650295
sample estimates:
ratio of variances
          1.840643



Exercise 6 (Selvanathan, p. 558, ex. 13.58)

In a public opinion survey, 60 out of a sample of 100 high-income voters and 40 out of a
sample of 75 low-income voters supported the introduction of a new national security tax.
Can we conclude at the 5% level of significance that there is a difference between the
proportions of high- and low-income voters favouring a new national security tax? Do the
calculations both manually and with R.

Let X1 be the population of high-income voters and X2 be the population of low-income
voters, and let p1 and p2 denote the corresponding proportions of voters who are in favour
of a new national security tax. The hypotheses are

H0: p1 − p2 = 0 ,  HA: p1 − p2 ≠ 0

The sample proportions are

p̂1 = 60/100 = 0.6000 ,  p̂2 = 40/75 = 0.5333

Using these sample proportions as estimates of the corresponding population proportions,

n1p̂1 = 60 ,  n1(1 − p̂1) = 40 ,  n2p̂2 = 40 ,  n2(1 − p̂2) = 35

They are all much bigger than 5, so we can rely on the normal approximation and perform
a Z-test.

The critical values are

±z_{α/2} = ±z_{0.025} = ±1.96

and H0 is to be rejected if the calculated test statistic is either smaller than −1.96 or larger
than 1.96.

Under the null hypothesis the hypothesized difference between the two population
proportions is D0 = 0, so we can estimate the common population proportion from the pooled
sample:

p̂ = (f1 + f2) / (n1 + n2) = (60 + 40) / (100 + 75) = 0.5714

The estimate of the standard error is

s_{p̂1−p̂2} = √[ p̂ q̂ (1/n1 + 1/n2) ] = √[ 0.5714 × 0.4286 × (1/100 + 1/75) ] = 0.0756

and the test statistic is

z_obs = (p̂1 − p̂2) / s_{p̂1−p̂2} = (0.6000 − 0.5333) / 0.0756 = 0.8823

Since it is between the lower and upper critical values, we cannot reject the null hypothesis.
Therefore, at the 5% level there is no significant difference between the proportions of high-
and low-income voters favouring a new national security tax.
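
As a quick check of the manual steps in R:

phat = (60 + 40) / (100 + 75)                  # pooled sample proportion
se = sqrt(phat * (1 - phat) * (1/100 + 1/75))  # estimated standard error
zobs = (60/100 - 40/75) / se                   # observed z, about 0.88
2 * pnorm(-abs(zobs))                          # two-tail p-value, about 0.378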


To perform this test in R, create a new RStudio project and script (t5e6), and execute the
following command:1

prop.test(x = c(60,40), n = c(100,75), correct = FALSE)

You should get

2-sample test for equality of proportions without continuity correction

data:  c(60, 40) out of c(100, 75)
X-squared = 0.77778, df = 1, p-value = 0.3778
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.08154755  0.21488088
sample estimates:
   prop 1    prop 2
0.6000000 0.5333333

The p-value is 0.3778, so we maintain the null hypothesis.

The chi-square test statistic is 0.77778 and its square root is about 0.8819, almost the same
as the Z test statistic we obtained manually, 0.8823.

1 The alternative and conf.level arguments can be omitted because this is a two-tail test at the 5% significance
level.
ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 6

Solutions

Exercises for Assessment


Exercise 5

A farmer wants to know if the weight of parsley plants is influenced by using a fertilizer. He
selects 90 plants and randomly divides them into three groups of 30 plants each. He applies
a biological fertilizer to the first group, a chemical fertilizer to the second group and no
fertilizer at all to the third group. After a month he weighs all plants and saves the
measurements in the t6e5 Excel file.

Can we conclude from these data at the 5% significance level that fertilizer affects weight?

a) Obtain the basic descriptive statistics with R and then perform the ANOVA F-test
manually.

This exercise is similar to Exercise 1, so we need to follow the same steps.

library(pastecs)
round(stat.desc(t6e5, basic = FALSE , desc = TRUE, norm = TRUE, p = 0.95),3)

returns the following descriptive statistics:

                 None  Biological  Chemical
median         50.000      53.500    57.500
mean           51.200      53.633    56.967
SE.mean         1.431       1.617     1.434
CI.mean.0.95    2.926       3.307     2.933
var            61.407      78.447    61.689
std.dev         7.836       8.857     7.854
coef.var        0.153       0.165     0.138
skewness        0.488      -0.049    -0.173
skew.2SE        0.571      -0.057    -0.203
kurtosis       -0.695      -0.951    -0.694
kurt.2SE       -0.417      -0.571    -0.417
normtest.W      0.958       0.984     0.977
normtest.p      0.271       0.922     0.741


H0: μ1 = μ2 = μ3 and HA: not all three population means are the same.

k = 3, n = 3 × 30 = 90. The 5% critical value is F_{α, k−1, n−k} = F_{0.05,2,87} ≈ F_{0.05,2,90} = 3.10.
Therefore, H0 is to be rejected if F_obs > 3.10.

The calculations are as follows.

x̄ = (Σ x̄_j) / k = (51.20 + 53.63 + 56.97) / 3 = 53.93

SST = Σ n_j (x̄_j − x̄)² = 30 × [(51.20 − 53.93)² + (53.63 − 53.93)² + (56.97 − 53.93)²] = 503.54

MST = SST / (k − 1) = 503.54 / 2 = 251.77

SSE = Σ (n_j − 1) s_j² = 29 × (7.84² + 8.86² + 7.85²) = 5846.04

MSE = SSE / (n − k) = 5846.04 / 87 = 67.20

F_obs = MST / MSE = 251.77 / 67.20 = 3.747

Since Fobs > 3.10, we reject H0 and conclude at the 5% significance level that fertilizer affects
weight.
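
The critical value and the p-value can be checked in R:

qf(0.95, df1 = 2, df2 = 87)       # 5% critical value, about 3.10
1 - pf(3.747, df1 = 2, df2 = 87)  # p-value, about 0.028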

b) Repeat the ANOVA F-test with R.

You need to execute the following commands

Weight = c(None, Biological, Chemical)
Fertilizer = gl(3, 30, 90, c("None", "Biological", "Chemical"))
summary(aov(Weight ~ Fertilizer))

to obtain

            Df Sum Sq Mean Sq F value Pr(>F)
Fertilizer   2    503  251.43   3.743 0.0276 *
Residuals   87   5845   67.18
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA F-test statistic is 3.743 and its p-value is 0.0276 < 0.05, so at the 5%
significance level we reject H0.

c) What are the required conditions for the tests in parts (a) and (b)? Do they seem to be
satisfied?

The ANOVA F-test assumes that (i) the data set constitutes k independent random samples
of independent observations drawn from k (sub-) populations; (ii) the variable of interest is
quantitative and continuous; (iii) the measurement scale is interval or ratio; (iv) each (sub-)
population is normally distributed and (v) has the same variance.
The first condition is not testable. The variable of interest is the weight of parsley plants, a
quantitative and continuous variable measured on a ratio scale, so the second and third
conditions are satisfied. The normality assumption is supported by the descriptive statistics
and the Shapiro-Wilk tests on the first page.1 As for the last requirement, it can be checked
with the Levene test.

library(car)
leveneTest(Weight ~ Fertilizer)

return

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  2  0.5377  0.586
      87

The test statistic value is 0.5377 and its p-value is 0.586, so we can safely maintain the null
hypothesis of equal variances (i.e. homoskedasticity) at any reasonable significance level.

Consequently, we can rely on the ANOVA F-test.

d) Perform the Welch F-test in R. Does it lead to the same conclusion as the ANOVA F-
test?

oneway.test(Weight ~ Fertilizer)

returns

One-way analysis of means (not assuming equal variances)

data:  weights and fertilizers
F = 4.0328, num df = 2.000, denom df = 57.827, p-value = 0.02293

This time the Welch F-test statistic and p-value are very similar to the ANOVA F-test statistic
and p-value, so the two tests lead to the same conclusion. We can safely conclude at the
5% significance level that fertilizer affects weight, irrespective of the (sub-) population
variances.

e) Perform the Kruskal-Wallis test in R (use  = 0.05). Does it lead to a different conclusion
than the parametric tests in parts (b) and (d)?

kruskal.test(Weight, Fertilizer)

produces the following printout:

Kruskal-Wallis rank sum test

data:  weights and fertilizers
Kruskal-Wallis chi-squared = 7.3443, df = 2, p-value = 0.02542

1 For the sake of brevity, I skip the explanation this time. Please note, however, that this answer would not be
appreciated in the assignment and on the exam. If you are asked to check normality, explain briefly what the
four quick checks suggest.
The test statistic is 7.344. Since each sample size is large enough, we can rely on the
reported p-value based on the chi-square approximation. It is 0.02542 < 0.05, so at the 5%
significance level it leads to the same conclusion as the F-tests in parts (b) and (d).
Namely, fertilizer affects the weight of parsley plants.



Exercise 6 (Selvanathan, p. 644, ex. 15.36)

A randomised block experiment produced the data listed below.


Treatment
Block 1 2 3 4
1 6 5 4 4
2 8 5 5 6
3 7 6 5 6


a) Conduct F-tests at the 5% significance level to find out whether

(i) the treatment means differ;
(ii) the block means differ.

Do the calculations first manually and then in R.

Obtain the required descriptive statistics with your Casio calculator:

x̄_T,1 = 7.000 ,  x̄_T,2 = 5.333 ,  x̄_T,3 = 4.667 ,  x̄_T,4 = 5.333
x̄_B,1 = 4.750 ,  x̄_B,2 = 6.000 ,  x̄_B,3 = 6.000
x̄ = 5.583 ,  s = 1.165


From these statistics,

SS = (n − 1) s² = 11 × 1.165² = 14.929

SST = b Σ (x̄_T,j − x̄)²
    = 3 × [(7.000 − 5.583)² + (5.333 − 5.583)² + (4.667 − 5.583)² + (5.333 − 5.583)²]
    = 3 × 2.972 = 8.916

MST = SST / (k − 1) = 8.916 / 3 = 2.972

SSB = k Σ (x̄_B,i − x̄)²
    = 4 × [(4.750 − 5.583)² + (6.000 − 5.583)² + (6.000 − 5.583)²]
    = 4 × 1.042 = 4.168

MSB = SSB / (b − 1) = 4.168 / 2 = 2.084

SSE = SS − SST − SSB = 14.929 − 8.916 − 4.168 = 1.845

MSE = SSE / (n − k − b + 1) = 1.845 / 6 = 0.307


ANOVA for the treatment means

H0: μT,1 = μT,2 = μT,3 = μT,4 and HA: not all four μT,j are the same

F_T = MST / MSE = 2.972 / 0.307 = 9.681

F_crit = F_{0.05,3,6} = 4.76

Since the observed test statistic value is larger than the critical value, at the 5%
significance level we reject H0. Hence, there is enough evidence to conclude that the
treatment means are not all equal.


ANOVA for the block means

H0: μB,1 = μB,2 = μB,3 and HA: not all three μB,i are the same

F_B = MSB / MSE = 2.084 / 0.307 = 6.788

F_crit = F_{0.05,2,6} = 5.14

Since the observed test statistic value is larger than the critical value, at the 5% significance
level we reject H0 and conclude that the block means are significantly different.


To do these tests in R, you need to import and reshape the data like in Exercise 4 to be able
to use the aov function. The

Y = c(as.matrix(t6e6[-1,-1]))
Treatment = gl(4, 3, 12)
Block = gl(3, 1, 12)
summary(aov(Y ~ Treatment + Block))

commands produce the following printout:




            Df Sum Sq Mean Sq F value Pr(>F)
Treatment    3  8.917  2.9722   9.727 0.0101 *
Block        2  4.167  2.0833   6.818 0.0285 *
Residuals    6  1.833  0.3056
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Because of rounding errors, there are some small differences between the test statistics
calculated manually and obtained in R, but the conclusions are the same.

(b) Conduct a Friedman test at the 5% significance level to determine whether the treatment
medians (central locations) differ. Do the calculations first manually and then in R.

Following the same steps as in Exercise 4, you obtain:


         Treatment 1    Treatment 2    Treatment 3    Treatment 4
Block    Obs   Rank     Obs   Rank     Obs   Rank     Obs   Rank
1         6    4.0       5    3.0       4    1.5       4    1.5
2         8    4.0       5    1.5       5    1.5       6    3.0
3         7    4.0       6    2.5       5    1.0       6    2.5
Tj            12.0            7.0            4.0            7.0

In this case, unlike in Exercise 4, there are ties. The uncorrected test statistic is

F_r = [ 12 / (b k (k + 1)) ] Σ T_j² − 3 b (k + 1)
    = [ 12 / (3 × 4 × 5) ] × (12² + 7² + 4² + 7²) − 3 × 3 × 5 = 6.6

The correction factor is

C = 1 − [ Σ (t_i³ − t_i) ] / [ b (k³ − k) ]
  = 1 − [ (2³ − 2) + (2³ − 2) + (2³ − 2) ] / [ 3 × (4³ − 4) ] = 1 − 18/180 = 0.9

and the corrected test statistic is

F_rc = F_r / C = 6.6 / 0.9 = 7.33

The small-sample critical value is 7.4, a bit larger than the observed value of the test statistic
(7.33), so H0 is to be maintained. Consequently, at the 5% significance level there is not
enough evidence to conclude that the treatment medians are not all equal.
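
A sketch of the manual calculation in R, from the rank sums:

Tj = c(12, 7, 4, 7)  # treatment rank sums
b = 3; k = 4
Fr = 12 / (b * k * (k + 1)) * sum(Tj^2) - 3 * b * (k + 1)  # uncorrected, 6.6
C = 1 - 3 * (2^3 - 2) / (b * (k^3 - k))                    # tie correction, 0.9
Fr / C                                                     # corrected statistic, 7.33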


In R, you need to execute

friedman.test(Y ~ Treatment | Block)

to get

Friedman rank sum test

data:  Y and Treatment and Block
Friedman chi-squared = 7.3333, df = 3, p-value = 0.062

The reported test statistic is 7.3333, practically the same as F_rc, and the p-value is 0.062 >
0.05, so H0 is maintained at the 5% significance level. Note, however, that this p-value has
been derived from the chi-square distribution with df = 3, which is inaccurate because k = 4
and b = 3 are relatively small.

(c) Are the required conditions of the Friedman test valid this time?

The Friedman test assumes that (i) the data is a random sample of b independent blocks of
k number of observations that are not independent of each other (i.e. the experimental
design is a randomised block design), (ii) the variable of interest is quantitative and
continuous, and (iii) the measurement scale is at least ordinal. The first assumption was said
to be satisfied. The second and third assumptions, however, cannot be verified because the
variable of interest is not specified this time.

Also recall that the chi-square approximation to the sampling distribution of the Friedman
test statistic is good enough only if k > 6 and/or b > 24. This time, however, b = 3 and k = 4,
so the conclusion drawn in part (b) cannot be taken at face value. The appropriate 5% ‘small-
sample’ critical value is 7.4. It is smaller than the chi-square critical value, but still larger than
the observed test statistic value, so it does not alter our conclusion.
ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 7

Solutions

Exercises for Assessment


Exercise 6 (Selvanathan et al., p. 678, ex. 16.1)

Consider a multinomial experiment involving n = 300 trials and k = 5 cells. The observed
frequencies in cells 1 to 5 are 24, 64, 84, 72, 56, and the hypotheses to be tested are as
follows:

H0: p1 = 0.1, p2 = 0.2, p3 = 0.3, p4 = 0.2, p5 = 0.2
HA: at least one pi (i = 1, 2, 3, 4, 5) is not equal to its value specified in H0.

Test the null hypothesis at the 1% significance level.

This is an example for the chi-square test of goodness of fit.

The critical value is χ²_{α, k−1} = χ²_{0.01,4} = 13.3 and H0 is to be rejected if the calculated test
statistic value is larger than this critical value.

The expected frequencies are equal to the number of trials times the probabilities under
H0. The details are shown in the following table:


i  pi,0  oi  ei  (oi‐ei)2/ei 
1  0.1000  24  30  1.200 
2  0.2000  64  60  0.267 
3  0.3000  84  90  0.400 
4  0.2000  72  60  2.400 
5  0.2000  56  60  0.267 
Sum  1.000  300  300  4.533 


The expected frequencies are all large enough (i.e. ≥ 5), so the chi-square approximation
is valid.

Since χ²_obs = 4.533 < 13.3, H0 cannot be rejected at the 1% level. Hence, there is not
sufficient evidence to conclude that at least one pi is not equal to its value specified in H0.
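
The critical value and the p-value can be checked in R:

qchisq(0.99, df = 4)       # 1% critical value, about 13.28
1 - pchisq(4.533, df = 4)  # p-value, about 0.339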


To perform this test in R, you just need to execute the following command:

chisq.test(c(24, 64, 84, 72, 56), p = c(0.1, 0.2, 0.3, 0.2, 0.2))

The printout is

Chi-squared test for given probabilities

data:  c(24, 64, 84, 72, 56)
X-squared = 4.5333, df = 4, p-value = 0.3386

The p-value is 0.3386, so H0 cannot be rejected at any reasonable significance level.



Exercise 7

Return to the case study described in Exercise 2. Is it possible to infer at the 1% significance
level that the preference for Australian made grocery products (Aussie) and the impact of
brand name on product choice (Brand) are related to each other? Perform a chi-square test
of independence with R.

The null hypothesis is that Aussie and Brand are independent of each other, while the
alternative hypothesis is that they are related to each other.

Following similar steps to those in part (a) of Exercise 2,

chisq.test(Aussie, Brand)

returns the following printout:

Pearson's Chi-squared test

data:  Aussie and Brand
X-squared = 10.839, df = 4, p-value = 0.02843

There is no warning message, so all expected frequencies are at least five. The (Pearson)
chi-square test statistic value is 10.839 and the corresponding p-value is 0.0284.
Consequently, at the 1% significance level we cannot reject H0: there is insufficient evidence
to conclude that the preference for Australian made grocery products (Aussie) and the
impact of brand name on product choice (Brand) are related to each other.



Exercise 8

A survey was conducted in five countries. The percentages of respondents whose
household members own more than one personal computer, laptop, notebook or iPad are
as follows:


Country         Percentage
Australia       53%
New Zealand     48%
China           38%
Japan           54%
South Korea     49%

Suppose that the survey was based on 500 respondents in each country.

(a) At the 0.05 level of significance, determine whether there is some significant difference
in the proportion of households in these countries who own more than one computer
(personal computer, laptop, notebook or iPad). Do the calculations first manually and
then in R.

This is an example for the application of the chi-square test of homogeneity. The null
hypothesis is that the proportion of households who own more than one computer is the
same in these countries, while the alternative hypothesis is that there are differences.

The number of surveyed households who own more than one computer is 0.53 × 500 = 265
in Australia, 0.48 × 500 = 240 in New Zealand, 0.38 × 500 = 190 in China, 0.54 × 500 = 270
in Japan and 0.49 × 500 = 245 in South Korea. Accordingly, the number of surveyed
households who have at most one computer is 235 in Australia, 260 in New Zealand, 310
in China, 230 in Japan and 255 in South Korea.

These observed frequencies can be summarised in a 2 × 5 contingency table and the
calculations can be performed as in Exercise 3.


oij
Country
Computer Australia NZ China Japan Korea Total
More than one 265 240 190 270 245 1210
None or one 235 260 310 230 255 1290
Total 500 500 500 500 500 2500



eij
Country
Computer Australia NZ China Japan Korea Total
More than one 242.00 242.00 242.00 242.00 242.00 1210.00
None or one 258.00 258.00 258.00 258.00 258.00 1290.00
Total 500.00 500.00 500.00 500.00 500.00 2500.00






oij2/eij
Country
Computer Australia NZ China Japan Korea Total
More than one 290.19 238.02 149.17 301.24 248.04 1226.65
None or one 214.05 262.02 372.48 205.04 252.03 1305.62
Total 504.24 500.03 521.65 506.28 500.07 2532.27


From the table above, the observed test statistic is

χ²_obs = Σ Σ (o_ij − e_ij)² / e_ij = Σ Σ o_ij² / e_ij − n = 2532.27 − 2500 = 32.27

The degrees of freedom is df = (2 − 1)(5 − 1) = 4 and the 5% critical value is χ²_{α,df} =
χ²_{0.05,4} = 9.49. The observed test statistic value is larger than this critical value, so at the 5%
significance level we reject H0 and conclude that there is some significant difference in the
proportion of households in these countries who own more than one computer.


You can either import the five columns of the observed frequencies into R from the t7e8
Excel file or enter them from the keyboard. Either way, you need to create a matrix of
the five columns (I call this matrix Respondents) and perform the test on it, as sketched below.
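
If you enter the data from the keyboard, the five column vectors could be created as follows (a sketch; each vector holds a country's "more than one" and "none or one" counts):

Australia = c(265, 235)
NZ = c(240, 260)
China = c(190, 310)
Japan = c(270, 230)
Korea = c(245, 255)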

Respondents = cbind(Australia, NZ, China, Japan, Korea)
chisq.test(Respondents)

You should get

Pearson's Chi-squared test

data: Respondents
X-squared = 32.273, df = 4, p-value = 1.682e-06

(b) Find the approximate p-value of the test in (a) from the relevant statistical table.

The p-value is equal to the probability that a chi-square random variable with 4 degrees of
freedom takes on a number equal to or larger than the observed test statistic value. From
the chi-square table the largest critical value with df = 4 belongs to α = 0.005 and is 14.9. Since it is
still smaller than $\chi^2_{obs} = 32.27$, the p-value is smaller than 0.005.
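
The exact p-value can also be computed in R with the base pchisq function:

pchisq(32.27, df = 4, lower.tail = FALSE)

which returns about 1.7e-06, in line with the chisq.test printout in part (a).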



Exercise 9

In Exercise 4 you performed a t-test on the Pearson correlation coefficient between Price
and Odometer and concluded at the 5% significance level that there is a significantly
negative linear relationship between them. Later, however, you realised that this test might
be misleading because Price is probably non-normal.

To double check your conclusion, calculate and test the Spearman correlation coefficient
with R.

You need to execute

cor.test(Price, Odometer, alternative = "less",
method = "spearman", exact = FALSE)

to obtain

Spearman's rank correlation rho

data: Price and Odometer
S = 300540, p-value < 2.2e-16
alternative hypothesis: true rho is less than 0
sample estimates:
       rho
-0.8034326

As you can see, the Spearman sample correlation coefficient (-0.8034326) is very similar
to the Pearson sample correlation coefficient (-0.8082646) and it is significantly negative.
Hence, we conclude that there is a significantly negative relationship between Odometer
and Price.

ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 8

Solutions

Exercises for Assessment


Exercise 3

Lotteries have become important sources of revenue for governments. Many people have
criticised lotteries, however, referring to them as a tax on the poor and uneducated. In an
examination of the issue a random sample of 100 adults was asked how much they spend
on lottery tickets as a percentage of the total household income. They were also interviewed
about various socioeconomic variables, like number of years of education, age, number of
children, and personal income (in thousands of dollars). The data are stored in file t8e3.

Obtain and test appropriate correlation coefficients with R to study the following beliefs. Use
 = 0.05.

a) Relatively uneducated people spend a greater proportion of their income on lotteries
than do relatively educated people.

b) Older people spend a greater proportion of their income on lottery tickets than do
younger people.

c) People with more children spend a greater proportion of their income on lotteries than
do people with fewer children.

d) Relatively poor people spend a greater proportion of their income on lotteries than do
relatively rich people.


You learnt about two correlation coefficients, the Pearson correlation coefficient and its
nonparametric counterpart, the Spearman correlation coefficient. The Pearson correlation
coefficient is appropriate when the variables are quantitative and are measured on an
interval or on a ratio scale. The t-test for H0: ρxy = 0, however, is based on the stronger
assumption that both variables, X and Y, are normally distributed. If these requirements are
met, it is better to use the Pearson correlation coefficient, but if not, you should rely on the
nonparametric Spearman correlation coefficient.

The five variables in this example are all quantitative. The actual measurements, however,
are certainly not normally distributed because they are rounded to the nearest integers and
hence, they are discrete. Still, if these variables assume relatively large numbers of integers,
it might be possible to approximate their distributions with normal distributions.

In order to decide whether this is the case, it is worth having a look at the usual descriptive
statistics.
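
These statistics can be generated, for example, with the stat.desc function of the pastecs package (a sketch, assuming the t8e3 data set is attached and its variables are named Lottery, Education, Age, Children and Income):

library(pastecs)
round(stat.desc(cbind(Lottery, Education, Age, Children, Income),
                basic = TRUE, norm = TRUE), 3)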


According to the Minimum and Maximum values, in the samples at hand Children assumes
only 7 different integers, Lottery and Education take on only 14 different integers, while Age
and Income assume 62 and 85 different integers, respectively. Given the small numbers of
different actual values, the distributions of Children, Lottery and Education clearly cannot be
approximated with normal distributions.

As regards the other two variables, the SW test rejects normality at any significance level
for Income and at the 5% significance level for Age.

For these reasons, for any pair of variables the strength of the (linear) relationship is best
measured by the Spearman correlation coefficient.

a) The belief that relatively uneducated people spend a greater proportion of their income
on lotteries than do relatively educated people implies a negative correlation between
Lottery and Education and accordingly the hypotheses are H0: ρs = 0 and HA: ρs < 0.
The appropriate R command (see Exercise 5 of Tutorial 7) is

cor.test(Lottery, Education, method = "spearman",
exact = TRUE, alternative = "less")

which returns



Spearman rho is about -0.603 < 0 and its reported p-value is practically zero.1 Thus, H0
can be rejected at any reasonable significance level implying that there is a negative
correlation between Lottery and Education.

b) The belief that older people spend a greater proportion of their income on lottery tickets
than do younger people implies a positive correlation between Lottery and Age, so H0:
ρs = 0 and HA: ρs > 0.
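
The command mirrors the one in part (a), with the opposite alternative:

cor.test(Lottery, Age, method = "spearman",
         exact = TRUE, alternative = "greater")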


Spearman rho is about 0.141 > 0 and its p-value is about 0.0809. Thus, H0 cannot be
rejected at the 5% level, i.e. Lottery and Age are only insignificantly positively correlated
with each other. Hence, the data do not support this belief.

c) The belief that people with more children spend a greater proportion of their income on
lotteries than do people with fewer children implies a positive correlation between Lottery
and Children, so H0: ρs = 0 and HA: ρs > 0.
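
The corresponding command is

cor.test(Lottery, Children, method = "spearman",
         exact = TRUE, alternative = "greater")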


Spearman rho is about -0.042 < 0 and its p-value is 0.6618, far too large to reject the
null hypothesis of no correlation in favour of the alternative of a positive correlation
between Lottery and Children.


1 This reported p-value is just an approximation because, as R warns us, there are ties. Still, it is so small
(1.617 × 10⁻¹¹) that we have every reason to assume that the exact p-value is also smaller than any reasonable
significance level.
d) The belief that relatively poor people spend a greater proportion of their income on
lotteries than do relatively rich people implies a negative correlation between Lottery
and Income, so H0: ρs = 0 and HA: ρs < 0.
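
This time the alternative is again left-tailed:

cor.test(Lottery, Income, method = "spearman",
         exact = TRUE, alternative = "less")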



Spearman rho is about -0.532 < 0 and its p-value is practically zero. Thus, H0 can be
rejected at the 5% level, or at any reasonable significance level. This means that there
is enough evidence to conclude that the belief is probably correct: there is a negative
correlation between Lottery and Income.




Exercise 4 (Selvanathan et al., p. 765, ex. 17.74)

The head office of a life insurance company believed that regional managers should have
weekly meetings with their salespeople, not only to keep them abreast of current market
trends but also to provide them with important facts and figures that would help them in their
sales. Furthermore, the company felt that these meetings should be used for pep talks. One
of the points the management felt strongly about was the high value of new contact initiation
and follow-up phone calls.

To dramatize the importance of phone calls on prospective clients and (ultimately) on sales,
the company undertook the following small study. Twenty randomly selected life insurance
salespeople were surveyed to determine the number of weekly calls they made and the
number of policy sales they concluded. The data (Calls and Sales) are saved in file t8e4.
Perform the following tasks with R.

a) Do you expect Calls and Sales to be related to each other? If yes, do you expect the
relationship between them to be positive or negative? Which variable is likely to
determine the other?

If management is right and new contact initiation and follow-up phone calls are indeed
useful, then Calls and Sales are certainly related to each other and the relationship
between them is positive. Moreover, everything else held constant, Calls can be
expected to influence Sales, not the other way around.

b) Illustrate the data on a scattergram. What does this plot suggest about the relationship
between the two variables?

Calls (independent variable) is measured on the horizontal axis and Sales (dependent
variable) on the vertical axis.
The

plot(Calls, Sales,
main = "Scatterplot of Sales versus Calls",
col = "green", pch = 19)

command generates the scatterplot on the next page. It shows that the two variables
tend to move in the same direction, so in this sample there is indeed a positive, and
seemingly strong, linear relationship between Calls and Sales, as expected.


c) Find the correlation coefficient between Calls and Sales. What does this coefficient and
the corresponding t-test statistic and p-value tell you about the relationship between the
two variables? Can we rely on this t-test?

Calls and Sales are both quantitative variables so the strength of a linear relationship
between them can be measured by the Pearson correlation coefficient.
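
One way to obtain these results (a sketch; cor.test defaults to the Pearson method, and the right-tail alternative matches our expectation):

cor.test(Calls, Sales, alternative = "greater")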


The Pearson correlation coefficient is about 0.955, so in this sample there is indeed a
strong linear relationship between Calls and Sales. The corresponding t-statistic is
13.615 > 0 and the p-value for a right-tail test is zero, thus H0: ρxy = 0 can be safely
rejected in favour of HA: ρxy > 0. Hence, there is overwhelming evidence to infer that there
is a positive linear relationship between Calls and Sales.
This t-test for the Pearson population correlation coefficient assumes that both sampled
populations are normally distributed. A quick look at the descriptive statistics and the
SW test results (see next page) suggests that, despite the limited sample size, there is
no reason to question normality.



d) Find the least squares regression line that expresses the number of Sales as a function
of the number of Calls.

The

summary(lm(Sales ~ Calls))

R command generates the following printout:



e) What do the coefficients tell you?

From the Estimate column the point estimates of the intercept and slope parameters are

$$\hat{\beta}_0 = -2.059\,, \qquad \hat{\beta}_1 = 0.345$$

The y-intercept estimate is negative. Since it would be the expected number of sales for a
salesperson who makes no calls, and a negative number of sales is impossible, this point
estimate is clearly meaningless.

The slope estimate tells us that with every additional weekly call the number of policies
sold is expected to increase by 0.345.

f) What proportion of the variability in the number of sales can be attributed to the variability
in the number of calls?

The answer to this question is provided by the coefficient of determination, R2 (Multiple
R-squared on the printout). It is about 0.911, meaning that about 91% of the total sample
variation in Sales can be attributed to the variation in Calls, and thus can be explained
by this simple linear regression model.

g) Is there enough evidence (with  = 0.05) to indicate that the larger the number of calls,
the larger the number of sales?

The question implies the following hypotheses about the slope coefficient:

$$H_0: \beta_1 = 0 \quad \text{vs.} \quad H_A: \beta_1 > 0$$

and the hypothetical parameter value is zero.

The observed t-statistic is 13.616. It is positive, as implied by the alternative hypothesis,
and its p-value, half of Pr(> | t |), is zero. Hence, the slope estimate is significantly
positive, practically at any level, and more calls can be expected to generate more sales.
ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 9

Solutions

Exercises for Assessment


Exercise 3 (Selvanathan et al., p. 827, Case 18.4)

A leader of the Workers Union in New Zealand would like to study the movement in the
average hourly earnings of New Zealand workers. He collected and recorded data on
average earnings (AE, $), labour cost1 (LC, $) and rate of inflation (RI, %). His data are
saved in the t9e3 file.

a) Set up a suitable regression model to investigate the impact of labour cost and rate of
inflation on the hourly earnings of an average New Zealand worker.

Given the objective of the research project, the population regression model is

$$AE_i = \beta_0 + \beta_1 LC_i + \beta_2 RI_i + \varepsilon_i$$

b) Do you expect the slope parameters to be positive or negative?

Labour cost and inflation are likely to put an upward pressure on average earnings, so
both slope parameters are expected to be positive.

c) Estimate the regression model.
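
A minimal sketch, assuming the variables in the t9e3 file are named AE, LC and RI:

m = lm(AE ~ LC + RI)
summary(m)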



1 Labour cost is the sum of all wages paid to employees, as well as the cost of employee benefits and payroll
taxes paid by an employer.
d) Do the estimated slope coefficients have the logical signs? Carefully explain the
meanings of the estimated slope coefficients.

The slope estimate of LC is positive, as expected. However, the slope estimate of RI is
negative and this does not seem to be reasonable.

The slope estimates suggest that

(i) Given the rate of inflation, every additional dollar of labour cost is expected to raise the
average hourly earnings by about 2.7 cents;
(ii) Given the labour cost, a 1 percentage point increase in the inflation rate is expected
to bring down the average hourly earnings by about 1.7 cents.

e) What do the unadjusted and the adjusted coefficients of determination tell you about the
quality of the fit?

In this case there is hardly any difference between R² and Adj. R². This is because

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$$

and

$$\frac{n-1}{n-k-1} = \frac{67-1}{67-2-1} = 1.03125$$

is very close to one, so $1 - \bar{R}^2 \approx 1 - R^2$ and hence $\bar{R}^2 \approx R^2$.

Both statistics are almost one, implying that this regression model can explain almost
all the variations in the average hourly earnings.


Suppose now that you have not estimated the regression model yourself but received a
hard copy of the R printout from a friend. However, your friend’s printer was running out
of ink and some details are not visible on your copy, which is shown on the top of the
next page.

Complete the remaining tasks using only this incomplete printout and the relevant
statistical tables.


f) Try to recover the missing y-intercept estimate.

The t-statistic is the point estimate divided by the standard error, so

$$\hat{\beta}_0 = t \times s_{\hat{\beta}_0} = -101.451 \times 0.089571 \approx -9.087$$




g) What are the missing t-Statistic and Prob. value for RI?

$$t = \frac{\hat{\beta}_2 - 0}{s_{\hat{\beta}_2}} = \frac{-0.017015}{0.020622} = -0.82509$$

On the R regression printout the Pr(> | t |) value of a t-test is the p-value for a two-tail
test with zero hypothesized parameter value, i.e. twice the probability that the t-test
statistic assumes a value that is at least as extreme as the observed test-statistic value.
Therefore,

$$2P(t_{df=64} \leq -0.82509) = 2P(t_{df=64} \geq 0.82509) \approx 2P(t_{df=65} \geq 0.82509)$$
$$> 2P(t_{df=65} \geq 1.295) = 2 \times 0.10 = 0.20$$

Note that the inequality was based on the table value $t_{0.10,65} = 1.295$; accordingly, the p-value is greater than 0.20.
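
In R the exact p-value could be obtained with the base pt function:

2 * pt(-0.82509, df = 64)

which is about 0.41, indeed greater than 0.20.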

h) Perform the F-test of overall significance at the 0.005 level. State the null and alternative
hypotheses, show the calculation of the test statistic, make a statistical decision based
on the critical value approach and on the p-value approach, respectively, and draw your
conclusion.

The hypotheses are

$$H_0: \beta_1 = \beta_2 = 0 \quad \text{vs.} \quad H_A: \beta_1 \neq 0 \text{ or } \beta_2 \neq 0 \text{ (or both)}$$

or, equivalently,

$$H_0: R^2 = 0 \quad \text{vs.} \quad H_A: R^2 > 0$$

The critical value is

$$F_{\alpha,\,k,\,n-k-1} = F_{0.005,\,2,\,64} \approx F_{0.005,\,2,\,60} = 5.79$$

and H0 is to be rejected if the observed test statistic value is larger than this critical value.

From the formula of the F-test statistic based on R²,

$$F_{obs} = \frac{R^2/k}{(1-R^2)/(n-k-1)} = \frac{0.999346/2}{(1-0.999346)/64} = 48897.66$$

Since the observed value of the test statistic is far bigger than the critical value, at the
0.005 level we reject H0 and conclude that the model is useful because either LC or RI or
both have some significant effect on AE.

From the F-table, we cannot obtain the p-value of this test, but it is certainly smaller than
0.005 because

$$P(F_{2,64} \geq 48897.66) \approx P(F_{2,70} \geq 48897.66) < P(F_{2,70} \geq 5.72) = 0.005$$
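
Again, base R could provide the exact p-value, which is practically zero:

pf(48897.66, df1 = 2, df2 = 64, lower.tail = FALSE)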



Exercise 4

In part (e) of Exercise 2 you performed a general F-test with R on the following hypotheses:

$$H_0: \beta_2 = 1.8,\ \beta_3 = 3.2 \quad \text{vs.} \quad H_A: \beta_2 \neq 1.8 \text{ or } \beta_3 \neq 3.2 \text{ (or both)}$$

a) Derive and estimate the restricted regression implied by the null hypothesis.

By plugging the restrictions into the model, we obtain

$$time = \beta_0 + \beta_1\, depart + 1.8\, reds + 3.2\, trains + \varepsilon$$

so the restricted model is

$$time - 1.8\, reds - 3.2\, trains = \beta_0 + \beta_1\, depart + \varepsilon$$

The corresponding R regression printout is on the next page.
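
One way to estimate this restricted regression is a sketch like the following, where the I() wrapper builds the transformed dependent variable inside lm:

summary(lm(I(time - 1.8*reds - 3.2*trains) ~ depart))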

b) Using the sum of squares for errors from the unrestricted and restricted regressions
perform the general F-test manually at the 5% significance level. Did you manage to get
the same results as in part (e) of Exercise 2? Would it be possible to calculate the test
statistic from the coefficients of determination of the unrestricted and restricted
regressions?

The null hypothesis comprises two linear restrictions, β2 = 1.8 and β3 = 3.2, so m = 2.
From the restricted regression the sum of squares due to error is SSEr = 3952.042, so
the observed general F-test statistic is

$$F_{obs} = \frac{(SSE_r - SSE_u)/m}{SSE_u/(n-k-1)} = \frac{(3952.04 - 3729.87)/2}{3729.87/(231-3-1)} \approx 6.76$$

The critical value is $F_{\alpha,df_1,df_2} = F_{0.05,2,227} \approx F_{0.05,2,\infty} = 3.00$. It is smaller than the observed
test statistic value, so at the 5% significance level we reject H0. The test statistic and the
statistical decision are the same as in part (e) of Exercise 2.

This time the test statistic could not be calculated from the coefficients of determination
because the unrestricted and the restricted regressions have different dependent
variables.



c) Using a 5% significance level, test the null hypothesis that Bill’s expected delay from a
train at the Murrumbeena level crossing is 3.5 minutes and the delay from a train at the
Murrumbeena level crossing is double that from a red light. Perform the test with R only.

The hypotheses are

$$H_0: \beta_3 = 3.5,\ \beta_3 = 2\beta_2 \quad \text{vs.} \quad H_A: \beta_3 \neq 3.5 \text{ or } \beta_3 \neq 2\beta_2 \text{ (or both)}$$

Use the linearHypothesis function of the car package and specify the null hypothesis as
c("trains = 3.5", "trains = 2*reds"). The relevant printout is on the next page.

The F-test statistic is 7.6036 and its p-value is 0.0006. Hence, at the 5% significance
level we reject H0 and conclude that the expected delay from a train at the Murrumbeena
level crossing is either different from 3.5 minutes, or is not double that from a red light,
or both.


ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 10

Solutions

Exercises for Assessment


Exercise 4 (Gujarati, pp. 370-374)

This exercise is based on a study published by James W. Longley in 1967 about the
computational accuracy of least-squares estimates in several computer programs.1 This
study is clearly outdated by now, but the Longley data has become the workhorse to illustrate
several econometric problems, including multicollinearity. This data set is saved in the t10e4
Excel file. It contains U.S. time series data for the years 1947–1962 on the following seven
variables.

emp = number of people employed, in thousands;
def = GNP implicit price deflator;
gnp = GNP, millions of dollars;
unemp = number of people unemployed, in thousands;
arm = number of people in the armed forces;
pop = noninstitutional population over 14 years of age2; and
year = year, equal to 1 in 1947, 2 in 1948, …, and 16 in 1962.


Assume that our objective is to model emp on the basis of the other six variables. Although in
practice after having estimated a regression model we should always assess and interpret
the results, this time we skip some of the usual steps and focus on multicollinearity.

a) Using R, estimate a multiple linear regression model.

The

m = lm(emp ~ def + gnp + unemp + arm + pop + year)
summary(m)

commands generate the printout on the top of the next page.




1 Longley, J.W. (1967): An appraisal of least-squares programs from the point of view of the user. Journal of
the American Statistical Association, vol. 62, pp. 819–841.
2 In the United States, the civilian noninstitutional population refers to people residing in the 50 States and the
District of Columbia who are not inmates of institutions (penal, mental facilities, homes for the aged), and who
are not on active duty in the Armed Forces. (Wikipedia).


b) Apply the three simple indicators or rules of thumb that can be used to detect imperfect
multicollinearity. Based on them, does multicollinearity seem to be a problem in this
regression model? Explain your opinion.

i. R2 is very high (0.9955), but three of the six independent variables (def, gnp and
pop) are statistically insignificant.3 This is a classic symptom of multicollinearity.

ii. Execute

library(Hmisc)
rcorr(as.matrix(t10e4), type = "pearson")

to obtain the correlation4 matrix for the variables in the model and the
corresponding p-values. According to the results displayed on the next page some
of the independent variables are strongly correlated with each other (|r | > 0.8),
namely gnp and def, pop and def, pop and gnp, year and def, year and gnp, and
year and pop. This suggests that there may be a severe multicollinearity problem.

iii. To obtain the Variance Inflation Factors, execute

library(car)
round(vif(m), 4)

3 As for the other three independent variables, at any reasonable significance level, the slopes of unemp and
arm are significantly negative, and the slope of year is significantly positive. The negative slopes of unemp and
arm make sense as, ceteris paribus, more unemployed and more people in the armed forces are expected to
decrease the number of people employed. The positive slope of year is also reasonable because this time
variable can be considered as a proxy for some omitted variables whose combined effect on the number of
people employed increases every year.
4 All variables in the model are quantitative, so we use the Pearson correlation coefficient.



The VIF values are



Clearly, every independent variable except arm has an extremely high (i.e. much larger
than 5) VIF statistic, suggesting that indeed the Longley data are plagued by the
multicollinearity problem.

All things considered, multicollinearity appears to be severe this time.

c) You learnt in the lectures that the problem of severe multicollinearity in general might
be mitigated by increasing the sample size, or transforming some of the multicollinear
variables, or dropping all but one of the multicollinear variables. The first option is not
available for us, but as for the second and third, one might argue as follows.

(i) Because of inflation, nominal GNP (gnp) and the GNP implicit price deflator (def)
are likely strongly correlated, so instead of these variables it might be better to use
real GNP, which is nominal GNP divided by the implicit price deflator, i.e. rgnp = gnp
/ def.

(ii) Noninstitutional population over 14 years of age tends to increase in time, so pop
and year are also likely strongly correlated with each other. A possible solution is to
keep pop in the model but drop year.

(iii) The number of unemployed (unemp) and noninstitutional population over 14 years
of age (pop) can be also strongly correlated with each other, so it might be a good
idea to keep pop but drop unemp.

Incorporate these changes in the model and estimate the new model.
The new regression can be estimated with the

rgnp = gnp/def
m2 = lm(emp ~ rgnp + arm + pop)
summary(m2)

R commands. The new regression printout is below.


d) Apply the three simple indicators or rules of thumb for the detection of imperfect
multicollinearity on the regression you estimated in part (c). Does multicollinearity seem
to be a problem in this new regression model? Explain your opinion.

i. R2 is still very high (0.9814) and, at the 5% level, the first slope is significantly
positive and the other two slopes are significantly negative. Hence, multicollinearity
does not appear to be as severe in the new model as in the original model.
However, the sign of the third slope estimate does not seem to be logical5, so even
if we managed to mitigate multicollinearity, the new model is still not ideal.

ii. The new correlation matrix obtained by executing

rcorr(cbind(emp, rgnp, arm, pop), type = "pearson")

is on the next page. As pop and rgnp are strongly correlated with each other,
multicollinearity still appears to be an issue.

iii. The Variance Inflation Factors generated by the

round(vif(m2), 4)


5 Ceteris paribus, higher real GNP and less people in the armed forces are expected to be accompanied with
higher employment, so the signs of the first and second slopes are logical. The negative sign of the third slope,
however, is surprising – why would employment decrease when population increases?
command are also on the next page.


VIF:


One VIF value is smaller than 5, but the other two are very large, so the new model
is also plagued by the multicollinearity problem.

Hence, multicollinearity appears to be severe in the new model as well.



Exercise 5 (Selvanathan et al., p. 825, ex. 18.48)

The Director of the Department of Education in Queensland was analysing last year's
average mathematics test scores in the schools under his control. He noticed that there
were dramatic differences in scores among the schools. In an attempt to improve the scores
of all the schools, he attempted to determine the factors that account for the differences.
Accordingly, he took a random sample of 40 schools across the state and, for each,
determined the mean mathematics test score, the percentage of teachers in each school
who have at least one university degree in mathematics (math), the mean age, and the
mean annual income ($ ‘000) of the mathematics teachers. These data are saved in the
t10e5 Excel file.

a) Perform a multiple regression analysis on these data with R. What is your sample
regression equation?

The

m = lm(score ~ math + age + income)
summary(m)

commands generate the regression printout displayed on the top of the next page.


From the printout the sample regression equation is

$$\widehat{score}_i = 35.7 + 0.247\, math_i + 0.245\, age_i + 0.133\, income_i$$

b) Is the model useful in explaining the variation among schools? Explain.

To answer this question, one needs to evaluate the F-test of overall significance and to
interpret the adjusted coefficient of determination.

As regards the F-test of overall significance, the null hypothesis is that none of the
independent variables helps to explain the dependent variable, i.e. every slope parameter is
zero, while the alternative hypothesis is that at least one independent variable is important
and thus its slope parameter is different from zero. In symbols,

$$H_0: \beta_1 = \beta_2 = \beta_3 = 0\,, \quad H_A: \text{at least one } \beta_i \neq 0$$

The test statistic is Fobs = 6.663 and the corresponding p-value is about 0.001, so the null
hypothesis can be rejected even at the 0.5% significance level. Therefore, we conclude that
the model is useful as at least one independent variable has a significant effect on the mean
test score.
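
This p-value can be double-checked with base R (a quick sketch; with n = 40 and k = 3 the denominator degrees of freedom are 36):

pf(6.663, df1 = 3, df2 = 36, lower.tail = FALSE)

which returns about 0.001.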

The adjusted coefficient of determination is 0.303. It means that after having taken the
sample size and the number of independent variables into consideration, about 30% of the
total sample variation of the mean test scores can be accounted for by the variations in the
three independent variables, math, age and income.

c) Are the normality and homoskedasticity conditions satisfied? Explain.

Executing

olsres = residuals(m)
hist(olsres)
qqnorm(olsres, pch = 1)
qqline(olsres)

library(pastecs)
stat.desc(olsres, basic = FALSE, norm = TRUE)
shapiro.test(olsres)

yhat = fitted.values(m)
plot(yhat, olsres,
main = "OLS residuals versus yhat",
col = "red", pch = 19, cex = 0.75)

library(lmtest)
bptest(m, ~ math + age + income +
I(math^2) + I(age^2) + I(income^2) +
I(math * age) + I(math * income) + I(age * income))


you can generate the following graphs and printouts:








[Figures: histogram of olsres, normal Q-Q plot of the OLS residuals (sample vs. theoretical
quantiles), and scatterplot of the OLS residuals versus yhat]



The histogram seems to have a longer left tail than right tail, the mean is smaller than the
median and skewness is negative. These indicate that the distribution of the residuals is
skewed to the left and thus, not being symmetrical, it is not normally distributed. However,
skewness is close to zero, excess kurtosis is close to zero, skew.2SE and kurt.2SE are both
very small in absolute value, and the SW test is insignificant. Hence, the random error
variables, εi, might be normally distributed.

As for homoskedasticity, the residual plot does not reveal any discernible pattern that would
suggest heteroskedasticity and the White test has a large p-value (0.3845), implying that
the null hypothesis of homoskedasticity is maintained at any reasonable significance level.
Hence, there is no reason to doubt the validity of the homoskedasticity assumption.

d) Is multicollinearity a problem? Explain.

Two of the three independent variables, age and income, are clearly insignificant (their p-
values are 0.1945 and 0.3889, respectively), and the coefficient of determination is relatively
small (0.357). Hence, there is no contradiction between the overall quality of the model
(poor) and the individual significance/insignificance of the independent variables. This
implies that imperfect multicollinearity is unlikely to be severe.

The

library(Hmisc)
rcorr(as.matrix(t10e5), type = "pearson")
library(car)
round(vif(m), 4)

commands return the following results:





The strongest correlation is between age and income (r = 0.57), i.e. between two
independent variables. Still, even this correlation is only moderately strong, so
multicollinearity might not be very severe.

The three VIF values are all smaller than 1.5, far below the threshold value of 5.

All things considered, multicollinearity does not appear to be severe in this model.

e) Interpret the coefficients. Do you find their signs reasonable? Why or why not?
Based on your expectations perform t-tests on the coefficients (use α = 0.05).

In this case the y-intercept does not have a meaningful interpretation as neither age nor
income of mathematics teachers can be zero in real life.

The first slope estimate suggests that, keeping age and income constant, a percentage point
increase in the proportion of teachers who have at least one university degree in
mathematics increases the mean mathematics test score by about 0.247.

The second slope estimate suggests that, keeping math and income constant, when the
mean age of the mathematics teachers increases by one year, the mean mathematics test
score increases by about 0.245.

The third slope estimate suggests that, keeping math and age constant, when the mean
annual income of the mathematics teachers increases by $1000, the mean mathematics
test score increases by about 0.133.

All three slope estimates are positive. One might argue that this makes sense, as better
qualified (math), more experienced (age) and better paid (income) mathematics teachers
are probably doing better jobs.

Based on this argument, we perform right-tail t-tests on the slopes with

$$H_0: \beta_i = 0\,, \quad H_A: \beta_i > 0 \quad (i = 1, 2, 3)$$

The p-value, i.e. half of Pr(> | t |), is about 0.0005 < 0.05 for math, but it is 0.0973 > 0.05
for age and 0.1944 > 0.05 for income. Hence, at the 5% level, math has a significantly
positive effect on score, but the individual effects of age and income on score are only
insignificantly positive.

f) Test the null hypothesis that neither the teachers' mean age nor their mean annual
income has a significant effect on the average mathematics test scores (use α = 0.05).

This question requires performing a general F-test with

$$H_0: \beta_2 = \beta_3 = 0\,, \quad H_A: \beta_2 \neq 0 \text{ or } \beta_3 \neq 0 \text{ (or both)}$$

Execute

linearHypothesis(model = m, c("age = 0", "income = 0"))

to obtain the following printout:


The F-statistic is 2.8093 and its p-value is 0.0735, so at the 5% significance level we
maintain H0 and conclude that age and income are jointly insignificant.6







6 LK: In part (c) we saw that the random error variables might be normally distributed, so the F test is
appropriate.
ECON20003 – QUANTITATIVE METHODS 2

TUTORIAL 11

Solutions

Exercises for Assessment


Exercise 5 (Selvanathan et al., p. 846, ex. 19.7)

Create and identify indicator variables to represent the following nominal variables.

a) Religious affiliation (Catholic, Protestant and other).

This nominal/qualitative variable has three possible values/categories, which can be
represented by two indicator/dummy variables. There are three equivalent options.

(i) Dc = 1 for Catholic and 0 otherwise (i.e. non-Catholic)
Dp = 1 for Protestant and 0 otherwise (i.e. non-Protestant)

In this case Dc = 1 and Dp = 0 imply Catholic, Dc = 0 and Dp = 1 imply
Protestant, and Dc = 0 and Dp = 0 imply other (i.e. neither Catholic nor
Protestant).1

(ii) Dc = 1 for Catholic and 0 otherwise (i.e. non-Catholic)
Do = 1 for other religious affiliation (i.e. neither Catholic nor Protestant) and
0 otherwise (i.e. either Catholic or Protestant)

In this case Dc = 1 and Do = 0 imply Catholic, Dc = 0 and Do = 0 imply
Protestant, and Dc = 0 and Do = 1 imply other religious affiliation (i.e. neither
Catholic nor Protestant).

(iii) Dp = 1 for Protestant and 0 otherwise (i.e. non-Protestant)
Do = 1 for other religious affiliation (i.e. neither Catholic nor Protestant) and
0 otherwise (i.e. either Catholic or Protestant)

In this case Dp = 1 and Do = 0 imply Protestant, Dp = 0 and Do = 0 imply
Catholic, and Dp = 0 and Do = 1 imply other religious affiliation (i.e. neither
Catholic nor Protestant).




1 Note that Dc = 1 and Dp = 1 would mean Catholic and Protestant, which does not make sense. Similarly,
in the two other options the two dummy variables cannot be equal to one simultaneously.
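
In R such indicator variables could be created with ifelse. A minimal sketch, assuming a character vector religion holding the values "Catholic", "Protestant" and "Other":

Dc = ifelse(religion == "Catholic", 1, 0)
Dp = ifelse(religion == "Protestant", 1, 0)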
b) Working shift (9 a.m.–5 p.m., 5 p.m.–1 a.m., and 1 a.m.–9 a.m.).

Again, there are three possible categories (shift 1, shift 2, shift 3), which can be
represented by two dummy variables. There are three options.

(i) D1 = 1 for shift 1 and 0 otherwise (i.e. shift 2 or 3)
D2 = 1 for shift 2 and 0 otherwise (i.e. shift 1 or 3)

In this case D1 = 1 and D2 = 0 imply shift 1, D1 = 0 and D2 = 1 imply shift 2,
and D1 = 0 and D2 = 0 imply shift 3 (i.e. neither shift 1 nor shift 2).

(ii) D1 = 1 for shift 1 and 0 otherwise (i.e. shift 2 or 3)
D3 = 1 for shift 3 and 0 otherwise (i.e. shift 1 or 2)

In this case D1 = 1 and D3 = 0 imply shift 1, D1 = 0 and D3 = 1 imply shift 3,
and D1 = 0 and D3 = 0 imply shift 2 (i.e. neither shift 1 nor shift 3).

(iii) D2 = 1 for shift 2 and 0 otherwise (i.e. shift 1 or 3)
D3 = 1 for shift 3 and 0 otherwise (i.e. shift 1 or 2)

In this case D2 = 1 and D3 = 0 imply shift 2, D2 = 0 and D3 = 1 imply shift 3,
and D2 = 0 and D3 = 0 imply shift 1 (i.e. neither shift 2 nor shift 3).

c) Supervisor (David Jones, Mary Brown, Rex Ralph and Kathy Smith).

This nominal /qualitative variable has four possible values/categories, which can be
represented by three dummy variables. There are four options. For example, defining the
‘base’ category as Kathy Smith:

DDJ = 1 for David Jones and 0 otherwise (Mary Brown or Rex Ralph or Kathy
Smith)
DMB = 1 for Mary Brown and 0 otherwise (i.e. David Jones or Rex Ralph or
Kathy Smith)
DRR = 1 for Rex Ralph and 0 otherwise (i.e. David Jones or Mary Brown or
Kathy Smith)

In this case DDJ = 1, DMB = 0 and DRR = 0 imply David Jones, DDJ = 0, DMB
= 1 and DRR = 0 imply Mary Brown, DDJ = 0, DMB = 0 and DRR = 1 imply Rex
Ralph, and DDJ = 0, DMB = 0 and DRR = 0 imply Kathy Smith.


Exercise 6 (Selvanathan et al., p. 846, ex. 19.9)

The director of a graduate school of business wanted to find a better way of deciding
which students should be accepted into the MBA program. Currently, the records of the
applicants are examined by the admissions committee, which looks at the undergraduate
grade point average (UGPA) and the MBA admission score (MBAA). The director
believed that the type of undergraduate degree also influenced the student’s MBA grade
point average (MBAGPA).

The most common undergraduate degrees of students attending the graduate school of
business are BCom, BEng, BSc and BA. Because the type of degree is a qualitative
variable, the following three dummy variables were created:

D1 = 1 if the degree is BCom and 0 if the degree is not BCom
D2 = 1 if the degree is BEng and 0 if the degree is not BEng
D3 = 1 if the degree is BSc and 0 if the degree is not BSc.

The director took a random sample of 100 students who entered the program two years
ago, and recorded for each student the MBAGPA, UGPA and MBAA scores and the
values of the D1, D2, D3 dummy variables. These data are saved in the t11e6 Excel file.

a) Using these data, estimate the following model

$$MBAGPA = \beta_0 + \beta_1 UGPA + \beta_2 MBAA + \beta_3 D_1 + \beta_4 D_2 + \beta_5 D_3 + \varepsilon$$

Does the model seem to perform satisfactorily? How do you interpret the slope
coefficients?

In this model there are five independent variables. UGPA and MBAA are quantitative
variables, while D1, D2 and D3 are dummy variables.2 The three dummy variables are
used to represent the type of undergraduate degree, which is a qualitative variable. They
are sufficient to distinguish the four undergraduate degrees, since for BCom D1 = 1, D2 =
0 and D3 = 0, for BEng D1 = 0, D2 = 1 and D3 = 0, for BSc D1 = 0, D2 = 0 and D3 = 1, and
for BA D1 = 0, D2 = 0 and D3 = 0.

Apart from the fact that some of the independent variables are dummy variables, this
regression model can be estimated with R the same way as any multiple regression
model. Hence, launch RStudio, create a new project and script, name them t11e6, import
the data from the t11e6 Excel file and execute the following commands

attach(t11e6)
m = lm(MBAGPA ~ UGPA + MBAA + D1 + D2 + D3)
summary(m)

to obtain


2 In R we cannot use subscripts, so we are going to denote these dummy variables as D1, D2 and D3.

The adjusted coefficient of determination suggests that, taking the sample size and the
number of independent variables into consideration, this model can account for about
45% of the total sample variation of the MBA grade point average. This means that the
model does not fit to the data extremely well. Yet, the overall F-test rejects the null
hypothesis that all slope parameters are zero (p-value = 0), so the model is significant
overall.

The slope estimates of UGPA and MBAA are positive. This is acceptable since they imply
that a student’s expected MBA grade point average is an increasing function of her/his
undergraduate grade point average and MBA admission score. The actual values of these
slope estimates mean that, keeping the other independent variables, including the
dummy variables, constant,

i) if the undergraduate grade point average increases by one, the MBA grade point
average is expected to go up by 0.313, and
ii) by every additional MBA admission score the MBA grade point average is
expected to go up by 0.009.

The slope estimates of the D1, D2, D3 intercept dummy variables are also positive. Recall
that the three dummy variables represent BCom, BEng, and BSc, respectively, so BA is
the base category. Therefore, the slope estimates of the dummy variables indicate that,
keeping all other independent variables in the model constant, compared to the MBA
grade point average (MBAGPA) of a student with a BA first degree,

iii) the MBAGPA of a student with a BCom first degree is expected to be 0.922
higher,
iv) the MBAGPA of a student with a BEng first degree is expected to be 1.501 higher,
and
v) the MBAGPA of a student with a BSc first degree is expected to be 0.620 higher.



b) Test to determine whether individually each of the independent variables is linearly
related to MBAGPA.

The question implies the following hypotheses:

$$H_0: \beta_i = 0\,, \quad H_A: \beta_i \neq 0 \quad (i = 1, \dots, 5)$$

The p-values (Pr(> | t |)) of the first four slope coefficients are practically zero, and the fifth
one is about 0.021. Therefore, at the 2.5% level each independent variable has a
significant linear relationship with MBAGPA.

c) Is every slope estimate significantly positive?

The question implies the following hypotheses:

$$H_0: \beta_i = 0\,, \quad H_A: \beta_i > 0 \quad (i = 1, \dots, 5)$$

Since every slope estimate is positive and the p-value for a one-tail t-test is half of the
reported Pr(> | t |) value, we can reject every null hypothesis at the 1.1% or higher level
and conclude that each slope is significantly positive.

d) Can we conclude that, on average, a BCom graduate performs better than a BA
graduate?

Given the three dummy variables, BA is the base category. If BCom graduates tend to
perform better than BA graduates, then the coefficient of the BCom dummy variable (i.e.
D1) should be significantly positive. As we saw in part (c), it is significantly positive, so we
can conclude that on average BCom graduates outperform BA graduates.

e) Predict the MBAGPA of a BEng graduate with 3.0 undergraduate GPA and 700
MBAA score, first manually and then with R.

For a BEng graduate D1 = 0, D2 = 1, D3 = 0, and given an undergraduate GPA score of 3.0
and an MBAA score of 700, the predicted MBAGPA mark is

$$\hat{y} = -0.437 + 0.313 \times 3 + 0.009 \times 700 + 0.922 \times 0 + 1.501 \times 1 + 0.620 \times 0 = 8.303$$

To double-check this prediction, execute the following R commands:

newdata1 = data.frame(UGPA = 3, MBAA = 700, D1 = 0, D2 = 1, D3 = 0)
predict(m, newdata1, interval = "prediction")
predict(m, newdata1, interval = "confidence")

You get the following printouts:


and

As you can see, the point prediction we calculated manually and the one reported by R
(i.e. fit) are slightly different. This is because, unlike R, we used only 3 decimals in the
calculation.

R also reports the 95% prediction and confidence intervals. The first implies that, with
95% confidence, the MBAGPA of a BEng graduate with 3.0 undergraduate GPA and 700
MBAA score is between 6.541 and 10.543, while the second implies that, with 95%
confidence, the average MBAGPA of all BEng graduates with 3.0 undergraduate GPA
and 700 MBAA score is between 7.500 and 9.584.

f) Repeat part (e) for a BA graduate with the same undergraduate GPA and MBAA
score.

For a BA graduate D1 = 0, D2 = 0, D3 = 0, so with the same undergraduate GPA and MBAA
scores as in part (e), the predicted MBAGPA mark is

$$\hat{y} = -0.437 + 0.313 \times 3 + 0.009 \times 700 + 0.922 \times 0 + 1.501 \times 0 + 0.620 \times 0 = 6.802$$

Execute the following R commands:

newdata2 = data.frame(UGPA = 3, MBAA = 700, D1 = 0, D2 = 0, D3 = 0)
predict(m, newdata2, interval = "prediction")
predict(m, newdata2, interval = "confidence")

You get the following printouts:


and


Hence, with 95% confidence, the MBAGPA of a BA graduate with 3.0 undergraduate
GPA and 700 MBAA score is between 5.083 and 8.999, and the average MBAGPA of all
BA graduates with 3.0 undergraduate GPA and 700 MBAA score is between 6.084 and
7.998.


