COMM5000 DATA LITERACY
Seminar 1 Week 2
Students must do their best to attempt all problems before attending the workshop.
You should come prepared to participate and engage in discussions with your peers to
solve these problems and to raise questions about the meaning and understanding of
the concepts covered in the prior week. To prepare for a workshop:
• Read the week’s scheduled content on Moodle
• Complement your understanding by referring to the assigned readings from the
reference eBook
• Take note of any concept that is still unclear to you and discuss it with
your LIC during the synchronous lecture or consultation hours
• Attempt the workshop problems and bring your questions to the session to share
and discuss with your peers and your tutor
Seminar questions
1. (a) What is meant by a variable in a statistical sense? Distinguish between
qualitative and quantitative statistical variables, and between continuous and
discrete variables. Give examples.
(b) Distinguish between (i) a statistical population and a sample; (ii) a
parameter and a statistic. Give examples.
2. In order to know the market better, the second-hand car dealership, Anzac
Garage, wants to analyze the age of second-hand cars being sold. A sample of
20 advertisements for passenger cars is selected from the second-hand car
advertising/listing website www.drive.com.au. The ages in years of the
vehicles at time of advertisement are listed below:
5, 5, 6, 14, 6, 2, 6, 4, 5, 9, 4, 10, 11, 2, 3, 7, 6, 6, 24, 11
(a) Calculate the frequency, cumulative frequency and relative frequency
distributions for the age data using the following bin classes:
- More than 0 to less than or equal to 8 years
- More than 8 to less than or equal to 16 years
- More than 16 to less than or equal to 24 years.
(b) Sketch a frequency histogram using the calculations in part (a). What can
you say about the distribution of the age of these second-hand cars? Is
there anything that concerns you about the frequency table and histogram?
Specifically, is the choice of bin classes appropriate? What needs to be
done differently?
(c) Halve the width of the bins (0 to 4, 4 to 8, etc.) and recalculate the frequency,
cumulative frequency and relative frequency distributions. Using the new
distributions and histogram, what can you now say about the distribution
of the age of second-hand cars?
Excel exercise: Download the data “AnzacG.xls” from Moodle under “Excel data
for Tutorials”. Use the Data Analysis ToolPak in Excel to generate a
histogram of Age and describe the age distribution of the second-hand cars
being sold.
Note: please label the bins with class midpoints instead of the upper limits
that Excel uses by default.
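One way to check the hand calculations in parts (a) and (c) is a short script. The sketch below is not part of the original exercise: it uses Python rather than Excel, and the helper name frequency_table is our own. It assumes each bin is open on the left and closed on the right, as in "more than 0 to less than or equal to 8 years".

```python
from itertools import accumulate

# Ages (in years) of the 20 sampled cars from Question 2
ages = [5, 5, 6, 14, 6, 2, 6, 4, 5, 9, 4, 10, 11, 2, 3, 7, 6, 6, 24, 11]


def frequency_table(data, edges):
    """Return rows (lower, upper, freq, cum_freq, rel_freq) for bins (lower, upper]."""
    bins = list(zip(edges[:-1], edges[1:]))
    # A value belongs to a bin if lower < value <= upper
    freqs = [sum(1 for x in data if lo < x <= hi) for lo, hi in bins]
    cum_freqs = list(accumulate(freqs))
    n = len(data)
    return [
        (lo, hi, f, c, f / n)
        for (lo, hi), f, c in zip(bins, freqs, cum_freqs)
    ]


# Part (a): bins of width 8; part (c): bins of width 4
wide = frequency_table(ages, [0, 8, 16, 24])
narrow = frequency_table(ages, [0, 4, 8, 12, 16, 20, 24])

for label, table in (("Width 8", wide), ("Width 4", narrow)):
    print(label)
    for lo, hi, f, c, r in table:
        print(f"  ({lo:>2}, {hi:>2}]  freq={f:>2}  cum={c:>2}  rel={r:.2f}")
```

Comparing the two printed tables shows how much of the shape of the distribution is hidden by the wider bins, which is the point of parts (b) and (c).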
3. Health expenditure
A recent report by Access Economics provides a comparison of Australian
expenditures on health with that of comparable OECD countries. Data from
that report relating to the year 2005 have been used to reproduce their Figure
2.2 (below denoted as Figure 2.1).
(a) What are the key features of these data?
(b) While this is a bivariate scatter plot, there are three variables involved:
health expenditure, GDP and population. Why account for population by
expressing health expenditure and GDP in per capita terms?
[Figure 2.1: OECD Health Expenditure and GDP. Scatter plot; horizontal axis: GDP per capita (US$000).]
4. Australian housing prices
Recent research by Dr Nigel Stapledon at the UNSW School of Economics
provides an extensive analysis of Australian housing prices since 1880. In Figure
2.2 his data are used to provide a comparison of Sydney and Melbourne housing
prices over time.
(a) What are the key features of these data?
(b) Why have prices been expressed in constant dollars?
5. Using the car data from Question 2:
(a) Calculate the mean, median and mode for this sample of data and use
these statistics to further describe the distribution of car ages.
(b) If the largest observation were removed from this data set, how would
the three measures of central tendency you have calculated change?
Excel exercise: Download the data “AnzacG.xls” from Moodle, generate the
descriptive/summary statistics on Age using Excel, and further describe the
distribution of car ages.
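As a cross-check on the hand calculations, the three measures can also be computed with Python's standard statistics module. This is a sketch outside the Excel workflow the exercise asks for, and the helper name central_tendency is our own. Part (b) is handled by dropping the single largest observation and recomputing.

```python
import statistics

# Ages (in years) of the 20 sampled cars from Question 2
ages = [5, 5, 6, 14, 6, 2, 6, 4, 5, 9, 4, 10, 11, 2, 3, 7, 6, 6, 24, 11]


def central_tendency(data):
    """Return the mean, median and mode of a sample as a dictionary."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
    }


# Part (a): all 20 observations
full = central_tendency(ages)

# Part (b): remove the largest observation, then recompute
without_largest = sorted(ages)[:-1]
trimmed = central_tendency(without_largest)

for label, stats in (("All cars", full), ("Largest removed", trimmed)):
    print(label, {name: round(value, 3) for name, value in stats.items()})
```

Comparing the two printed lines shows which of the three measures is sensitive to an extreme observation.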
6. For the following statistical population, compute the mean, range, variance
and standard deviation: 3, 3, 5, 12, 13, 14, 17, 20, 21, 21.
What would happen to each of the measures you have calculated if:
(a) …4 were added to each data point (observation)?
(b) …each data point was multiplied by 2?
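The effect of adding a constant or multiplying by a constant can be explored numerically before reasoning about it in general. The sketch below is an illustration rather than a required method, and the helper name describe is our own. It treats the ten values as a complete population, so it uses the population forms pvariance and pstdev (dividing by N rather than N - 1).

```python
import statistics

# The statistical population from Question 6
population = [3, 3, 5, 12, 13, 14, 17, 20, 21, 21]


def describe(data):
    """Return the mean, range, population variance and population std dev."""
    return {
        "mean": statistics.mean(data),
        "range": max(data) - min(data),
        "variance": statistics.pvariance(data),  # divides by N
        "std_dev": statistics.pstdev(data),      # square root of pvariance
    }


original = describe(population)
shifted = describe([x + 4 for x in population])  # part (a): add 4
scaled = describe([x * 2 for x in population])   # part (b): multiply by 2

for label, stats in (("Original", original), ("Plus 4", shifted), ("Times 2", scaled)):
    print(label, {name: round(value, 3) for name, value in stats.items()})
```

Measures of location should move with the shift, while measures of spread should respond only to the scaling; the printed output lets you confirm this for each measure.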
7. Migrant wealth
Suppose the Minister for Immigration is interested in research on the
assimilation of migrant households (a household where the chief income-
earner is foreign born). The Household, Income and Labour Dynamics in
Australia (HILDA) survey is a representative survey of Australian households.
Using 4,669 household observations for 2002 from HILDA, we find there are
3,567 households classified as Australian-born and 1,102 classified as
migrants. One key consideration is how migrant households are doing in
terms of wealth compared with Australian-born households. Using these data,
we find the following:
Summary statistics for net household wealth ($A)
                   Mean      10th percentile   Median    90th percentile
Australian-born    236,064   1,545             123,020   560,006
Migrant            248,970   1,720             131,152   524,372
(a) What can you say about the distribution of net household wealth, for both
Australian-born and migrant households, by looking at just the mean and the
median figures?
(b) More generally, what can you say about the distribution of wealth for
migrant households compared to that for Australian-born households? In
particular, which type of household has greater variation in wealth?
(c) Suppose the minister has net household wealth of $600,000. What can
you say about his or her financial circumstances relative to other Australian-
born households?
8. Anzac Garage wants to develop guidelines for setting prices of cars according
to the car’s age. They hire a business consultant who chooses a sample of
117 second-hand passenger car advertisements collected from
www.drive.com.au and retrieves data on the age and price of the cars.
(a) The business consultant first calculates the correlation coefficient between
age and price and finds it to be -0.278. Interpret this result.
(b) Sketch what you think the scatter diagram from which this correlation
coefficient was calculated might look like. Suppose the business consultant
constructs a simple linear regression model using price as the dependent
variable, and age as the independent variable. What do you think the
estimated regression line might look like here? (We will return to this
particular example later in the course and address this question more
formally.)
Excel exercise: Download the data “AnzacG.xls” from Moodle, draw a
scatterplot of price against age and compute the correlation coefficient
between age and price using Excel.
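For reference, the correlation coefficient in part (a) is the sample covariance of age and price divided by the product of their standard deviations. The sketch below shows that calculation in Python. The function name pearson_r is our own, and the demo values are hypothetical placeholders chosen only to exercise the function; they are not taken from AnzacG.xls and will not reproduce the value of -0.278.

```python
from math import sqrt


def pearson_r(x, y):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical placeholder values, NOT the AnzacG.xls data:
# price falls by exactly the same amount for every extra year of age.
demo_age = [1, 2, 3, 4]
demo_price = [40, 30, 20, 10]

r = pearson_r(demo_age, demo_price)
print(f"r = {r:.3f}")
```

To apply this to the seminar data, the two demo lists would be replaced with the Age and Price columns exported from the spreadsheet.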