ST952 INTRODUCTION TO STATISTICAL PRACTICE 2021
ASSESSED COURSEWORK 1
Deadline: 9pm on Thursday, 11th November (week 6).
Your reports should be submitted electronically on Moodle together with individual peer
review forms (see details below).
Details of the Dataset
Automobile dataset: These data are taken from the dataset available on the UCI web
site¹ maintained by Blake and Merz (1998)². The data were donated by
Schlimmer (1985)³. The original dataset contained 26 variables but you are required to
analyse only a selection of them. The task is to build a suitable regression model to predict
the MPG and to provide some interpretation of the model parameters.
Table 1: Description of the Data
Name        Description            Details (Units or categories)
MPG         Miles per Gallon       MPG                            miles per gallon
Doors       Number of Doors        two                            2
            (excluding boot)       four                           4
Body        Body type              wagon                          in UK an estate car
                                   sedan                          in UK a saloon
                                   convertible                    retracting roof
                                   hardtop                        removable roof/coupe
                                   hatchback                      rear hinged boot at top
Drive       Drive System           fwd                            Front Wheel Drive
                                   rwd                            Rear Wheel Drive
                                   4wd                            Four Wheel Drive
EngLoc      Engine Location        front                          Engine is in front
                                   rear                           Engine is in the rear
WhlBase     wheel base             distance between front         inches
                                   and rear wheels
Len         Length                 Length of Car                  inches
Wid         Width                  Width of Car                   inches
Ht          Height                 Height of Car                  inches
CurbWt      Curb Weight            Kerb weight is the weight      lbs
                                   of the car with all brake
                                   fluid, oil, water and some
                                   fuel, i.e. without
                                   passengers but ready to
                                   drive.
Table continued on next page.
1 http://www.ncc.up.pt/~ltorgo/Regression/DataSets.html
2 Blake, C.L. & Merz, C.J. (1998). UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
3 Schlimmer, Jeffrey C. Abstract from Ward’s 1985 Automotive Yearbook. https://archive.ics.uci.edu/ml/datasets/Automobile
ST952: Assignment 1
Table 1 Continued
Name        Description            Details (Units or categories)
Cyl         Number of Cylinders    two                            2
                                   three                          3
                                   four                           4
                                   five                           5
                                   six                            6
                                   eight                          8
                                   twelve                         12
EngSiz      Engine Size            Engine Size                    cubic inches
FuelSys     Fuel System            1bbl                           one barrel carburettor*
                                   2bbl                           two barrel carburettor system*
                                   4bbl                           four barrel carburettor system*
                                   idi                            diesel injection*
                                   mfi                            mechanical fuel injection*
                                   mpfi                           multi port fuel injection*
                                   spdi                           single point direct injection*
                                   spfi                           sequential point fuel injection*
Bore        Cylinder diameter      Diameter of each cylinder      inches
Stroke      Stroke Length          Length of piston               inches
                                   movement in the cylinder
CompRatio   Compression Ratio      Ratio of maximum to
                                   minimum volume in the
                                   cylinder during the piston
                                   cycle
Horsepower  A measurement of power Equivalent to 745.7 Watts      Came about when there was a
                                                                  need to compare the work of a
                                                                  steam engine to that of a draft
                                                                  horse.
PeakRPM     Peak Revs per minute   The revolutions per            count per minute
                                   minute at which engine is
                                   performing at its peak
Price       Price of the Car       Data taken from 1985 so        US dollars ($)
                                   assume price is as in
                                   1985
* Meaning not given on data description but possible meaning from exploring car specifications
Availability of the data
The data are available on Moodle as an R data frame in the file AutoUSA85.Rdata.
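As a minimal sketch of how such a file might be read into R: the brief does not state the name of the object stored inside AutoUSA85.Rdata, so the example below discovers it at run time instead of assuming it. The helper name load_first_object and the assumption that the file sits in the working directory are illustrative only.

```r
# Load an .Rdata file into a private environment and return the first
# object it contains. load() returns the names of the restored objects,
# so the object name does not have to be known in advance.
load_first_object <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)
  get(object_names[1], envir = env)
}

# File name taken from the brief; its location in the working directory
# is an assumption.
data_file <- "AutoUSA85.Rdata"
if (file.exists(data_file)) {
  auto <- load_first_object(data_file)
  str(auto)             # variable names and types, to compare with Table 1
  print(summary(auto))  # quick numerical summary of every column
}
```

Loading into a separate environment avoids silently overwriting any object of the same name already present in the workspace.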
Analysis Required
Main question: how is the MPG of a 1980s car related to its other
characteristics, and is it possible to use a regression model to reliably
predict the MPG?
You are to conduct an analysis of this dataset in R, in the groups to which you have been
assigned. These can be found on the Moodle area for this assignment. An outline of the
steps you should take in your analysis is given below.
1. Begin with an exploratory analysis of the data. Using appropriate numerical, tabular
or graphical summaries, describe the distribution of the variables and investigate
potential relationships.
2. Use regression models to investigate the relationship between the explanatory
variables and the dependent variable MPG. Consider whether the
response/explanatory variables should be transformed and pay attention to the
possible existence of outliers. Select a clear final model and give your rationale for
doing so.
3. Fully investigate the validity of the model and any other potential issues there may
be with the data or the model. Clearly comment on your conclusions and the success
of your model.
4. Illustrate the usefulness of the model by giving an interpretation of the parameters
and by demonstrating how the MPG of a car may be predicted from the remaining
variables. This explanation should be understandable to a non-specialist.
5. Write a joint report on your findings, describing the data, your analyses and your
conclusions.
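The analysis steps above can be sketched in R roughly as follows. This is only an illustration on simulated data, not the AutoUSA85 data: the data frame toy, the coefficients used to simulate it, the log transformation of MPG and the choice of CurbWt and Horsepower as predictors are all assumptions made for the example, not a recommended final model.

```r
# Simulated stand-in data (NOT the real dataset); column names follow Table 1.
set.seed(952)
n <- 120
toy <- data.frame(
  CurbWt     = round(runif(n, 1500, 4000)),  # lbs
  Horsepower = round(runif(n, 50, 250))
)
toy$MPG <- exp(4.4 - 0.0003 * toy$CurbWt - 0.002 * toy$Horsepower +
                 rnorm(n, sd = 0.08))

# Exploratory step: numerical summaries and pairwise correlations.
print(summary(toy))
print(cor(toy))

# Modelling step: multiple regression with a log-transformed response
# (the transformation is an assumption chosen for this illustration).
fit <- lm(log(MPG) ~ CurbWt + Horsepower, data = toy)
print(summary(fit))

# Interpretation step: on the log scale a coefficient acts multiplicatively,
# so express the CurbWt effect as a percentage change in MPG per extra 100 lbs.
pct_per_100lb <- 100 * (exp(100 * coef(fit)["CurbWt"]) - 1)
print(pct_per_100lb)

# Prediction step: a hypothetical car, with a 95% prediction interval
# back-transformed from the log scale to MPG.
new_car  <- data.frame(CurbWt = 2500, Horsepower = 100)
pred_mpg <- exp(predict(fit, newdata = new_car, interval = "prediction"))
print(pred_mpg)
```

Reporting the effect as a percentage change per 100 lbs, and the prediction as an interval in MPG, is one way to keep the explanation accessible to a non-specialist.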
Further Points
The aim of this assignment is to demonstrate what you have learnt in fitting a multiple
regression model using a statistical approach. This assignment is a statistical
assignment and NOT one on machine learning. You should use techniques that we have
covered in the module. You are welcome to investigate further techniques that relate to
regression analyses and if appropriate use those, but this is not a requirement. There is
no need, for example, to split the data into training and test sets as you should not be
using methods that require this (e.g. a neural network). Your assessment of the model
should be based on the regression model output and the diagnostics you carry out in
question 3.
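A minimal sketch of what assessing a model from its regression output and diagnostics could look like in R, with no training/test split. The data are simulated purely for illustration, and the cut-offs used to flag observations (absolute standardised residual above 2, leverage above twice the average) are common rules of thumb assumed here, not requirements of this assignment.

```r
# Simulated stand-in data (NOT the real dataset).
set.seed(952)
n <- 120
toy <- data.frame(CurbWt = runif(n, 1500, 4000))
toy$MPG <- 45 - 0.008 * toy$CurbWt + rnorm(n, sd = 2)

fit <- lm(MPG ~ CurbWt, data = toy)

# Model-level output: adjusted R-squared and residual standard error.
fit_summary <- summary(fit)
print(fit_summary$adj.r.squared)
print(fit_summary$sigma)

# Observation-level diagnostics.
std_res  <- rstandard(fit)       # standardised residuals
leverage <- hatvalues(fit)       # diagonal of the hat matrix
cooks    <- cooks.distance(fit)  # influence of each observation

# Flag observations for a closer look (rule-of-thumb cut-offs, see above).
flagged <- which(abs(std_res) > 2 | leverage > 2 * mean(leverage))
print(cbind(toy[flagged, ], std_res = std_res[flagged], cooks = cooks[flagged]))

# The four standard diagnostic plots for a fitted linear model.
par(mfrow = c(2, 2))
plot(fit)
```

Flagged observations are candidates for investigation rather than automatic removal; any decision to exclude a point should be justified in the report.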
Report Requirements
The report should be submitted as a PDF file and should consist of two parts:
a) The main part for the report itself (there is a maximum of 10 typeset A4 sides
including graphs, tables and references; minimum body-text font size 11pt, and
minimum 2cm margins all round). This document should be written for intelligent
readers who do not necessarily have advanced statistical training. It should be
neat and professional. Figures should be clearly labelled and referenced. There
should be suitably numbered headings and sub-headings.
b) A technical appendix at the end of the report giving the commented R code which
was used in order to allow the analysis to be reproduced.
The report should:
a) give only your student numbers, not names.
b) be submitted to the group assignment area on Moodle.
Finally, each group member should submit a completed peer review form to the
“Contribution” section for this assignment on Moodle. Failure to do so may result in
you losing credit, even if your fellow group members give you full credit for your
contribution. You must not discuss what you have filled in on this form with the other
members of your group. The peer review form can be downloaded from Moodle. It
contains further details of what you need to fill in. Your individual mark will be the
group mark for this assignment, potentially adjusted slightly up or down on the basis of
the peer review forms from your other group members. You should aim to be
honest and realistic in filling in this form.
Mark Scheme:
This assessment is worth 25% of your final mark on ST952. The assessment will be
based on your understanding of the problem, the competence of your analysis and the
presentation of your report. The report will be marked out of 100 and then weighted with
your other marks from the second assignment and the exam. Your mark for both
assignments averaged together must be above 50% to pass the coursework component
of the module. You must pass both the coursework component and the exam to pass the
module.
Marks for the actual analysis will be a maximum of 85 but different aspects are linked
(e.g. to find an appropriate model it may be necessary to redo some types of initial plots,
or as a result of diagnostics to revisit the model etc.) so marks may be slightly higher or
lower in some categories as appropriate to your approach. The marks below are
therefore meant as a rough guide:
(Question 1) Initial investigation of data: 20-25 marks
(Question 2) Appropriate and well explained statistical analysis and investigation to
find the final model, including use of transformations and interpretation of
tabular output: 25-30 marks
(Question 3) Residual, influential and any other diagnostics: 20-25 marks
(Question 4) Interpretation of final model: 10-15 marks
(Question 5) Report structure and presentation (including quality of tables and figures,
professionalism, use of numbered headings, page numbers, contents
page, figure labels etc.): 7 marks
Appropriate use of English language (including spelling and grammar,
clarity, and avoidance of statistical terms in their colloquial sense, e.g.
"significant"): 8 marks