Predictive Analytics and Machine Learning: Assignment 2
Department of Econometrics and Business Statistics, Monash University
Due Date: 24th May 2023 at 4:30PM
1 Data
Abalone are sea-dwelling animals, also known as sea snails. This assignment uses abalone data based
on the study “Extending and benchmarking Cascade-Correlation”. Measuring the age of an abalone is a
time-consuming task that involves dissecting the animal and counting the number of rings through the shell (the
age in years is approximately equal to the number of rings plus 1.5). Instead, marine biologists use a set of
easy-to-collect attributes of each abalone to predict its age.
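The ring-count approximation quoted above can be written as a one-line conversion. The sketch below is for illustration only: the function name rings_to_age is hypothetical, and Python is used purely to show the arithmetic (the assignment itself is expected to be done in R).

```python
def rings_to_age(rings: int) -> float:
    """Approximate age in years from a ring count (age is roughly rings + 1.5)."""
    if rings < 0:
        raise ValueError("ring count cannot be negative")
    return rings + 1.5


# Example: an abalone with 9 rings is roughly 10.5 years old.
example_age = rings_to_age(9)
print(example_age)
```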
The task in this assignment is to provide a model to predict the age of abalone accurately based on such
predictors. The data consists of a total of 5000 abalones. The full dataset that you MUST use for this
assignment is available on Moodle under the name abalone.rds.
Data on the following variables is available.
Response variable
• age: age of the abalone in years.
Predictors
• sex: Male, Female or Infant.
• length: Longest shell measurement.
• diameter: Perpendicular to length.
• height: Height of the abalone with meat in the shell.
• w_weight: Weight of the abalone with meat in the shell.
• shu_weight: Weight of meat.
• v_weight: Gut weight.
• she_weight: Shell weight.
2 Task
The task is to compare three methods that have been taught in this unit. You MUST choose one method
from each of the three following groups:
Group 1: Regression; Subset selection methods; Lasso; Ridge regression; Elastic net.
Group 2: Trees; k-Nearest neighbours.
Group 3: Bagging; Random forest; Boosting; Neural networks.
The idea is for you to compare the prediction performance of the three selected methods and make a case as
to which works best for this particular data problem.
It is your choice to select a training and validation sample, and to decide how to evaluate predictive
performance. Be clear about the steps you have followed and document each of these steps in your report.
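One possible way to organise such a split and evaluation is sketched below. It is an illustration only, not a required approach: the data are simulated stand-ins (not abalone.rds), the 80/20 split, the seed, and the helper names train_validation_split, rmse and mae are all assumptions, and Python is used purely to show the logic (the assignment itself is expected to be done in R).

```python
import math
import random


def train_validation_split(rows, validation_fraction=0.2, seed=2023):
    """Shuffle a copy of rows with a fixed seed, then split off a validation set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def rmse(actual, predicted):
    """Root mean squared error: penalises large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in years."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical (length, age) pairs, simulated for illustration only.
rng = random.Random(1)
rows = []
for _ in range(200):
    x = rng.uniform(0.1, 0.8)
    rows.append((x, 5 + 12 * x + rng.gauss(0, 1)))

train, validation = train_validation_split(rows)

# Baseline "model": always predict the mean age seen in the training sample.
mean_age = sum(age for _, age in train) / len(train)
val_actual = [age for _, age in validation]
val_predicted = [mean_age] * len(validation)

print(f"validation RMSE: {rmse(val_actual, val_predicted):.3f}")
print(f"validation MAE:  {mae(val_actual, val_predicted):.3f}")
```

Fixing the seed makes the split repeatable, and a naive baseline such as the training mean gives a reference point against which the three selected methods can be judged.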
You need to submit a short report of at most 3000 words. Your R code and additional work not crucial to
the analysis can be included in an Appendix (this will not count towards the word limit).
3 Guidance
The assignment will be divided into three parts. To assist you, a list of questions is provided below. These
are designed to prompt you to think about the analysis and will influence the grading of the assignment. If
you can think of issues not listed here then you are encouraged to address them.
3.1 Data preparation (5 marks)
• Is the data clean? Are there missing values or outliers?
• Can you observe any patterns from simple exploratory analysis including summary statistics and basic
plots?
• Are all plots clearly presented and correctly explained?
• How can these patterns inform the models that you will choose?
• How will you ensure the data can be reproduced by somebody with knowledge of the techniques you
will use?
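A minimal sketch of the kind of check the first question above has in mind: counting missing entries and flagging candidate outliers with the 1.5 × IQR rule. The values are made up for illustration and are not taken from abalone.rds, the helper names count_missing and iqr_outliers are hypothetical, and Python is used only to show the logic (the assignment itself is expected to be done in R). The IQR rule is one common convention, not the only defensible choice.

```python
import math
import statistics


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(values):
    """Count the missing entries in a column."""
    return sum(1 for v in values if is_missing(v))


def iqr_outliers(values, multiplier=1.5):
    """Return observed values outside [Q1 - m*IQR, Q3 + m*IQR]."""
    observed = [v for v in values if not is_missing(v)]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    spread = q3 - q1
    lower, upper = q1 - multiplier * spread, q3 + multiplier * spread
    return [v for v in observed if v < lower or v > upper]


# Hypothetical height measurements, invented for illustration only.
heights = [0.10, 0.12, 0.11, 0.13, None, 0.14, 0.12, 0.00, 1.13, 0.15, float("nan"), 0.11]

print("missing:", count_missing(heights))
print("flagged:", iqr_outliers(heights))
```

Flagged values are candidates for inspection rather than automatic removal; whatever is done with them should be documented so that the analysis can be reproduced.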
3.2 Description of the models (8 marks)
• Have you motivated the use of each of the selected methods?
• What are the parameters of the models?
• How are these parameters estimated?
• Are the limitations of these methods clearly discussed?
3.3 Model comparison (8 marks)
• Have you described the specific models that resulted after estimation?
• Have you clearly described how you selected the tuning parameters for each of the methods? For
instance, how did you pick the number of trees in a random forest, or how did you pick the number of
layers in a neural network, etc.
• Have you employed any diagnostics after fitting the models?
• Have you discussed and motivated the accuracy measures that you will use?
• Have you clearly established which model is best in terms of in-sample and out-of-sample accuracy?
• Figures and tables can be useful outputs for this section.
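As an illustration of how a tuning parameter might be selected, the sketch below picks the number of neighbours k for a k-nearest-neighbours regression by 5-fold cross-validation. It is a sketch under stated assumptions rather than a prescribed method: the data are simulated with a single predictor, the candidate grid, fold count, seed, and the helper names knn_predict and cross_validated_rmse are all hypothetical, and Python is used only to show the logic (the assignment itself is expected to be done in R).

```python
import math
import random


def knn_predict(train, x, k):
    """Predict y at x as the mean response of the k nearest training points."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cross_validated_rmse(rows, k, n_folds=5, seed=2023):
    """Average held-out RMSE of kNN over n_folds folds (same folds for every k)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    fold_errors = []
    for i, held_out in enumerate(folds):
        # Train on every fold except the one currently held out.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        squared = [(y - knn_predict(train, x, k)) ** 2 for x, y in held_out]
        fold_errors.append(math.sqrt(sum(squared) / len(squared)))
    return sum(fold_errors) / n_folds


# Hypothetical (predictor, age) pairs, simulated for illustration only.
rng = random.Random(7)
rows = []
for _ in range(150):
    x = rng.uniform(0.1, 0.8)
    rows.append((x, 5 + 12 * x + rng.gauss(0, 1.5)))

candidates = [1, 3, 5, 10, 20, 40]
cv_errors = {k: cross_validated_rmse(rows, k) for k in candidates}
best_k = min(cv_errors, key=cv_errors.get)

for k in candidates:
    print(f"k={k:>2}  CV RMSE={cv_errors[k]:.3f}")
print("selected k:", best_k)
```

Using the same folds for every candidate keeps the comparison fair, and a table or plot of the cross-validated error against the tuning parameter is a natural output to include in the report.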
3.4 Conclusion (4 marks)
• Is the analysis robust to minor changes in the methodology?
• Are any assumptions made for the analysis or in drawing conclusions? If so, are these clearly explained?
• Does the report clearly summarise the findings from the analysis?
• Is your report a cohesive story with an interesting conclusion or does the report simply list everything
that was attempted?
4 Submission
This assignment is a group assignment. The maximum group size is four people. You may form groups
with students from different tutorial groups and from different unit codes. A single soft copy should be
submitted with a group assignment cover page added to the front. All assignments should be submitted via
Moodle.
Peer review of your contribution to your team will be taken into consideration when marking
the assignment. As part of your workload for this unit you are expected to attend your
group meetings, produce code, help with the writing and actively contribute to your group.
Students who have allocated themselves to a group on Moodle but have not contributed to the
assignment solution submitted by the group MUST be reported to the chief examiner.