MSCI562: Coursework 1 and 2 description
Assessment
1. Your ability to correctly use the tools that we covered in the course
2. Your ability to draw and justify the correct conclusions from these tools
3. Your ability to justify how conclusions/findings from previous (modelling) steps lead you to the
actions/choices you made in subsequent (modelling) steps, or to the revision of previous decisions you
have made. Reports that do not document the steps followed and the reasons why these were chosen
will receive minimal marks, even if the final answer/recommended model is sensible. Explain your
reasoning clearly and in good English.
4. Your ability to address the questions posed in the coursework based on an intelligent interpretation of
the evidence provided in the previous two steps.
5. Your ability to express and justify your key findings succinctly. It is always possible to fill pages of
analysis with figures, tables and output from software. What is much more valuable is to be able to
convey convincingly the key findings and recommendations in a given page limit.
For all of the above reasons, do not simply replicate in your report the process followed during the workshops!
The objective of the workshops is to introduce you to the different techniques discussed during the lectures.
Workshops are not designed to provide you with a roadmap to answer the coursework.
Do not include screenshots from any software or any other information about commands you used, or
options to functions, or how you drew figures etc. You will be simply wasting valuable space.
Page limits
• Both pieces of coursework must be submitted as PDF files set in an 11-point typeface.
• The first piece of coursework has a page limit of 8 pages.
• The second coursework has a page limit of 6 pages.
• Page limits are strict and they include appendices (which I strongly recommend that you do not use).
• If your report exceeds the page limit, your mark will be affected negatively, as you will be failing on the
last assessment criterion (see above).
Report Structure
• The coursework does not ask for a business report and as such it does not require an executive summary,
a cover page, table of contents, or even an introduction describing the context of the task.
• You do not need to outline the CRISP-DM process, discuss expected project benefits and risks, create a
cognitive map etc. This is not relevant in this case since you will not be collaborating with a
“problem-owner” in the process of specifying a data mining project and assessing its feasibility. In the
present setting the problem is already specified for you.
• It is essential in both courseworks to provide a Conclusions section that summarises your findings
and how these relate to the coursework objectives.
Plagiarism These are individual pieces of assessment, and you should ensure that your report reflects your
own work exclusively. All reports go through automated software to detect plagiarism from a variety
of sources (including past and current students’ reports as well as online resources, conference and journal
publications etc.). The consequences of plagiarism are very serious.
Deadlines Check the Moodle page of the course.
Late work will be penalised according to the department code of practice. Any request for an extension
beyond the deadline will be accepted if appropriate justification is provided in advance.
Problem Description
The objective of this report is to design a classifier that acts as a spam filter. Specifically, we will be
considering the spambase dataset, which was also used during one of the workshops. This dataset is also
contained in the textbook Elements of Statistical Learning (under the Data section); and this is the version
provided for this coursework.
The dataset consists of information from 4601 emails sent to Hewlett-Packard. The concept of “spam” is
very diverse: advertisements for products/web sites, make money fast schemes, chain letters, etc. Therefore
detecting spam is non-trivial. The dataset contains information for 58 variables overall. The last variable
“spam” is the class label (1: spam; 0: good email) we want to predict. All 57 predictor variables are
numerical.
• The first 48 predictors relate to the frequency of different words in an email.
• The next 6 relate to special characters (such as the dollar sign, parentheses, exclamation marks,
brackets, etc).
• The last three contain information about consecutive capital letters in an email.
For a detailed description of these variables refer to the UCI repository documentation (this is a text file).
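The column layout described above can be sketched as follows. This is only an illustration: the names `word_1`, `char_1`, `capital_1` and the helpers `column_groups` and `split_row` are hypothetical placeholders, and the real variable names are the ones given in the UCI documentation. The sketch assumes nothing beyond the ordering stated above (48 word-frequency columns, 6 character columns, 3 capital-letter columns, then the `spam` label).

```python
# Placeholder names only; the real ones are listed in the UCI documentation.
N_WORD, N_CHAR, N_CAPITAL = 48, 6, 3


def column_groups():
    """Return placeholder names for the 58 columns, grouped as described above."""
    return {
        "word": [f"word_{i}" for i in range(1, N_WORD + 1)],
        "char": [f"char_{i}" for i in range(1, N_CHAR + 1)],
        "capital": [f"capital_{i}" for i in range(1, N_CAPITAL + 1)],
        "label": ["spam"],
    }


def split_row(row):
    """Split one 58-value record into (57 predictors, class label)."""
    if len(row) != N_WORD + N_CHAR + N_CAPITAL + 1:
        raise ValueError("expected 58 values per email")
    *predictors, label = row
    # The label is the last value: 1 for spam, 0 for good email.
    return [float(v) for v in predictors], int(float(label))
```

Grouping the columns this way makes it easy to look at one family of predictors at a time rather than all 57 at once.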
Overall objectives
Our primary interest is to understand the main features that distinguish spam from non-spam (good) email,
and to design an effective spam filter. Marking good mail as spam is very undesirable, therefore the spam
filter must minimise errors of this type.
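The error type singled out above (good mail marked as spam) is usually measured as a false positive rate when spam is treated as the positive class. A minimal sketch, assuming labels coded as in the dataset (1: spam; 0: good email); the function name and the toy labels are invented for illustration:

```python
def false_positive_rate(y_true, y_pred):
    """Fraction of good emails (label 0) that were predicted as spam (label 1)."""
    good_predictions = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not good_predictions:
        raise ValueError("no good emails in y_true")
    return sum(1 for p in good_predictions if p == 1) / len(good_predictions)


# Toy labels, not taken from the spambase data: 4 good emails, 1 of them flagged.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0]
rate = false_positive_rate(y_true, y_pred)
```

Overall accuracy alone would hide this quantity, since it treats both kinds of error as equally costly.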
Coursework 1: Exploratory data analysis and visualisation
The list below contains tasks and issues that you should consider and be able to answer, but it is very
important to understand that the list is not exhaustive. This means that the data may contain other
interesting features that are not mentioned or implied by any of these questions. It is your responsibility
to identify (any of) those.
• Which variables appear to be important for the task at hand, and why? Support your claims with
appropriate visualisations that document whether and how important each variable is.
• Is it possible to combine variables or consider interactions between variables in order to obtain a better
understanding of the relationships in the data, and/ or what distinguishes spam?
• Are different variables related, and which variables convey information similar to that provided in other
variable(s)?
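One possible starting point for the last question is to list pairs of predictors with a strong linear correlation. The sketch below is an assumption-laden illustration rather than a recommended method: the 0.8 cut-off is arbitrary, the helper names are hypothetical, and Pearson correlation only captures linear association, so a rank-based measure may suit skewed frequency data better.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, treated as 0 here
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """columns maps a name to its values; return pairs with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Toy columns, invented for illustration.
toy = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8.5], "c": [4, 1, 3, 2]}
pairs = correlated_pairs(toy)
```

A pair flagged here is only a candidate for redundancy; whether one variable can stand in for the other still has to be argued from the data.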
The page limit on this report is such that it is impossible to include all the visualisations for each of the
57 predictors. You will therefore need to decide carefully what to include in the report. I expect to see a
justification of at least one variable that you deem to be “unimportant” (or less important) to ensure that
your argument is correct.
Consider PCA, MDS, and Isomap to produce visualisations of this data that aim to show the underlying
structure of the classes, and that could allow us to discuss whether classes are separable and how best to
separate them. Discuss choices that you made for MDS and Isomap and how these affect the resulting
visualisation. Discuss which visualisation is the most informative and why.
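As an illustration of the first of these techniques, the sketch below computes a two-dimensional PCA projection with NumPy. It rests on several assumptions: the data are standardised before projection (one choice among several, and one that would itself need justifying), the function name `pca_scores` is hypothetical, and the input is a synthetic stand-in rather than the spambase data.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Project standardised data onto its leading principal components."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal directions; singular values are in descending order.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 57))  # synthetic stand-in, not the spambase data
scores, explained = pca_scores(X_demo)
```

Plotting the two score columns coloured by class label gives the kind of picture the task asks for, and the explained-variance proportions indicate how much of the structure two dimensions can show.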
The coursework requires you to write a report explaining your findings. This means that you need to
explain each figure, table or “number” you include in the report. In other words, including a relevant
figure without explaining the conclusions you draw from it will get you no marks. Do not forget that a
Conclusions section is required.
Coursework 2: Statistical Modelling
Perform the next steps using the training data exclusively.
• For the different classifiers we covered in the course attempt to develop appropriate models for this
problem.
• Discuss the approach and the different settings you used to develop such models for every type of
classifier. Emphasis should be placed on explaining why you considered the different choices/ settings
you made during the model development phase. (Variable selection method is part of this process
also.)
• For each classification method develop one or a few candidate models that you think are promising
before providing a final recommendation of the most appropriate model. You do not need to discuss
every model you tried in detail, but you must include the results for the important steps in the process
that led you to the final recommendations. I am particularly interested in understanding the steps
you followed and the justification for these. (Refer to the CRISP data mining process discussed in
Chapter 1 of the Guide to Intelligent Data Analysis).
• After using the training set data to develop different classifiers and perform parameter tuning and/or
model selection for these, comment on the expected generalisation performance. In other words, how
well do you believe these models will perform when they are deployed?
• Marking good mail as spam is very undesirable, therefore the spam filter must minimise errors of this
type. Comment on which is the recommended classifier and how to use it to achieve:
1. No good email is classified as spam;
2. At most 2% of good email is classified as spam.
Does the recommended classifier change depending on the above goals? If no such objective was given,
which classifier would you recommend, how would you justify this recommendation, and how should
this classifier be used?
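For a classifier that outputs a spam score, one way to approach the two goals above is to choose the decision threshold from the scores of the good emails. The sketch below assumes such a score is available (higher means more spam-like) and that an email is flagged only when its score is strictly above the threshold; the function name and the toy numbers are invented for illustration.

```python
import math


def threshold_for_max_fpr(scores, labels, max_fpr):
    """Return a threshold t such that flagging score > t marks at most
    max_fpr of the good emails (label 0) as spam on the data supplied."""
    good = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not good:
        raise ValueError("no good emails")
    # Number of good emails we may misclassify; rounding down is conservative.
    allowed = math.floor(max_fpr * len(good))
    # Everything strictly above the (allowed + 1)-th largest good score is flagged.
    return good[min(allowed, len(good) - 1)]


# Toy scores, not from the spambase data: five good emails, three spam.
scores = [0.05, 0.10, 0.20, 0.40, 0.90, 0.30, 0.80, 0.95]
labels = [0, 0, 0, 0, 0, 1, 1, 1]
t_zero = threshold_for_max_fpr(scores, labels, 0.0)   # no good email flagged
t_loose = threshold_for_max_fpr(scores, labels, 0.2)  # at most 1 of 5 flagged
```

A threshold chosen this way only guarantees the stated rate on the data used to choose it, so the rate achieved on new emails can be higher.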
After you have decided on the recommended model for each type of classifier apply each such model on the
test data. Comment on the reported performance on the test set. Does this agree with what you expected?
If not, how can you justify the discrepancies?
The coursework requires you to write a report explaining your findings. This means that you need to
explain each figure, table or “number” you include in the report. In other words, including a relevant
figure without explaining the conclusions you draw from it will get you no marks. Do not forget that a
Conclusions section is required.