DTSC71-200 Assignment 2

Date: 2023-11-21

Data Science DTSC71-200 Semester 233
Assignment 2 – KNN and Clustering (20% of grade). Due: 3/12/2023
Assignment description
The same bank has hired you again. Management now wants to analyse the same potential-customer data you analysed
before using another algorithm, KNN, to see whether it can improve the classification between the two groups of
people. They also want to know whether clustering could identify natural groupings that bring some insight into the
classification of the samples.
The dataset can be found at UCI:
https://archive-beta.ics.uci.edu/dataset/2/adult or https://archive.ics.uci.edu/ml/datasets/Adult
As this is the same dataset used in Assignment 1, you can use the same dataframe as before. However, you may
need to modify the features so that distances can be measured when training the KNN models and the K-means
models.
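Distance-based methods need every feature on a comparable numeric scale. The assignment itself must be written in R Markdown, so the Python below is only a language-neutral sketch of one possible preparation: min-max scaling for numeric columns and one-hot encoding for categorical ones. The rows, column names and helper functions are invented for illustration and are not part of the assignment.

```python
# Toy rows shaped like a few Adult-dataset columns (values invented for illustration).
rows = [
    {"age": 25, "hours_per_week": 40, "sex": "Male"},
    {"age": 50, "hours_per_week": 60, "sex": "Female"},
    {"age": 75, "hours_per_week": 20, "sex": "Male"},
]

def min_max_scale(values):
    """Rescale numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Return (sorted categories, one list of 0/1 indicators per value)."""
    cats = sorted(set(values))
    return cats, [[1.0 if v == c else 0.0 for c in cats] for v in values]

def prepare(rows, numeric, categorical):
    """Build one numeric feature vector per row (hypothetical helper)."""
    columns = [min_max_scale([r[name] for r in rows]) for name in numeric]
    features = [list(vals) for vals in zip(*columns)]
    for name in categorical:
        _, encoded = one_hot([r[name] for r in rows])
        for vec, extra in zip(features, encoded):
            vec.extend(extra)
    return features

X = prepare(rows, numeric=["age", "hours_per_week"], categorical=["sex"])
for vec in X:
    print(vec)
```

With a train/test split, the minimum and maximum (and the category list) would normally be taken from the training rows only and then applied to the test rows.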
Your tasks in this assignment are:
- Use KNN to come up with the best possible classification for the test set. The KNN classifier needs to classify
each instance into either >50K or <=50K for the predicted income. Remember that during training, only the
training data can be used to build the models. Classifier metrics (accuracy, precision, recall and ROC curves)
should be assessed using the test set.
- Use K-means (and/or K-medoids) to find natural clusters for the dataset, justifying the choice of the number
of clusters using the appropriate metrics (silhouette and elbow plots).
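The first task can be sketched as follows, assuming features have already been scaled. This is a minimal, standard-library Python illustration rather than the required R Markdown solution; the data, the function names `knn_score` and `evaluate`, and the two thresholds are invented for the example. The score is the fraction of the k nearest training points labelled >50K, and each threshold yields one point (false-positive rate, true-positive rate) on a ROC curve.

```python
import math

# Toy one-feature data, invented for illustration (not the Adult dataset).
# Label 1 stands for ">50K" and 0 for "<=50K".
train_X = [[0.0], [0.1], [0.2], [0.4], [0.5], [0.6], [0.8], [0.9]]
train_y = [0, 0, 0, 1, 0, 1, 1, 1]
test_X = [[0.05], [0.28], [0.47], [0.56], [0.85]]
test_y = [0, 0, 1, 0, 1]

def knn_score(train_X, train_y, point, k):
    """Fraction of the k nearest training points (Euclidean distance) labelled 1."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], point))
    return sum(label for _, label in ranked[:k]) / k

def evaluate(y_true, scores, threshold):
    """Metrics on held-out data when predicting 1 for scores >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = pairs.count((1, 1))
    fp = pairs.count((0, 1))
    tn = pairs.count((0, 0))
    fn = pairs.count((1, 0))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,  # also the ROC true-positive rate
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # ROC false-positive rate
    }

# The model only ever sees training data; the test set is used for scoring alone.
scores = [knn_score(train_X, train_y, x, k=3) for x in test_X]
at_half = evaluate(test_y, scores, threshold=0.5)
at_strict = evaluate(test_y, scores, threshold=0.9)
print(at_half)
print(at_strict)
```

On this toy data both thresholds give the same accuracy (0.8), but the 0.5 threshold has recall 1.0 with precision 2/3, while the 0.9 threshold has precision 1.0 with recall 0.5, which is why accuracy alone may not settle the choice.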
Deliverables: 2 PDF reports and the corresponding Rmd files
The submission should include 2 PDF files, both produced from Rmd files (with different chunk options). The 2 PDF files
are described below:
- Technical Report: All the code and the results (including partial results) should be visible. This will be used
by the data analytics team at the bank.
- Management Report: a partial report with only the items necessary to help managers understand how the
model might work for them. The management report should have no more than 3 pages.
Remember to use Rmd files to produce both reports. Reports not created directly from the Rmd files may
be penalised.
Tips
You should try different values of K for each model. You need to think about how you are going to choose K, and how
many tests you need to run in order to achieve the best classifier.
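One common way to choose K without touching the test set is cross-validation on the training data. The sketch below uses leave-one-out cross-validation, picked here only because it is deterministic and short; the brief does not prescribe a particular scheme. It is plain Python for illustration (the submission itself needs R Markdown), and the data and function names are invented.

```python
import math

# Invented one-feature training set; label 1 stands for ">50K", 0 for "<=50K".
train = [([0.00], 0), ([0.11], 0), ([0.23], 0), ([0.36], 1),
         ([0.50], 0), ([0.63], 1), ([0.79], 1), ([0.92], 1)]

def knn_predict(neighbours, point, k):
    """Majority vote among the k nearest labelled points (odd k avoids tied votes)."""
    ranked = sorted(neighbours, key=lambda pair: math.dist(pair[0], point))
    votes = sum(label for _, label in ranked[:k])
    return 1 if 2 * votes > k else 0

def leave_one_out_accuracy(data, k):
    """Hold each training row out in turn and predict it from the remaining rows."""
    hits = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, x, k) == y
    return hits / len(data)

accuracy_by_k = {k: leave_one_out_accuracy(train, k) for k in (1, 3, 5)}
best_k = max(accuracy_by_k, key=accuracy_by_k.get)  # the first (smallest) K wins a tie
print(accuracy_by_k, best_k)
```

Here K=1 scores 0.625 while K=3 and K=5 both score 0.75, so the sketch keeps K=3. Accuracy is used only as an example criterion; recall or precision could be substituted if one class matters more.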
- Compare the performance using different Ks and state which one you adopted (and why).
- In the reports, explain why you chose a particular threshold. Remember that metrics can often
contradict each other (e.g., total accuracy is different from recall or precision, which will favour one class
over the other).
- For the management report, try to use simple words (avoid jargon) and lots of graphs/tables (visualisations).
- You are allowed to reuse code from the workshops. You can also look for help on the Internet, but
remember to follow the “Academic Integrity Guidelines in Coding” (pdf file in the assessment section).
- For DTSC71-200, compare at least 5 different KNN models using accuracy, precision, recall and ROC curves.
Compare at least 5 different K-means models using silhouette coefficients and elbow
plots.
- The discussions and justification of which model should be used are very important.
Rubric

Grade bands: High Distinction (>=85%), Distinction (75~84%), Credit (65~74%), Pass (50~64%), Fail (<50%)

Management report content (Weight: 30%)
High Distinction:
1. Adequate objectives are clearly stated for the data analysis.
2. A final model is clearly presented with appropriate justification.
3. The performance of the models is discussed clearly and thoroughly.
Distinction:
1. Objectives are stated for the data analysis, with room for improvement.
2. A final model is presented and acceptably justified.
3. The performance of the models is discussed clearly with room for improvement.
Credit:
1. Sufficient Objectives are mentioned but with little detail.
2. A final model is presented with little justification.
3. The performance of the model is discussed briefly.
Pass:
1. Some Objectives are mentioned.
2. A final model is presented with no justification.
3. The performance is presented but not discussed.
Fail:
1. No Objectives are mentioned.
2. There is no single final model being presented.
3. The performance of the models is not clearly reported.

Management report Style and Presentation (Weight: 20%)
High Distinction:
1. Visualisations are outstanding and effectively used to highlight important results.
2. Size and language are very appropriate for an ML report.
Distinction:
1. Visualisations are effectively used to highlight most results.
2. Size and language are reasonable for an ML report.
Credit:
1. Visualisations are used but could be better presented.
2. Size and language are acceptable for an ML report.
Pass:
1. Visualisations are used in a very limited way.
2. Size and language are almost within the expected quality of an ML report.
Fail:
1. Visualisations and/or tables are not used.
2. Size and language are not appropriate for an ML report.

Technical report / Modelling and Analysis (Weight: 40%)
High Distinction:
1. Multiple models are constructed and assessed for the business task. Example predictions are shown and explained.
2. Data wrangling and EDA are used and are appropriate for each ML technique.
3. All the metrics required were used to compare the models.
4. Correct use of train/test and cross-validation in all models.
Distinction:
1. Multiple models are constructed and assessed for the task.
2. Data wrangling and EDA are used and are mostly appropriate for each ML technique.
3. Most of the metrics required were used to compare the models.
4. Correct use of train/test and cross-validation for most of the models.
Credit:
1. The reasoning for multiple models is unclear.
2. Data wrangling and EDA are used, with room for improvement.
3. Limited metrics were used to compare the models.
4. Correct use of train/test and cross-validation for at least some of the models.
Pass:
1. Only one or two models were built and assessed.
2. Data wrangling and EDA are used, with some flaws for specific ML techniques.
3. Very limited metrics were used to compare the models, without discussions.
4. Used train/test but no cross-validation.
Fail:
1. Only one model is constructed, without any other consideration.
2. Major flaws in the data used in modelling.
3. No metrics were used to compare the models.
4. No cross-validation and incorrect use of train/test in modelling.

Technical report / Coding (Weight: 10%)
High Distinction:
1. All code is well organised in functions or chunks in the notebook, and the sequence is very clear. Relevant comments are included to highlight the purpose of sections of code.
Distinction:
1. All code is well organised in functions or chunks in the notebook. Relevant comments are included to highlight the purpose of sections of code.
Credit:
1. The code is reasonably organised in chunks. The code is mostly commented.
Pass:
1. The code works, but it is not very well organised. The code is commented on, with few instances of irrelevant comments.
Fail:
1. The code has major flaws. Comments are mostly absent or irrelevant.