Date: 2022-05-28

Information School

INF6028 Coursework 2021-22

Mining and Evaluating a Structured Dataset

1. Introduction

The assessment for INF6028 Data Mining consists of a single piece of individual coursework to assess
your ability to understand key data mining, analysis and evaluation concepts. You will be assigned a
single dataset with an associated data mining problem to solve (e.g., a regression problem). You should
first use data exploration techniques to explore the data, conduct appropriate data preparation, and
then choose two supervised data mining techniques available in KNIME to predict certain data values
and evaluate and compare their performance. You will need to select appropriate techniques, justify
your choices made at different stages of your workflow, and demonstrate that you have knowledge of
the necessary underlying data mining techniques.

You should write a 2,500-word structured report (see Section 3) that includes the following headings
(more details on how the report will be assessed are provided below):
• Introduction – introduce the prediction problem.
• Data mining theory – provide a theoretical description of the two supervised data mining methods
used in the workflow (for example, the classification or regression techniques that have been
used), why they are appropriate to the prediction task, and how their performance can be
assessed. This should include citations to relevant prior literature.
• Data exploration and preparation – describe the approaches used in the workflow to explore the
data; and perform feature selection, transformation and normalisation, where appropriate.
• Experimental setup – describe the experimental setup and the evaluation measures used in the
workflow and how the data has been handled to ensure that the models were not over-fitted.
You should explain which nodes were used in KNIME and provide a rationale for the various
parameter settings that were used. You should not, however, simply list all the modules in your
workflow and their parameters - be selective and discuss the modules most critical to solving the
data mining task.
• Results – present the results for each data mining method and compare the performance of the
different methods using graphical and tabular methods. What insights can you gain from the
models? For example, which features are the most important, and are there any outliers in the
predictions?
• Conclusion and reflections – summarise the main findings of your report and reflect on the
methods used.
Charts and tables (and their associated captions), references and appendices are not included in the
word count.

Remember: your report should be a critical evaluation of the workflow in the context of the data mining
problem posed; it should not be merely a description of what was done.

This assessment is worth 100% of the overall module mark for INF6028. A pass mark of 50 is required to
pass the module. Submission deadline: 8th June via Turnitin. See Section 4 for more general information
about Coursework Submission Requirements within the Information School.

2. The Datasets and KNIME Workflows

You will be assigned a single dataset to base your analyses and report on. Please ensure before you
start working on the assessment that you are using the correct dataset.

The datasets have been derived from Kaggle competitions and are downloadable from Blackboard in the
Assessment section. A brief description of the attributes in each dataset is given at the end of this
document. Note that in both cases the data are different to the standard Kaggle datasets – they have
been extensively modified for this year’s run of INF6028. Do not attempt to use the datasets from
Kaggle or to use/copy any of the workbooks available there – this would constitute unfair means.

Airline Passenger Satisfaction Dataset (Binary Classification)
The airline passenger satisfaction dataset consists of a single CSV file, which contains information
related to the reported satisfaction of airline passengers. The column “satisfaction” contains the
variable to be predicted.

House Prices Dataset (Regression)
The house prices dataset consists of a single CSV file containing details about 1,300 houses sold in the
city of Ames, Iowa. The column “SalePrice” contains the variable to be predicted.
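
The two datasets pose different task types (binary classification and regression), so they call for
different evaluation measures. The following is a minimal Python sketch of some commonly used measures
for each task type, included for illustration only; the coursework itself is to be carried out in KNIME.
The function names and the example labels and values are assumptions made for this sketch and are not
part of the coursework materials.

```python
import math


def classification_metrics(actual, predicted, positive):
    """Accuracy, precision, recall and F1 for a binary target.

    `positive` is the label treated as the positive class (an assumption
    of this sketch; choose whichever class matters for the problem).
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def regression_metrics(actual, predicted):
    """Mean absolute error, root mean squared error and R-squared."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
    }


# Made-up labels and prices, purely to show the call pattern.
cls = classification_metrics([1, 1, 0, 0], [1, 0, 0, 1], positive=1)
reg = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 300.0])
```

Which measure is most appropriate depends on the problem; for example, accuracy alone can be misleading
when the two classes are unevenly represented.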

3. Report Structure

You are required to produce a structured report that includes all the sections detailed in Table 1. You
must state the word count somewhere in the report. As there is a word count limit you should aim to
make your writing as concise and informative as possible. The emphasis of the report should be on the
clarity, accuracy and quality in communicating your findings. Where helpful, you may wish to state
specifically which KNIME nodes you have used but you should avoid simply listing nodes used and their
settings – be selective.

Table 1: Required content of the structured report.

Section: Structured abstract
Description: This should provide a summary of your report in a structured manner. This is not included
in the word count.
Maximum allocated marks: Required, but 0 marks

Section: Introduction
Description: This section should introduce the data mining task that is addressed in the report. You
should indicate the property/data value that is predicted and give a brief overview of the dataset and
methods used.
Maximum allocated marks: 10 marks

Section: Data Mining Theory
Description: This section should provide an overview of the algorithms for predictive data mining used
in the workflow from a theoretical aspect. Explain why they are relevant to the prediction problem.
Support your rationale by providing references to the literature where the techniques have been applied
to similar problems. Include a short discussion of the most appropriate methods for evaluating the
performance of these data mining methods.
Maximum allocated marks: 25 marks

Section: Data Exploration and Preparation
Description: This section should provide a brief description of the data and of the approaches used to
pre-process the data. You should present an investigation of the attributes (including the data value
to be predicted) and describe any data cleaning employed, including handling of missing data, data
transformations and data aggregations.
Maximum allocated marks: 10 marks

Section: Experimental Setup
Description: This section should describe the experimental design in the workflow. You should describe
the process followed in order to find the best performing model for each method and how this was
validated. For example, which KNIME nodes were used? How were they configured? Was any cross-validation
or a separate validation set used and why?
Maximum allocated marks: 20 marks

Section: Results and Discussion
Description: Present the results of the data mining process including the results of experiments to
find the best model for each data mining method. Compare the best performance of the different methods
and, if appropriate, consider which attribute contributes most to each model. Discuss the advantages
and disadvantages of the data mining methods. Which of the chosen methods produced the best model and
why?
Maximum allocated marks: 20 marks

Section: Conclusion and reflections
Description: Summarise the main findings of the analysis and reflect on the choice of methods for the
problem, for example, how might the models be improved with hindsight? Use evidence from the literature
to support your arguments.
Maximum allocated marks: 15 marks

Section: KNIME workflow
Description: You should submit your KNIME workflow(s) as a “.knar” file. Note that this can consist of
separate workflows, but they should all be saved to one file. Include your best setup for each data
mining method.
Maximum allocated marks: Required, but 0 marks. Note that 5 marks will be deducted if this is not
submitted, and its absence may make it difficult for your marker to assess your work.
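
The Experimental Setup section asks whether cross-validation or a separate validation set was used. As
an illustration of the idea behind k-fold cross-validation, the following is a minimal Python sketch
that splits row indices into k folds so that every row is used for testing exactly once. In KNIME this
is normally handled by dedicated nodes (for example X-Partitioner and X-Aggregator) rather than by
hand-written code; the function name, the seed and the fold count here are assumptions made for this
sketch.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Rows are shuffled once with a fixed seed so the split is repeatable,
    then dealt into k folds of near-equal size.
    """
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, test_fold


# Example: 10 rows, 5 folds. A model would be trained on `train` and
# scored on `test` in each iteration, and the scores then averaged.
splits = list(k_fold_indices(10, 5))
```

Because each model is scored only on rows it did not see during training, the averaged score gives a
less optimistic estimate of performance than scoring on the training data.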

4. Information School Coursework Submission Requirements

It is the student's responsibility to ensure no aspect of their work is plagiarised or the result of other
unfair means. The University’s and Information School’s Advice on unfair means can be found in your
Student Handbook, available via http://www.sheffield.ac.uk/is/current

Your assignment has a word count limit. A deduction of 3 marks will be applied for coursework that is
10% or more above or below the word count as specified above or that does not state the word count.
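
As a worked example of the word count rule, the following Python sketch applies the 10% tolerance to
the 2,500-word limit given in the Introduction. It reads "10% or more above or below" as inclusive, so
2,250 and 2,750 words both attract the deduction; that reading, the function name and the use of `None`
for a missing word count are assumptions of this sketch, not an official calculation.

```python
WORD_LIMIT = 2500          # from the Introduction of this brief
WORD_COUNT_DEDUCTION = 3   # marks


def word_count_deduction(stated_count, limit=WORD_LIMIT):
    """Return the marks deducted under the word count rule.

    `stated_count` is the word count given in the report, or None if
    the report does not state one.
    """
    if stated_count is None:
        return WORD_COUNT_DEDUCTION
    # Integer form of |count - limit| >= 10% of limit, avoiding floats.
    if abs(stated_count - limit) * 10 >= limit:
        return WORD_COUNT_DEDUCTION
    return 0
```

Under this reading, a report of 2,251 to 2,749 words that states its word count incurs no deduction.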

It is your responsibility to ensure your coursework is correctly submitted before the deadline. It is
highly recommended that you submit well before the deadline. Coursework submitted after 10am on
the stated submission date will result in a deduction of 5% of the mark awarded for each working day
after the submission date/time up to a maximum of 5 working days, where ‘working day’ includes
Monday to Friday (excluding public holidays) and runs from 10am to 10am. Coursework submitted
after the maximum period will receive zero marks.
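
The late penalty can be illustrated with a short Python sketch. It assumes the 5% deductions are
additive rather than compounding (so two working days late removes 10% of the mark awarded) and that
the caller has already counted the number of 10am-to-10am working days that have started since the
deadline; both points, and the function name, are assumptions of this sketch rather than an official
calculation.

```python
MAX_LATE_WORKING_DAYS = 5
PENALTY_PERCENT_PER_DAY = 5


def late_adjusted_mark(mark_awarded, working_days_late):
    """Apply the late-submission rule to the mark awarded.

    0 days late leaves the mark unchanged; 1 to 5 days removes 5% of
    the mark per day; more than 5 days gives zero.
    """
    if working_days_late < 0:
        raise ValueError("working_days_late cannot be negative")
    if working_days_late > MAX_LATE_WORKING_DAYS:
        return 0.0
    remaining_percent = 100 - PENALTY_PERCENT_PER_DAY * working_days_late
    return mark_awarded * remaining_percent / 100


# A mark of 60 submitted two working days late.
example = late_adjusted_mark(60, 2)  # 54.0
```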

Work submitted electronically, including through Turnitin, should be reviewed to ensure it appears as
you intended.

Before the submission deadline, you can submit coursework to Turnitin numerous times. Each
submission will overwrite the previous submission. Only your most recent submission will be assessed.
However, after the submission deadline, the coursework can only be submitted once.

Details about the submission of work via Turnitin can be found at http://youtu.be/C_wO9vHHheo

If you encounter any problems during the electronic submission of your coursework, you should
immediately contact the module coordinator and a member of the Information School Teaching Support
Team at is-teaching-support@shef.ac.uk (Julie Priestley, 0114 2222839). This does not negate your
responsibilities to submit your coursework on time and correctly.

Airline Passenger Satisfaction Dataset (Binary Classification)
The airline passenger satisfaction dataset consists of a single CSV file, which contains information
related to the reported satisfaction of airline passengers. The column “satisfaction” contains the
variable to be predicted. The file has the following variables:

id: unique passenger/trip ID number
Gender: customer gender
Customer Loyalty: whether or not the customer is a loyalty scheme cardholder
Age: customer age
Type of Travel: business or personal trip
Class: ticket class
Online check-in: whether or not the customer was able to check in online
Flight Distance: length of flight in miles
Departure/Arrival time convenient: Satisfaction with departure/arrival times
Ease of Online booking: Satisfaction with online booking
Gate location: Satisfaction with gate location
Food and drink: Satisfaction with onboard food and drinks
Seat comfort: Satisfaction with level of seat comfort
Inflight entertainment: Satisfaction with inflight entertainment offering
On-board service: Satisfaction with level of onboard service
Leg room service: Satisfaction with amount of leg room
Baggage handling: Satisfaction with baggage handling
Checkin service: Satisfaction with the check-in service
Inflight service: Satisfaction with overall inflight service
Cleanliness: Satisfaction with cleanliness of the aircraft
Departure Delay in Minutes
Arrival Delay in Minutes

Satisfaction: Satisfaction with flight/airline (binary). This is the target variable that you're trying to
predict.
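
A first exploration step for a file like this is to count missing values per attribute and to check
how the target classes are distributed. The following Python sketch shows the idea on a few made-up
rows; the column names are taken from the list above, but the rows, the target labels ("yes"/"no") and
the function names are invented for illustration and do not come from the coursework dataset. In KNIME
the same checks would be made with the built-in exploration nodes.

```python
import csv
import io
from collections import Counter


def missing_counts(rows):
    """Count empty or absent cells per column."""
    missing = Counter()
    for row in rows:
        for name, value in row.items():
            if name is None:
                continue  # surplus cells with no header; ignored here
            if value is None or value.strip() == "":
                missing[name] += 1
    return missing


def class_distribution(rows, target):
    """Count how often each value of the target column occurs."""
    return Counter(row[target] for row in rows)


# Invented sample rows, NOT taken from the coursework dataset.
SAMPLE_CSV = """id,Age,Arrival Delay in Minutes,satisfaction
1,34,0,yes
2,,12,no
3,51,,no
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
missing = missing_counts(rows)
distribution = class_distribution(rows, "satisfaction")
```

A strongly uneven class distribution or a column with many missing values would both influence the
choice of preparation steps and evaluation measures.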

House Prices Dataset (Regression)
The house prices dataset consists of a single CSV file containing details about 1,300 houses sold in the
city of Ames, Iowa. The column “SalePrice” contains the variable to be predicted. The file has the
following variables:

MSZoning: The general zoning classification
    A     Agriculture
    C     Commercial
    FV    Floating Village Residential
    I     Industrial
    RH    Residential High Density
    RL    Residential Low Density
    RP    Residential Low Density Park
    RM    Residential Medium Density
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access
    Grvl  Gravel
    Pave  Paved
Alley: Type of alley access
    Grvl  Gravel
    Pave  Paved
LotShape: General shape of property
    Reg   Regular
    IR1   Slightly irregular
    IR2   Moderately irregular
    IR3   Irregular
Neighborhood: Physical locations within Ames city limits
BldgType: Type of dwelling
    1Fam    Single-family Detached
    2FmCon  Two-family Conversion; originally built as one-family dwelling
    Duplx   Duplex
    TwnhsE  Townhouse End Unit
    TwnhsI  Townhouse Inside Unit
HouseStyle: Style of dwelling
    1Story  One story
    1.5Fin  One and one-half story: 2nd level finished
    1.5Unf  One and one-half story: 2nd level unfinished
    2Story  Two story
    2.5Fin  Two and one-half story: 2nd level finished
    2.5Unf  Two and one-half story: 2nd level unfinished
    SFoyer  Split Foyer
    SLvl    Split Level
OverallQual: Overall material and finish quality (1 very poor – 10 excellent)
OverallCond: Overall condition rating (1 very poor – 10 excellent)
YearBuilt: Original construction date
YearRemodAdd: Remodel date
RoofStyle: Type of roof
    Flat     Flat
    Gable    Gable
    Gambrel  Gambrel (Barn)
    Hip      Hip
    Mansard  Mansard
    Shed     Shed
TotalBsmtSF: Total square feet of basement area
Electrical: Electrical system
    SBrkr  Standard Circuit Breakers & Romex
    FuseA  Fuse Box over 60 AMP and all Romex wiring (Average)
    FuseF  60 AMP Fuse Box and mostly Romex wiring (Fair)
    FuseP  60 AMP Fuse Box and mostly knob & tube wiring (Poor)
    Mix    Mixed
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
GrLivArea: Above grade (ground) living area square feet
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
KitchenAbvGr: Kitchens above grade
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
GarageType: Garage location
    2Types   More than one type of garage
    Attchd   Attached to home
    Basment  Basement Garage
    BuiltIn  Built-In (Garage part of house - typically has room above garage)
    CarPort  Car Port
    Detchd   Detached from home
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
WoodDeckSF: Wood deck area in square feet
PoolArea: Pool area in square feet

SalePrice: the property's sale price in dollars. This is the target variable that you're trying to predict.
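
The attribute list mixes coded categorical attributes (such as Street) with numeric ones (such as
LotArea), and the report is expected to discuss transformation and normalisation where appropriate. The
following Python sketch illustrates two common preparation steps, one-hot encoding of a categorical
attribute and min-max normalisation of a numeric one. It is for illustration only; the values are
made up, the function names are assumptions of this sketch, and whether either step is appropriate
depends on the data mining methods chosen.

```python
def one_hot(values, categories):
    """Encode each value as a 0/1 vector, one position per category."""
    return [[1 if value == category else 0 for category in categories]
            for value in values]


def min_max_scale(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


# Invented example values, NOT taken from the coursework dataset.
street_encoded = one_hot(["Pave", "Grvl", "Pave"], ["Grvl", "Pave"])
lot_area_scaled = min_max_scale([8000, 10000, 12000])
```

When a model is validated on held-out data, the minimum and maximum used for scaling are usually taken
from the training rows only, so that no information from the test rows leaks into training.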