Date: 2023-09-14
COMP4318/5318 Assignment 1 Overview
This assignment involves two core parts. Part 1 focuses on a classification task: identifying cases of
heart disease in the Cardiovascular Disease Dataset. Part 2 focuses on a regression problem on the
Concrete Dataset.
Your tasks involve applying pre-processing techniques and writing functions to implement and
evaluate models for each part. After providing these functions, you will perform grid search
procedures to tune the best hyperparameter combinations for some of these models.
Each part has been broken down into 4 sections as follows:
1. Pre-processing techniques
2. Classification/Regression methods
3. Ensemble methods
4. Hyperparameter tuning
Academic integrity
While the University is aware that the vast majority of students and staff act ethically and honestly, it
is opposed to and will not tolerate academic integrity breaches and will treat all allegations seriously.
Further information on academic integrity, and the resources available to all students can be found
on the academic integrity pages on the current students website:
https://sydney.edu.au/students/academic-integrity.html.
Part 1: Classification
In Part 1 of this assignment, you will be exploring different classification methods on a modified
version of a real dataset - the Cardiovascular Disease Dataset. This dataset has a balanced selection of
the target class - Cardiovascular disease. The features of the Cardiovascular Disease Dataset are as
follows:
Age - positive int (days)
Height - positive int (cm)
Weight - positive float (kg)
Systolic blood pressure - positive int
Diastolic blood pressure - positive int
Gender - categorical [F, M]
Cholesterol - categorical [normal, above normal, well above normal]
Glucose - categorical [normal, above normal, well above normal]
Smoking - categorical [No, Yes]
Alcohol intake - categorical [No, Yes]
Physical activity - categorical [No, Yes]
Cardiovascular disease - categorical [No, Yes]
Your tasks for this part involve applying pre-processing techniques and writing classification
functions that can be applied to this dataset using stratified 10-fold cross-validation.
After providing these functions, your next task is to design and test two functions on a range of
different hyperparameters and evaluate their performance, using stratified 10-fold cross-validation
for bagging and a validation set for Adaboost. You should use the cvKFold provided for all functions
when performing cross-validation.
Although it is not always necessary to wrap your code in functions when using Jupyter Notebooks,
this allows us to test your implementations. Wherever relevant, pass a random_state argument as
instructed below to control for randomness between runs and ensure your results are reproducible.
Further instructions can be found in the scaffold notebook.
Random state clarification:
There are multiple ways to ensure randomness is controlled between runs; however, please note
that the instructions differ between Part 1 and Part 2.
In Part 1, whenever there is a need to control for random events in a model, a random_state=0
argument should be passed to the constructor of the model inside your function, and it will not be
passed again during testing.
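As a sketch of this convention (the function name and signature here are assumptions, not the scaffold's actual definitions), random_state=0 is fixed inside the function body:

```python
# Hypothetical sketch of the Part 1 convention; dtClassifier and its
# signature are assumptions, not the scaffold's actual definitions.
from sklearn.tree import DecisionTreeClassifier

def dtClassifier(X, y, **kwargs):
    # random_state=0 is set here, inside the constructor; the hidden
    # tests will not pass it again when calling this function.
    model = DecisionTreeClassifier(random_state=0, **kwargs)
    return model
```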
Part 1 extra instructions
If you fail a hidden test or running your code results in an error, ensure you have handled other
possible datasets within the specified parameters and your functions can handle being called
with different combinations of hyperparameters. You do not need to pass all hidden tests before
moving on to the next section.
Data:
The filename to be loaded is 'cardio_diseases.csv'. You can find it in the file browser.
The first function load_data should handle setting any invalid values to np.nan and converting
categorical strings to numbers as outlined in the scaffold. The second function process_data
should process these missing values as described in the scaffold. Note the instructions in the
scaffold regarding testing your code on other datasets.
You can assume any valid categorical attributes in other datasets will take on values [Yes, No],
[Male, Female], and [Normal, Above normal, Well above normal] like the given dataset.
However, they may occur in different columns, have different names, or not be present at all.
Ensure your output array has a dtype of float64.
Set the variables currently set to None by calling your functions. The values of these variables
will be checked along with your function to ensure you can use them in the rest of the
notebook.
All non-negative values (including 0) should be considered as valid numerical data
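As a hedged sketch of how these two functions might look (the category-to-number mapping and the mean-imputation strategy are assumptions; the scaffold's actual requirements take precedence):

```python
# Hypothetical sketch of the two loading functions; the exact mapping and
# the imputation strategy are guesses based on the description above.
import numpy as np
import pandas as pd

# Assumed numeric encoding for the categorical strings described above.
CATEGORY_MAP = {
    "No": 0.0, "Yes": 1.0,
    "Female": 0.0, "Male": 1.0, "F": 0.0, "M": 1.0,
    "Normal": 0.0, "Above normal": 1.0, "Well above normal": 2.0,
}

def load_data(filename):
    df = pd.read_csv(filename)
    # Convert categorical strings to numbers; anything non-numeric that
    # remains is coerced to np.nan.
    df = df.replace(CATEGORY_MAP)
    X = df.apply(pd.to_numeric, errors="coerce").to_numpy(dtype=np.float64)
    # All non-negative values (including 0) are valid; negatives are invalid.
    X[X < 0] = np.nan
    return X

def process_data(X):
    # One plausible choice: impute missing values with the column mean.
    col_means = np.nanmean(X, axis=0)
    idx = np.where(np.isnan(X))
    X = X.copy()
    X[idx] = np.take(col_means, idx[1])
    return X
```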
Evaluating methods with cross validation:
For the cross validation functions, use the skeletons provided, so that algorithm
hyperparameters such as number of neighbors and power of minkowski distance can
optionally be passed in as arguments and accessed as a dictionary using the **kwargs syntax.
Note you also can pass arguments as a dictionary to functions (such as sklearn constructors)
using the **kwargs syntax (you may need to search for documentation on **kwargs if
unfamiliar).
Your functions should create an instance of the model with the correct hyperparameters set
and run 10 fold cross validation (using the fold set up above and your preprocessed data X and
y) to evaluate it. The function should return the model instance with the correct
hyperparameters and the average cross validation accuracy obtained (but please do not round
this return value).
Once de�ned, you can treat cvKFold as a global variable and use it in all cells below.
Where the scaffold asks you to test your functions, please do so in the same cell and produce
output as below.
Follow the specified formats for the outputs, rounded to 2 decimal places.
For KNN, the expected output format is the same as for the other methods:
Cross validation score: x.xx
You can assume there will not be any cases where attributes shared by both bagging and
logistic regression are passed.
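The pattern described above might be sketched as follows; the function name kNNClassifier and the exact return convention are assumptions based on the text:

```python
# A minimal sketch of the cross-validation pattern described above; the
# function name and return convention are assumptions, not scaffold code.
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

def kNNClassifier(X, y, **kwargs):
    # **kwargs forwards hyperparameters such as n_neighbors or p (the
    # power of the minkowski distance) straight into the constructor.
    model = KNeighborsClassifier(**kwargs)
    scores = cross_val_score(model, X, y, cv=cvKFold)
    return model, scores.mean()  # unrounded; round only when printing
```

Usage in the testing cell might then look like `model, acc = kNNClassifier(X, y, n_neighbors=5, p=2)` followed by `print(f"Cross validation score: {acc:.2f}")`.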
Grid search:
The train_val_test_split function should stratify the training/validation/test sets.
You can ignore this comment
TODO: uncomment this code to create the initial train test split
X_train_all, X_test, y_train_all, y_test = train_test_split(X, y, random_state=0)
Please read the scaffold instructions carefully for the Adaboost grid search. You are not
performing a cross validation grid search, but a grid search using a separate validation set. We
covered this procedure briefly in the lectures. You calculate a validation score rather than
performing cross validation at each step.
The adaBoostGrid function returns the best model found, the best hyperparameter
combination, the best validation set score, and the test set score using this best model.
For the Adaboost grid searching, you should use LinearSVC instead of SVC, or it will timeout.
If you are timing out in Part 1, it is likely due to too many hyperparameters being tested for the
Adaboost grid search.
The param_grid should be passed into the adaBoostGrid function.
The adaBoostGrid function should call the train_val_test_split function to allow the hidden dataset
to be tested.
For the Adaboost search, please pass random_state=0 as part of the constructor for this model,
rather than as part of your hyperparameter tuning grid.
Your function should return the model with the best hyperparameter combination trained on
all of the training data, along with the best parameters, best validation set score, and test set
score of the best model.
You do not need to include any code in the final cell containing "test_adaboost_grid". It will also
test your code from the previous cell.
Your program should be able to check and pass any accepted hyperparameters to the relevant
function.
All used hyperparameters should be passed in the param_grid, even if they are not being
searched over, and should be passed as the function uses it (i.e. 'C' not 'base_estimator__C'):
e.g. param_grid = {'algorithm':['SAMME'], 'dual':['auto'], 'C':[...]}
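A rough sketch of this validation-set grid search, under several assumptions (the split sizes, which keys belong to AdaBoost versus LinearSVC, and retraining the winner on train + validation as "all of the training data" are all guesses, not scaffold requirements):

```python
# Hypothetical sketch of the validation-set grid search described above.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.svm import LinearSVC

def train_val_test_split(X, y):
    # Stratified split into train/validation/test; the exact sizes and
    # the double-split approach are assumptions.
    X_train_all, X_test, y_train_all, y_test = train_test_split(
        X, y, stratify=y, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_all, y_train_all, stratify=y_train_all, random_state=0)
    return X_train, X_val, X_test, y_train, y_val, y_test

# Keys assumed to belong to AdaBoost itself; everything else (e.g. 'C',
# 'dual') is routed to the LinearSVC base estimator.
ADABOOST_KEYS = {"n_estimators", "learning_rate", "algorithm"}

def adaBoostGrid(X, y, param_grid):
    X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(X, y)
    best_model, best_params, best_val = None, None, -1.0
    for params in ParameterGrid(param_grid):
        ada = {k: v for k, v in params.items() if k in ADABOOST_KEYS}
        svc = {k: v for k, v in params.items() if k not in ADABOOST_KEYS}
        model = AdaBoostClassifier(LinearSVC(**svc), random_state=0, **ada)
        model.fit(X_train, y_train)
        score = model.score(X_val, y_val)  # validation score, not CV
        if score > best_val:
            best_model, best_params, best_val = model, params, score
    # Retrain the winner on all of the training data (train + validation).
    best_model.fit(np.concatenate([X_train, X_val]),
                   np.concatenate([y_train, y_val]))
    return best_model, best_params, best_val, best_model.score(X_test, y_test)
```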
--------------------------------------------
Details on the four tasks of this part are as follows:
1. Pre-processing techniques:
Replacing missing values
Min-max normalisation
2. Classification methods:
K-Nearest Neighbours
Naive Bayes
Decision Trees
Support Vector Machine
3. Ensemble methods:
Bagging
Adaboost
4. Hyperparameter tuning:
Choose appropriate parameters
Test and return the highest-performing parameters for each method
IMPORTANT: Do not remove the ### comments in the scaffold or you will be unable to run the tests.
Do not rename your functions and use the same variable names when they are prescribed in the
instructions.
During marking, the Notebook.ipynb file will be run cell by cell in order, except for cells containing
### SKIP.
Note on the dataset:
This is a modified version of this dataset. Further details on the original Cardiovascular Disease
Dataset can be found at:
https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset
https://www.kaggle.com/datasets/aiaiaidavid/cardio-data-dv13032020
Part 1 Questions
Question 1
What would be the top 3 attributes selected with lasso regularization for the support vector machine
classifier you created in the notebook, trained on the entire preprocessed data?
Age
Height
Weight
Ap_High
Ap_Low
Gender
Cholesterol
Glucose
Smoke
Alcohol
Physical_Activity
Question 2
Which attributes of the heart disease dataset used are not appropriate to use in Gaussian Naive
Bayes?
Age
Height
Weight
Ap_High
Ap_Low
Gender
Cholesterol
Glucose
Smoke
Alcohol
Physical_Activity
Question 3
For this assignment, we are using a balanced version of the Cardiovascular Disease Dataset with
equal numbers in each class. However, the original dataset is unbalanced with a ratio of 9:1 of no
heart disease vs. heart disease.
If we were to use this unbalanced dataset, what would be the best two ways to evaluate our
classifiers for trying to identify all who have heart disease?
Accuracy
Precision
Recall
F1-Score
Time taken
Root mean-squared error (RMSE)
Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Log-loss/Cross-entropy loss
Part 2: Regression
In this section, you will be performing regression on a modified version of a dataset on Concrete
Compressive Strengths.
The task is to predict the concrete compressive strength (MPa) for a given mixture of a specific age
(days).
There are 8 numerical attributes representing the components of the mixture and its age:
Cement (component 1) -- kg in a m^3 mixture
Blast Furnace Slag (component 2) -- kg in a m^3 mixture
Variable Fly Ash (component 3) -- kg in a m^3 mixture
Water (component 4) -- kg in a m^3 mixture
Superplasticizer (component 5) -- kg in a m^3 mixture
Coarse Aggregate (component 6) -- kg in a m^3 mixture
Fine Aggregate (component 7) -- kg in a m^3 mixture
Age -- measured in days
Similar to the classification section, you will need to write functions that implement, train, and
evaluate different regression models, as well as functions to perform hyperparameter tuning.
Although it is not always necessary to wrap your code in such functions when using Jupyter
Notebooks, this allows us to test your implementations. Wherever relevant, pass a random_state
argument of 0 to control for randomness between runs and ensure your results are reproducible.
Further instructions can be found in the scaffold notebook.
-------------------------
Random state clarifications:
There are multiple ways to ensure randomness is controlled between runs; however, please note
that the instructions differ between Part 1 and Part 2.
In Part 2, you should follow the scaffold instructions and pass random_state=0 as a
hyperparameter/argument to your function when you test it, rather than setting it inside your
function. During testing, a random_state=0 hyperparameter/argument will be passed to your
functions wherever required. When testing random forest, please pass a random_state=0
hyperparameter/argument to your function as for these other methods, although this was not
explicitly stated in the scaffold.
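As a sketch of the Part 2 convention (the function name rfRegressor and its signature are assumptions), random_state arrives from the caller rather than being fixed inside the function:

```python
# Hypothetical sketch of the Part 2 convention; rfRegressor and its
# signature are assumptions, not the scaffold's actual definitions.
from sklearn.ensemble import RandomForestRegressor

def rfRegressor(X, y, **kwargs):
    # Whatever arrives in kwargs (including random_state=0 during
    # testing) is forwarded into the constructor unchanged.
    model = RandomForestRegressor(**kwargs)
    return model

# Called during testing like: rfRegressor(X, y, n_estimators=100, random_state=0)
```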
For the gradient boosting grid search, please pass random_state=0 as part of the constructor for the
model passed into GridSearchCV, rather than as part of your hyperparameter tuning grid. As in the
Week 4 lab, the constructor of the model being tuned is passed directly into GridSearchCV, and you
can pass a random_state=0 argument in that constructor for this task.
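For example, the corresponding location can be sketched as follows (the param_grid values here are placeholders for illustration, not suggested search values):

```python
# Sketch of where random_state=0 belongs for the gradient boosting grid
# search: inside the constructor passed to GridSearchCV, not in param_grid.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold

cvKFold = KFold(n_splits=10, shuffle=True, random_state=0)

# Placeholder hyperparameter values, not the ones to submit.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),  # random_state set here
    param_grid,
    cv=cvKFold,
    return_train_score=True,  # as the scaffold requests
)
# grid.fit(X_train, y_train)  # refit=True (default) retrains the best model
# grid.best_estimator_        # best model trained on the entire training set
```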
Part 2 extra instructions
If you fail a hidden test or running your code results in an error, ensure you have handled other
possible datasets within the specified parameters and your functions can handle being called
with different combinations of hyperparameters.
Data:
The filename to be loaded is 'concrete.csv'. You can find it in the file browser.
You can assume missing values will be encoded in the same way in the other datasets we will
test with.
Evaluating methods with cross validation:
Your functions need to return an instance of the model with the correct hyperparameters set,
along with the mean cross validation score obtained. Please do not round this returned CV score
value; only round it when printing the output.
Use the function skeletons provided, so that algorithm hyperparameters such as number of
neighbors and power of minkowski distance can optionally be passed in as arguments and
accessed as a dictionary using the **kwargs syntax. Note you also can pass arguments as a
dictionary to functions (such as sklearn constructors) using the **kwargs syntax (you may need
to search for documentation on **kwargs if unfamiliar).
We may test your functions with different hyperparameters than those we ask you to use in your
notebook, so make sure to account for this.
Once de�ned, you can treat cvKFold as a global variable and use it in all cells below.
Grid search:
As part of the output of your cross validation grid search, you need to return the fitted
GridSearchCV object. When we say "ensure the best model returned as part of the
CVGridSearch object has been trained on the entire training set", you can refer to the Week 4 lab
for guidance on what is meant and how to access this model for testing.
Please set return_train_score=True in the relevant part for this task.
-----------------------------------------
Details on the four tasks of this part are as follows:
1. Pre-processing techniques:
Replacing missing values
Power transformation scaling
2. Regression methods:
Weighted KNN regression
Linear regression and regularised versions
Decision trees
3. Ensemble methods:
Random forest
4. Hyperparameter tuning
Cross-validation grid search to tune gradient boosting hyperparameters
IMPORTANT: Do not remove the ### comments in the scaffold or you will be unable to run the tests.
Do not rename your functions and use the same variable names when they are prescribed in the
instructions.
During marking, the Notebook.ipynb file will be run cell by cell in order, except for cells containing
### SKIP.
Note on the dataset:
Sources:
Original Owner and Donor: Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua
University, Hsin Chu, Taiwan 30067, R.O.C. Email: icyeh@chu.edu.tw, Tel: 886-3-5186511
Acknowledgements, Copyright Information, and Availability:
NOTE: Reuse of this database is unlimited with retention of copyright notice for Prof. I-Cheng Yeh and the
following published paper:
I-Cheng Yeh, "Modeling of strength of high performance concrete using artificial neural networks,"
Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998)
Part 2 Questions
Question 1
Why might we have utilised PowerTransformer as opposed to MinMaxScaler for this task?
Some of the models require feature values outside of the 0-1 range (eg. negative inputs).
It is possible for MinMaxScaler to compress most of the feature values into a very small
range.
MinMaxScaler is slow and PowerTransformer is quicker.
Question 2
Which of the following metrics penalises outliers more strongly?
Mean absolute error
Mean squared error
Question 3
In the previous notebook (eg. in a cell containing ### SKIP), train a lasso model on the entire
preprocessed dataset with alpha=1 and random_state=0 (keep other hyperparameters at their default
values). Recall that lasso regularisation can act as a form of feature selection. Which features are
selected by this trained model?
Cement
Blast Furnace Slag
Variable Fly Ash
Water
Superplasticizer
Coarse Aggregate
Fine Aggregate
Age