xuebaunion@vip.163.com

3551 Trousdale Rkwy, University Park, Los Angeles, CA

留学生论文指导和课程辅导

无忧GPA：https://www.essaygpa.com

工作时间：全年无休-早上8点到凌晨3点

微信客服：xiaoxionga100

微信客服：ITCS521

Python代写-ECMM422

时间：2021-03-01

Course Assessment 1

This course assessment (CA1) represents 40% of the overall module assessment.

This is an individual exercise and your attention is drawn to the College and University guidelines on collaboration and plagiarism, which are

available from the College website.

Note:

1. do not change the name of this notebook, i.e. the notebook file has to be named: ca1.ipynb

2. do not remove/delete any cell

3. do not add any cell (you can work on a draft notebook and only copy the function implementations here)

4. do not add you name or student code in the notebook or in the file name

Evaluation criteria:

Each question asks for one or more functions to be implemented.

Each question is awarded a number of marks.

A (hidden) unit test is going to evaluate if all desired properties of the required function(s) are met.

If the test passes all the associated marks are awarded, if it fails 0 marks are awarded. The large number of questions allows a fine grading.

Notes:

In the rest of the notebook, the term data matrix refers to a two dimensional numpy array where instances are encoded as rows, e.g. a

data matrix with 100 rows and 4 columns is to be interpreted as a collection of 100 instances each with four features.

When a required function can be implemented directly by a library function it is intended that the candidate should write her own

implementation of the function, e.g. a function to compute the accuracy or the cross validation.

Some questions are just a check-point, i.e. it is for you to see that you are correctly implementing all functions. Since those check-points use

functions that you have already implemented and that have already been marked, those questions are not going to be marked (i.e. they

appear as having marks 0).

In [ ]: %matplotlib inline

import matplotlib.pyplot as plt

import numpy as np

import scipy as sp

# unit test utilities: you can ignore these function

def is_approximately_equal(test,target,eps=1e-2):

return np.mean(np.fabs(np.array(test) - np.array(target)))

def assert_test_equality(test, target):

assert is_approximately_equal(test, target), 'Expected:\n %s \nbut got:\n %s'%(target, test)

Question 1 [marks 6]

a) Make a function data_matrix = make_data_classification(mean, std, n_centres, inner_std, n_samples,

random_seed=42) to create a data matrix according to the following rules:

1. mean is a n-dimensional vector (say [1,1], but the function should allow vectors of any dimension)

2. n_centres is the number of centres (say 3)

3. std is the standard deviation (say 1)

4. the centres are sampled from a Normal distribution with mean mean and standard deviation std

5. from each centre sample n_samples from a Normal distribution with the centre as the mean and standard deviation inner_std so

if mean=[1,1] n_centres=3 and n_samples=10 then the data matrix will be a 30 rows x 2 columns numpy array.

b) Make a function data_matrix, targets = make_data_regression(mean, std, n_centres, inner_std,

n_samples_list, random_seed=42) to create a data matrix and a target vector according to the following rules:

1. the data matrix is constructed in the same way as in make_data_classification

2. the targets are the Euclidean distance between the sample and the centre of the generating Normal distribution

See Question 3 for a graphical example of the expected output.

In [ ]: def make_data_classification(mean, std, n_centres, inner_std, n_samples, random_seed=42):

# YOUR CODE HERE

raise NotImplementedError()

def make_data_regression(mean, std, n_centres, inner_std, n_samples, random_seed=42):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 2 [marks 2]

a) Make a function data_matrix, targets = get_dataset_classification(n_samples, std, inner_std) to create a data

matrix and a target vector for a binary classification problem according to the following rules:

the instances from the positive class are generated according to the same rules provided for make_data_classification ; so are

the instances from the negative class

instances from the positive class have as mean the vector [10,10] and those from the negative class, vector [-10,-10]

the number of centres is fixed to 3

the random seed is fixed to 42

n_samples indicates the total number of instances finally available in the output data_matrix

b) Make a function data_matrix, targets = get_dataset_regression(n_samples, std, inner_std) to create a data

matrix according to the following rules:

the instances are generated according to the same rules provided for make_data_regression

the targets are generated according to the same rules provided for make_data_regression

instances have as mean the vector [10,10]

the number of centres is fixed to 3

the random seed is fixed to 42

n_samples indicates the total number of instances finally available in the output data_matrix

In [ ]: def get_dataset_classification(n_samples, std, inner_std):

# YOUR CODE HERE

raise NotImplementedError()

def get_dataset_regression(n_samples, std, inner_std):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 3 [marks 1]

Make a function plot(X,y) to display the scatter plot of a data matrix of two dimensional instances using the array y to assign the

colour to the instances.

When running

X, y = get_dataset_regression(n_samples=600, std=30, inner_std=5)

plot(X,y)

you should get something like

and when running

X, y = get_dataset_classification(n_samples=600, std=30, inner_std=5)

plot(X,y)

you should get something like

In [ ]: def plot(X,y):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 4 [marks 1]

Make a function classification_error(targets, preds) to compute the fraction of times that the entries in targets do not

agree with the corresponding entries in preds .

Note: do not use library functions to compute the result directly but implement your own version.

In [ ]: def classification_error(targets, preds):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 5 [marks 2]

Make a function regression_error(targets, preds) to compute the mean squared error between targets and preds .

Note: do not use library functions to compute the result directly but implement your own version.

MSE = ( − .

1

n

∑

i=1

n

T

i

P

i

)

2

In [ ]: def regression_error(targets, preds):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 6 [marks 7]

Make a function make_bootstrap(data_matrix, targets) to extract a bootstrapped replicate of an input dataset.

The function should return the following 6 elements (in this order): bootstrap_data_matrix, bootstrap_targets,

bootstrap_sample_ids, oob_data_matrix, oob_targets, oob_samples_ids , where:

1. bootstrap_data_matrix : is a data matrix encoding the bootstrapped replicate of the data matrix

2. bootstrap_targets : is the corresponding bootstrapped replicate of the target vector

3. bootstrap_sample_ids : is an array containing the instance indices of the bootstrapped replicate of the data matrix

4. oob_data_matrix : is a data matrix encoding the out of bag instances

5. oob_targets : is the corresponding out of bag instances of the target vector

6. oob_samples_ids : is an array containing the instance indices of the out of bag instances

In [ ]: def make_bootstrap(data_matrix, targets):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 7 [marks 10]

Consider the following functional blueprints estimator = train(X_train, y_train, param) and test(X_test,

estimator) . A function of type train takes in input a data matrix X_train a target vector y_train and a single value param

(not a list of parameters). A function of type train outputs an object that represent an estimator. A function of type test takes in input a

data matrix X_test the fit object estimator and outputs the predicted targets.

Using this blueprint, write the specialised train and test functions for the following classifiers and regressors (use the function signature

provided in the next cell, e.g. train_ab for training an adaboost classifier):

Classifiers:

a) k-nearest-neighbor: the parameter controls the number of neighbors (you may use KNeighborsClassifier from scikit) [train_knn,

test_knn]

b) adaboost: the parameter controls the maximal depth of the decision tree uses as weak classifier (you may use the

DecisionTreeClassifier from scikit but you should provide your own implementation of the boosting algorithm) [train_ab,

test_ab]

c) random forest: the parameter controls the maximal depth of the tree (you may use the DecisionTreeClassifier from scikit but you

should provide your own implementation of the bagging algorithm) [train_rfc, test_rfc]

Regressors:

d) decision tree: the parameter controls the maximal depth of the tree (you may use the DecisionTreeRegressor from scikit)

[train_dt, test_dt]

e) svm linear: the parameter controls the regularization constant C (you may use SVR from scikit) [train_svm_1, test_svm]

f) svm with a polynomial kernel of degree 2: the parameter controls the regularization constant C (you may use SVR from scikit)

[train_svm_2, test_svm]

g) svm with a polynomial kernel of degree 3: the parameter controls the regularization constant C (you may use SVR from scikit)

[train_svm_3, test_svm]

h) random forest: the parameter controls the maximal depth of the tree (you may use the DecisionTreeRegressor from scikit but you

should provide your own implementation of the bagging algorithm) [train_rf, test_rf]

For the algorithms adaboost and random forest , the size of the ensemble should be fixed to 100.

In [ ]: # classifiers

from sklearn.neighbors import KNeighborsClassifier

def train_knn(X_train, y_train, param):

# YOUR CODE HERE

raise NotImplementedError()

def test_knn(X_test, est):

# YOUR CODE HERE

raise NotImplementedError()

from sklearn.tree import DecisionTreeClassifier

def train_ab(X_train, y_train, param):

# YOUR CODE HERE

raise NotImplementedError()

def test_ab(X_test, models):

# YOUR CODE HERE

raise NotImplementedError()

from sklearn.tree import DecisionTreeClassifier

def train_rfc(X_train, y_train, param):

# YOUR CODE HERE

raise NotImplementedError()

def test_rfc(X_test, models):

# YOUR CODE HERE

raise NotImplementedError()

# regressors

from sklearn.tree import DecisionTreeRegressor

def train_dt(X_train, y_train, param):

# YOUR CODE HERE

raise NotImplementedError()

def test_dt(X_test, est):

# YOUR CODE HERE

raise NotImplementedError()

from sklearn.svm import SVR

def train_svm_1(X_train, y_train, param):

# YOUR CODE HERE

raise NotImplementedError()

def train_svm_2(X_train, y_train, param):

# YOUR CODE HERE

raise NotImplementedError()

def train_svm_3(X_train, y_train, param):

# YOUR CODE HERE

raise NotImplementedError()

#Note: you do not need to specialise the svm test function for each degree

def test_svm(X_test, est):

# YOUR CODE HERE

raise NotImplementedError()

from sklearn.tree import DecisionTreeRegressor

def train_rf(X_train, y_train, param):

# YOUR CODE HERE

raise NotImplementedError()

def test_rf(X_test, models):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 8 [marks 0]

This is just a check-point, i.e. it is for you to see that you are correctly implementing all functions. Since this cell uses functions that you have

already implemented and that have already been marked, this Question is not going to be marked.

Make a dataset using

X, y = get_dataset_classification(n_samples=240, std=30, inner_std=10)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)

and check that the classification error for

k-nearest-neighbor

random forest classifier

adaboost

In [ ]: # Just run the following code, do not modify it

X, y = get_dataset_classification(n_samples=240, std=30, inner_std=10)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)

param=3

e_knn = classification_error(y_test, test_knn(X_test, train_knn(X_train, y_train, param)))

e_rfc = classification_error(y_test, test_rfc(X_test, train_rfc(X_train, y_train, param)))

e_ab = classification_error(y_test, test_ab(X_test, train_ab(X_train, y_train, param)))

print(e_knn, e_rfc, e_ab)

Question 9 [marks 0]

This is just a check-point, i.e. it is for you to see that you are correctly implementing all functions. Since this cell uses functions that you have

already implemented and that have already been marked, this Question is not going to be marked.

Make a dataset using

X, y = get_dataset_regression(n_samples=120, std=30, inner_std=10)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)

and check that the regression error for these regressors

decision tree

svm with polynomial kernel of degree 2

svm with polynomial kernel of degree 3

is approximately comparable.

In [ ]: # Just run the following code, do not modify it

X, y = get_dataset_regression(n_samples=120, std=30, inner_std=10)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)

param=3

e_dt = regression_error(y_test, test_dt(X_test, train_dt(X_train, y_train, param)))

e_svm2 = regression_error(y_test, test_svm(X_test, train_svm_2(X_train, y_train, param)))

e_svm3 = regression_error(y_test, test_svm(X_test, train_svm_3(X_train, y_train, param)))

print(e_dt, e_svm2, e_svm3)

Question 10 [marks 10]

Make a function sizes, train_errors, test_errors = compute_learning_curve(train_func, test_func, param, X,

y, test_size, n_steps, n_repetitions) to compute the train and test errors as mandated in the learning curve approach.

The regressor will be trained via train_func on the problem data_matrix , targets with parameter param . The estimate will be

done averaging a number of replicates equal to n_repetitions , i.e. the code needs to repeat the process n_repetitions times (say

10) and average the error.

Note that a fraction of the data as indicated by test_size (say 0.33 for 30%) is going to be reserved for testing purposes. The remaining

amount of data can be used in the training phase. The learning curve should be computed for an amount of training material that varies from

a minimum of 2 instances up to all the instances available for training.

You should use the function regression_error to compute the error.

Note: do not use library functions (e.g. learning_curve in scikit) to compute the result directly but implement your own version.

In [ ]: def compute_learning_curve(train_func, test_func, param, X, y, test_size, n_steps, n_repetitions):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 11 [marks 1]

Make a function plot_learning_curve(sizes, train_errors, test_errors) to display the train and test error as a function of

the size of the training set.

You should get something like:

In [ ]: def plot_learning_curve(sizes, train_errors, test_errors):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 12 [marks 3]

Make a function estimate_asymptotic_error(sizes, train_errors, test_errors) that returns an estimate of the asymptotic

error, i.e. the error made in the limit of an infinitely large training set.

In [ ]: def estimate_asymptotic_error(sizes, train_errors, test_errors):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 13 [marks 0]

This is just a check-point, i.e. it is for you to see that you are correctly implementing all functions. Since this cell uses functions that you have

already implemented and that have already been marked, this Question is not going to be marked.

When you run:

X, y = get_dataset_regression(n_samples=800, std=30, inner_std=10)

train_func, test_func = train_dt, test_dt

param=5

sizes, train_errors, test_errors = compute_learning_curve(train_func, test_func, param, X, y, te

st_size=.3, n_steps=10, n_repetitions=100)

e = estimate_asymptotic_error(train_errors, test_errors)

print('Asymptotic error: %.1f'%e)

plot_learning_curve(sizes, train_errors, test_errors)

you should get something like

In [ ]: # Just run the following code, do not modify it

X, y = get_dataset_regression(n_samples=800, std=30, inner_std=10)

train_func, test_func = train_dt, test_dt

param=5

sizes, train_errors, test_errors = compute_learning_curve(train_func, test_func, param, X, y, test_size

=.3, n_steps=10, n_repetitions=100)

e = estimate_asymptotic_error(sizes, train_errors, test_errors)

print('Asymptotic error: %.1f'%e)

plot_learning_curve(sizes, train_errors, test_errors)

Question 14 [marks 6]

Make a function bias2, variance = compute_bias_variance(predictions_dict, targets) that takes in input a dictionary

of lists of predictions indexed by the instance index, and the target vector. The function should compute the squared bias component of the

error and the variance components of the error for each instance.

As a toy example consider: predictions_dict={0:[1,1,1], 1:[1,-1], 2:[-1,-1,-1,1]} and targets=[1,1,-1] , that is,

for instance with index 0 there are 3 predictions available [1,1,1] , instead for instance with index 1 there are only 2 predictions available

[1,-1] , etc. In this case, you should get bias2=[0. , 1. , 0.25] and variance=[0. , 1. , 0.75] .

In [ ]: def compute_bias_variance(predictions_dict, targets):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 15 [marks 10]

Make a function bias2, variance = bias_variance_decomposition(train_func, test_func, param, data_matrix,

targets, n_bootstraps) to compute the bias variance decomposition of the error of a regressor on a given problem. The regressor

will be trained via train_func on the problem data_matrix , targets with parameter param . The estimate will be done using a

number of replicates equal to n_bootstraps .

In [ ]: def bias_variance_decomposition(train_func, test_func, param, data_matrix, targets, n_bootstraps):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 16 [marks 2]

Consider the following regression problem (it does not matter that the target is only 1 and -1):

from sklearn.datasets import load_iris

def make_iris_data():

X,y = load_iris(return_X_y=True)

X=X[:,[0,2]]

y[y==2]=0

y[y==0]=-1

return X,y

Estimate the squared bias and variance component for each instance.

Consider as regressor a linear svm and a polynomial svm with degree 3.

What is the class of the instances that have the highest bias error on average?

In [ ]: # Just run the following code, do not modify it

from sklearn.datasets import load_iris

def make_iris_data():

X,y = load_iris(return_X_y=True)

X=X[:,[0,2]]

y[y==2]=0

y[y==0]=-1

return X,y

X,y = make_iris_data()

bias2, variance = bias_variance_decomposition(train_svm_1, test_svm, param=2, data_matrix=X, targets=y,

n_bootstraps=100)

print(np.mean(bias2[y==1]) , np.mean(bias2[y==-1]))

bias2, variance = bias_variance_decomposition(train_svm_3, test_svm, param=2, data_matrix=X, targets=y,

n_bootstraps=100)

print(np.mean(bias2[y==1]) , np.mean(bias2[y==-1]))

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 17 [marks 6]

Make a function bs,vs = compute_bias_variance_decomposition(train_func, test_func, params, data_matrix,

targets, n_bootstraps) to compute the average squared bias error component and the average variance component of the error for

each parameter setting in the vector params . The regressor will be trained via train_func on the problem data_matrix ,

targets with parameter param . The estimate will be done using a number of replicates equal to n_bootstraps . To be clear, the

vector bs contains the average square bias error for each parameter in params and the vector vs contains the average variance error

for each parameter in params .

In [ ]: def compute_bias_variance_decomposition(train_func, test_func, params, data_matrix, targets, n_bootstra

ps):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 18 [marks 1]

Make a function plot_bias_variance_decomposition(train_func, test_func, params, data_matrix, targets,

n_bootstraps, logscale=False) .

You should plot the individual components or the squared bias, the variance and the total error. You should allow the possibility to employ a

logarithmic scale for the horizontal axis via the logscale flag.

You should get something like:

In [ ]: def plot_bias_variance_decomposition(train_func, test_func, params, data_matrix, targets, n_bootstraps,

logscale=False):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 19 [marks 2]

Make a function find_best_param_with_bias_variance_decomposition(train_func, test_func, params,

data_matrix, targets, n_bootstraps) that uses the bias variance decomposition analysis to determine which parameter among

params achieves the smallest estimated predictive error.

In [ ]: def find_best_param_with_bias_variance_decomposition(train_func, test_func, params, data_matrix, target

s, n_bootstraps):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 20 [marks 6]

When you execute the following code

X, y = get_dataset_regression(n_samples=400, std=10, inner_std=7)

params = np.linspace(1,30,30).astype(int)

train_func, test_func = train_dt, test_dt

p = find_best_param_with_bias_variance_decomposition(train_func, test_func, params, data_matrix,

targets, n_bootstraps=60)

print('Best parameter:%s'%p)

plot_bias_variance_decomposition(train_func, test_func, params, data_matrix, targets, n_bootstra

ps=50, logscale=False)

You should get something like:

The next unit tests will run your functions

find_best_param_with_bias_variance_decomposition on an undisclosed dataset using as regressors:

decision tree

svm degree 3

and 3 marks will be awarded for each correct optimal parameter identified.

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 21 [marks 5]

Make a function conf_mtx = confusion_table(targets, preds) to output the confusion matrix as a 2 x 2 Numpy array. Rows

indicate the prediction and columns the target. The cell element with index [0,0] should report the true positive count.

Running the following code:

from sklearn.datasets import load_iris

X,y = load_iris(return_X_y=True)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3)

models = train_knn(X_train, y_train, param=3)

preds = test_knn(X_test, models)

conf_mtx = confusion_table(y_test, preds)

print(conf_mtx)

you should obtain something similar to

[[16. 1.]

[ 0. 28.]]

Note: the exact values can differ in your run

Note: do not use library functions to compute the result directly but implement your own version.

In [ ]: def confusion_table(targets, preds):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 22 [marks 1]

Make a function error_from_confusion_table(confusion_table_func, targets, preds) that takes in input the previous

confusion_table function and returns the error, i.e. the fraction of predictions that do not agree with the targets.

In [ ]: def error_from_confusion_table(confusion_table_func, targets, preds):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 23 [marks 12]

Make a function predictions, out_targets = cross_validation_prediction(train_func, test_func, param,

data_matrix, targets, kfold) that estimates the predictions of a classifier trained via the function train_func with parameter

param on the problem data_matrix, targets using a k-fold cross validation strategy with the number of folds indicated by kfold .

Since the order of the instances associated to the predictions can be different from the original order, the function is required to output also

the corresponding target values in the array out_targets (i.e. the value in position 10 in predictions corresponds to the target value

in position 10 in out_targets )

Note: do not use library functions (such as KFold or StratifiedKFold ) but implement your own version of the cross validation.

In [ ]: def cross_validation_prediction(train_func, test_func, param, data_matrix, targets, kfold):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 24 [marks 5]

Make a function mean_errors = compute_errors_with_crossvalidation(train_func, test_func, params,

data_matrix, targets, kfold, n_repetitions) that returns the estimated average error for each parameter in params . The

classifier is trained via the function train_func with parameters taken from params on the problem data_matrix, targets using

a k-fold cross validation strategy with the number of folds indicated by kfold . The error estimate is repeated a number of times indicated

in n_repetitions . The error should be computed using the function error_from_confusion_table . The output vector

mean_errors has as many entries as there are paramters in params .

Note: do not use library functions (such as cross_val_score ) but implement your own version of the code.

In [ ]: def compute_errors_with_crossvalidation(train_func, test_func, params, data_matrix, targets, kfold, n_r

epetitions):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 25 [marks 2]

Make a function find_best_param_with_crossvalidation(train_func, test_func, params, data_matrix, targets,

kfold, n_repetitions) that uses crossvalidation to determine which parameter among params achieves the smallest estimated

predictive error.

In [ ]: def find_best_param_with_crossvalidation(train_func, test_func, params, data_matrix, targets, kfold, n_

repetitions):

# YOUR CODE HERE

raise NotImplementedError()

In [ ]: # This cell is reserved for the unit tests. Do not consider this cell.

Question 26 [marks 0]

This is just a check-point, i.e. it is for you to see that you are correctly implementing all functions. Since this cell uses functions that you have

already implemented and that have already been marked, this Question is not going to be marked.

You should be able to run the following code:

from sklearn.datasets import load_wine

X,y = load_wine(return_X_y=True)

params = [3,5,7,9,11]

train_func, test_func = train_knn, test_knn

kfold = 5

n_repetitions = 5

best_param = find_best_param_with_crossvalidation(train_func, test_func, params, data_matrix, ta

rgets, kfold, n_repetitions)

print(best_param)

and get a value around 3.

In [ ]: # Just run the following code, do not modify it

from sklearn.datasets import load_wine

data_matrix, targets = load_wine(return_X_y=True)

params = [3,5,7,9,11]

train_func, test_func = train_knn, test_knn

kfold = 5

n_repetitions = 5

best_param = find_best_param_with_crossvalidation(train_func, test_func, params, data_matrix, targets,

kfold, n_repetitions)

print(best_param)