COMP9417 - Machine Learning
Homework 1: Cleaning, Fitting, and Optimizing
Introduction
In this homework, we will explore key steps in preparing and applying machine learning models, with a focus on logistic regression.
• In the first question, you will wrangle and clean a dataset, ensuring it is properly formatted and ready
for modeling.
• In the second question, you will take a deeper look at logistic regression, learning how to fit the model
and tune its hyperparameters using cross-validation.
• Finally, you will implement gradient descent to solve logistic regression, reinforcing your understand-
ing of optimization techniques in machine learning.
By the end of this assignment, you will have hands-on experience with data preprocessing, model fitting,
and optimization, all essential skills for building effective machine learning models.
Points Allocation
There are a total of 30 marks.
• Question 1 a)–h): 1 mark each
• Question 2 a): 3 marks
• Question 2 b): 4 marks
• Question 2 c): 5 marks
• Question 2 d): 4 marks
• Question 2 e): 3 marks
• Question 2 f): 3 marks
What to Submit
• A single PDF file which contains solutions to each question. For each question, provide your solution in the form of text and requested plots. For some questions you will be asked to provide screenshots of the code used to generate your answer — only include these when they are explicitly requested.
• .py file(s) containing all code you used for the project, which should be provided in a separate .zip
file. This code must match the code provided in the report.
• You may be deducted points for not following these instructions.
• You may be deducted points for poorly presented/formatted work. Please be neat and make your
solutions clear. Start each question on a new page if necessary.
• You cannot submit a Jupyter notebook; this will receive a mark of zero. This does not stop you from developing your code in a notebook and then copying it into a .py file, or from using a tool such as nbconvert or similar.
• We will set up a Moodle forum for questions about this homework. Please read the existing questions
before posting new questions. Please do some basic research online before posting questions. Please
only post clarification questions. Any questions deemed to be fishing for answers will be ignored
and/or deleted.
• Please check Moodle announcements for any updates to this spec. It is your responsibility to check
for announcements about the spec.
• Please complete your homework on your own, do not discuss your solution with other people in the
course. General discussion of the problems is fine, but you must write out your own solution and
acknowledge if you discussed any of the problems in your submission (including their name(s) and
zID).
• As usual, we monitor all online forums such as Chegg, StackExchange, etc. Posting homework questions on these sites is equivalent to plagiarism and will result in a case of academic misconduct.
• You may not use SymPy or any other symbolic programming toolkits to answer the derivation ques-
tions. This will result in an automatic grade of zero for the relevant question. You must do the
derivations manually.
When and Where to Submit
• Due date: Week 4, Thursday March 13th, 2025 by 5pm. Please note that the forum will not be actively
monitored on weekends.
• Late submissions will incur a penalty of 5% per day from the maximum achievable grade. For example, if you achieve a grade of 80/100 but submit 3 days late, then your final grade will be 80 − 3 × 5 = 65. Submissions that are more than 5 days late will receive a mark of zero.
• Submission must be made on Moodle, no exceptions.
In this question we will work with the data heart.csv. The data contains information about a set of 100
patients. The goal is to use this information to predict whether or not a patient has heart disease.
Question 1. Data Wrangling
The data in its current form is not ready for modeling. We will need to wrangle (clean) the data. This
is outlined in the following tasks. Throughout this question, you may only use Python. For each sub-question, provide commentary (if needed) along with screenshots of the code used. Please also provide a copy of the code in your solutions.py file.
(a) Create a variable X containing only the features, and a variable y containing the target (Heart Disease). Remove the Last Checkup feature.
(b) For Age, some values are negative. You are informed that this is a data entry error and negatives
should be replaced with their positive versions. That is, −x should be replaced with x.
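For illustration, a minimal sketch of this step (assuming X is a pandas DataFrame with an 'Age' column; the toy values below are made up):

```python
import pandas as pd

# Toy stand-in for X; the real values come from heart.csv
X = pd.DataFrame({"Age": [-45, 30, -12]})

# Replace each negative age -x with its positive version x
X["Age"] = X["Age"].abs()
```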
(c) For Gender and Smoker, the variables have been coded in inconsistent ways. For example, Female gender is encoded both as 'Female' and as 'F'. Write code to make these codings consistent, then use categorical encoding: for Gender, map (Male/M, Female/F, Unknown) to (0, 1, 2); for Smoker, map (No/N, Yes/Y, NaN) to (0, 1, 2).
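One possible sketch of this step (the toy values are assumptions; adapt to the actual codings in heart.csv):

```python
import pandas as pd

# Toy stand-in showing the mixed codings described above
X = pd.DataFrame({"Gender": ["Female", "F", "Male", "M", None],
                  "Smoker": ["No", "N", "Yes", "Y", None]})

# Map both codings of each category to the same integer;
# anything unmapped (Unknown/NaN) becomes 2
gender_map = {"Male": 0, "M": 0, "Female": 1, "F": 1}
smoker_map = {"No": 0, "N": 0, "Yes": 1, "Y": 1}
X["Gender"] = X["Gender"].map(gender_map).fillna(2).astype(int)
X["Smoker"] = X["Smoker"].map(smoker_map).fillna(2).astype(int)
```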
(d) Blood pressure is given in the form systolic/diastolic. Write code to create two variables, systolic
and diastolic. Remove the original blood pressure variable.
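A minimal sketch of the split (the column name and toy values are assumptions):

```python
import pandas as pd

# Toy stand-in; real entries look like '120/80' (systolic/diastolic)
X = pd.DataFrame({"Blood Pressure": ["120/80", "135/90"]})

# Split into two integer columns and drop the original variable
bp = X["Blood Pressure"].str.split("/", expand=True).astype(int)
X["Systolic"], X["Diastolic"] = bp[0], bp[1]
X = X.drop(columns=["Blood Pressure"])
```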
(e) Using sklearn.model_selection.train_test_split, split the data into training and test sets. Set the test_size parameter to 0.3, and the random_state to 2 for reproducibility.
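A sketch with the required parameters (toy arrays stand in for the cleaned data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for the cleaned X and y
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.3 and random_state=2, as required above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2)
```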
(f) Now, note that some values in ‘Age’ in your test data are missing. We will manually impute values
according to the following rule: If a Male (Female) is missing their age, set it to be the median of
all other Male (Female) patients. Note that you should NOT use test data for this step, so your
medians should be computed based on training set data. Be careful not to include missing values
when calculating your median.
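One way to sketch this rule (toy frames are made up; Gender here uses the 0/1/2 encoding from part (c)):

```python
import numpy as np
import pandas as pd

# Toy train/test frames standing in for the real splits
train = pd.DataFrame({"Gender": [0, 0, 1, 1, 1],
                      "Age": [40.0, 50.0, 30.0, np.nan, 34.0]})
test = pd.DataFrame({"Gender": [0, 1], "Age": [np.nan, np.nan]})

# Per-gender medians from the TRAINING data only; median() skips NaNs
medians = train.groupby("Gender")["Age"].median()
for df in (train, test):
    df["Age"] = df["Age"].fillna(df["Gender"].map(medians))
```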
(g) Scale the columns 'Age', 'Height_feet', 'Weight_kg', 'Cholesterol', 'Systolic', 'Diastolic' using a min-max normalizer. This means that for each feature, you should replace x with

(x − min) / (max − min),

where min and max are the minimum and maximum values of that feature column, respectively. Make sure to do this separately for the train and test data.
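A sketch of the normalizer (toy columns; the real code loops over all listed features):

```python
import pandas as pd

# Toy train/test columns standing in for the real splits
train = pd.DataFrame({"Age": [20.0, 40.0, 60.0]})
test = pd.DataFrame({"Age": [30.0, 50.0]})

# Scale train and test separately, each with its own min and max,
# as the instruction above requires
for df in (train, test):
    for col in ["Age"]:
        lo, hi = df[col].min(), df[col].max()
        df[col] = (df[col] - lo) / (hi - lo)
```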
(h) Plot a histogram of your target variable (from your training data). You should notice that a large
portion of the target value is clustered around zero. Do you think linear regression is a reasonable
model for this data? Create a new target variable by quantizing the original target variable. You can
do this by setting values below a certain threshold (say 0.1) to be 0 and those above the threshold
to be 1.
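The quantization step can be sketched as (toy target values are made up):

```python
import numpy as np

# Toy continuous target standing in for the training target
y_train = np.array([0.02, 0.05, 0.3, 0.8, 0.07])

# Values below the threshold become 0, the rest become 1
threshold = 0.1
y_train_q = (y_train >= threshold).astype(int)
```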
Question 2. Regularized Logistic Regression
Recall that Regularized Logistic Regression is a regression model used when the response variable is
binary valued. Instead of using mean squared error loss as in standard regression problems, we instead
minimize the log-loss, also referred to as the cross entropy loss. For an intercept β0 ∈ R, parameter
vector β = (β1, . . . , βp)T ∈ Rp, target yi ∈ {0, 1}, and feature vector xi = (xi1, xi2, . . . , xip)T ∈ Rp for
i = 1, . . . , n, the (ℓ2-regularized) log-loss that we will work with is:
L(\beta_0, \beta) = \text{penalty}(\beta) + \frac{\lambda}{n} \sum_{i=1}^{n} \left[ y_i \ln\left(\frac{1}{\sigma(\beta_0 + \beta^T x_i)}\right) + (1 - y_i) \ln\left(\frac{1}{1 - \sigma(\beta_0 + \beta^T x_i)}\right) \right],  (1)
where σ(z) = (1 + e^{−z})^{−1} is the logistic sigmoid, and λ is a hyper-parameter that controls the amount of regularization. The penalty term is a regularizer; for example, we could take penalty(β) = (1/2)∥β∥₂². Note that you are provided with an implementation of this loss in helper.py.
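For intuition, here is a NumPy sketch of (1) with the example penalty (1/2)∥β∥₂²; the helper.py version remains the reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_log_loss(beta0, beta, X, y, lam):
    # Loss (1) with penalty(beta) = 0.5 * ||beta||_2^2;
    # note y*ln(1/s) + (1-y)*ln(1/(1-s)) is the usual cross entropy
    p = sigmoid(beta0 + X @ beta)
    data_term = -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()
    return 0.5 * np.dot(beta, beta) + lam * data_term
```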
(a) Consider the sklearn logistic regression implementation (section 1.1.11), which claims to mini-
mize the following objective:
\hat{w}, \hat{c} = \arg\min_{w, c} \left\{ \text{penalty}(w) + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-\tilde{y}_i (w^T x_i + c)\right)\right) \right\}.  (2)
It turns out that this objective is identical to our objective above (when the same penalty function is
used), but only after re-coding the binary labels to be in {−1, 1} instead of {0, 1}. That is, ỹi ∈ {−1, 1}, whereas yi ∈ {0, 1}. Argue rigorously that the two objectives are identical, in that they give the same solutions (β̂0 = ĉ and β̂ = ŵ). Further, describe the role of C in the objective: how does it compare to the standard ridge parameter λ that you have seen in class?
What to submit: some commentary/your working.
(b) Create a grid of 100 C values using the code np.logspace(-4, 4, 100). For each C, fit a lo-
gistic regression model (using the LogisticRegression class in sklearn) on the training data.
Plot a series showing the train and test log-losses against C. Be sure to use predict_proba to generate predictions from your fitted models to plug into the log-loss. Also, use ℓ2 regularization,
and the lbfgs solver when fitting your models. Discuss the shape of the two loss curves. How
would you pick C based on these plots? State your choice of C.
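The loop for this part might be sketched as follows (synthetic data from make_classification stands in for the wrangled heart.csv splits):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

Cs = np.logspace(-4, 4, 100)
train_losses, test_losses = [], []
for C in Cs:
    clf = LogisticRegression(C=C, penalty="l2", solver="lbfgs")
    clf.fit(X_train, y_train)
    # predict_proba returns probabilities, which log_loss expects
    train_losses.append(log_loss(y_train, clf.predict_proba(X_train)))
    test_losses.append(log_loss(y_test, clf.predict_proba(X_test)))
# (plot train_losses and test_losses against Cs, e.g. on a log-x axis)
```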
(c) In this part, we will take a closer look at choosing the hyperparameter C. Specifically, we will perform cross-validation from scratch. (Do not use existing cross-validation implementations here; doing so will result in a mark of zero.)
Create a grid of C values as before. For each value of C in your grid, perform 5-fold cross vali-
dation (i.e. split the train data into 5 folds, fit logistic regression (using the settings from before)
with the choice of C on 4 of those folds, and record the log-loss on the 5th, repeating the process
5 times.) For this question, we will take the first fold to be the first N/5 rows of the training data,
the second fold to be the next N/5 rows, etc, where N denotes the number of observations in the
training data.
To display the results, we will produce a plot: the x-axis should reflect the choice of C values
and the y-axis will be the log-loss. For each C, plot a box-plot over the 5 CV scores. Report the
value of C that gives you the best CV performance in terms of log-loss. Re-fit the model with this
chosen C, and report both train and test accuracy using this model. Note that we do not need to use
the y˜ (y˜i ∈ {−1, 1}) coding here (the sklearn implementation is able to handle different coding
schemes automatically) so no transformations are needed before applying logistic regression to the
provided data. What to submit: a single plot, train and test accuracy of your final model, a screen shot of
your code for this section, a copy of your python code in solutions.py
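The contiguous fold layout described above can be sketched as follows (illustrative only; the CV loop itself is left to you):

```python
import numpy as np

# Fold j is rows [j*N//5, (j+1)*N//5) of the training data
N = 20  # toy size; the real N is the number of training rows
indices = np.arange(N)
folds = [indices[j * N // 5:(j + 1) * N // 5] for j in range(5)]

# Example: train on folds 1-4, validate on fold 0
train_idx = np.concatenate([folds[j] for j in range(5) if j != 0])
val_idx = folds[0]
```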
(d) In this part we will compare our results in the previous section to the sklearn implementation of
gridsearch, namely, the GridSearchCV class. My initial code for this section looked like:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': Cs}
grid_lr = GridSearchCV(estimator=LogisticRegression(penalty='l2',
                                                    solver='lbfgs'),
                       param_grid=param_grid,
                       cv=5)
grid_lr.fit(X_train, y_train_q)
However, this gave me a very different answer from the result obtained by hand. Provide two reasons why this is the case. Then, if it is possible, re-run the code with changes that give results consistent with those computed by hand; if not, explain why. It may help to read through the documentation. What to submit: some commentary, a screen shot of your code for this
section, a copy of your python code in solutions.py
(e) Suppose that you were going to solve (1) using gradient descent and chose to update each coor-
dinate individually. Derive gradient descent updates for each of the components β0, β1, . . . , βp for
step size η and regularization parameter λ. That is, derive explicit expressions for the terms † in
the following:
\beta_0^{(k)} = \beta_0^{(k-1)} - \eta \times \dagger
\beta_1^{(k)} = \beta_1^{(k-1)} - \eta \times \dagger
\beta_2^{(k)} = \beta_2^{(k-1)} - \eta \times \dagger
  \vdots
\beta_p^{(k)} = \beta_p^{(k-1)} - \eta \times \dagger
Make your expressions as simple as possible, and be sure to include all your working. What to submit: your coordinate-level GD updates along with any working.
(f) For the non-intercept components β1, . . . , βp, re-write the gradient descent updates of the previous
question in vector form, i.e. derive an explicit expression for the term † in the following:
\beta^{(k)} = \beta^{(k-1)} - \eta \times \dagger
Your expression should only be in terms of β0, β, xi and yi. Next, let γ = [β0, βᵀ]ᵀ be the (p + 1)-dimensional vector that combines the intercept with the coefficient vector β, and write down the update

\gamma^{(k)} = \gamma^{(k-1)} - \eta \times \dagger.
Note: This final expression will be our vectorized implementation of gradient descent. The point of the above exercises is to be careful about the differences between the intercept and non-intercept parameters; doing GD on the coordinates is extremely inefficient in practice. What to submit: your vectorized GD updates along with any working.
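The shape of the vectorized update can be sketched as follows; grad_L here is a deliberate placeholder (the gradient of a simple quadratic, not the log-loss gradient you are asked to derive):

```python
import numpy as np

def grad_L(gamma, X, y, lam):
    # Placeholder gradient: substitute your derived expression here.
    # For this sketch we use L(gamma) = 0.5*||gamma||^2, whose
    # gradient is gamma and whose minimizer is the zero vector.
    return gamma

def gradient_descent(gamma0, X, y, lam, eta=0.1, iters=100):
    gamma = gamma0.copy()
    for _ in range(iters):
        # gamma^(k) = gamma^(k-1) - eta * gradient
        gamma = gamma - eta * grad_L(gamma, X, y, lam)
    return gamma
```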