julia代写-FTEC2101/ESTR2520

FTEC2101/ESTR2520 Optimization Methods Spring 2022
Project Specification – Binary Classification
Last Updated: April 25, 2022, Deadline: May 13, 2022, 23:59 (HKT)
Thus far we have learnt a number of optimization methods, ranging from the simplex method for linear
programming, modeling techniques for integer programming, gradient/Newton methods for unconstrained
optimization, KKT conditions, SOCP formulation, etc.. Besides theories, please remember that optimization
methods are practical tools for solving real world problems. The aim of this project is to practice our skill on
the latter aspect. We will make use of the Julia1 environment.
The project’s theme is about practical solutions for binary classification. It can be considered as an extension to
what we have experimented in Computing Lab 2. We will explore additional aspects about binary classification
and more importantly, implement and compare these ideas on a real-world dataset.
Background. Let us recall that the binary classification task aims at learning a linear classifier that distin-
guishes a feature vector as positive (+1) or negative (−1). Let d ∈ N be the dimension of the feature vector,
the (linear) classifier is described by the tuple (w, b) where w ∈ Rd is the direction parameters and b ∈ R is
the bias parameter2, both specifying the linear model. Given (w, b), a feature vector x ∈ Rd is classified into
a label y ∈ {±1} such that
y =
{
+1, if w>x+ b = w1x1 + · · ·+ wdxd + b ≥ 0,
−1, if w>x+ b = w1x1 + · · ·+ wdxd + b < 0.
(1.1)
As an illustration, the following figure shows a case with d = 2 such that the classifier is described by
w = (w1, w2) and the scalar b:
I
"
\ ✗ ✗ ✗
ii.
+
+

1-

\ ✗
y l
l
✗ X '
e Wixitwzlz -112--0
y ✗ ,
i
,
2×1
According to the rule in (1.1), all the feature vectors that lie above the dashed line shall be classified as +1
(blue points); those that lie below the dashed line shall be classified as −1 (red points).
1As an alternative, you are welcomed to use Python with optimization modeling packages supporting SOCP and MI-NLP such
as cvxpy. However, you have to notify the instructor about the choice and the plan on how you wish to accomplish the project
in the latter case on or before May 1, 2022, i.e., two weeks before the deadline. In the project, you are not allowed to use any
pre-built solver for binary classification such as MLJ.jl, ScikitLearn.jl, flux.jl scikit-learn as your final submission (though
you are encouraged to try out these packages as extra work to support your solution).
2Note that this differs slightly from Lecture 16 as we include the bias parameter in the classifier design.
1
FTEC2101/ESTR2520 Project 2
Dataset & Folder Setting We have m ∈ N samples of training data described by {x(i), y(i)}mi=1, where
x(i) ∈ Rd is the ith feature vector and y(i) ∈ {±1} is the associated label. We will use a curated version
of the GiveMeSomeCredits dataset taken from https://www.kaggle.com/c/GiveMeSomeCredit. It includes
customer data such as their debt ratios, credit utilization, age, etc., and the history of defaulting. With these
information, our goal is to learn a classifier that predicts if a customer has any risk of defaulting before the
bank approves loan(s) for him/her.
To prepare for the project, please retrieve the Jupyter notebook template ftec2101-project-template-2022.ipynb
and the zip archive ftec project files.zip from the Blackboard. Place the *.ipynb files in a working direc-
tory, extract the *.zip file and move the *.csv files inside the same working directory. Your working directory
should have the following content:
Notice that:
• ftec-cs-groupi-train.csv – the training dataset for students in group i, i = 1, 2, 3, 4, 5. This is a csv
file that contains 20 samples of customer data to be used in the Compulsory Tasks.
• ftec-cs-groupi-test.csv – the testing dataset for students in group i, i = 1, 2, 3, 4, 5. This is a csv
file that contains 30 samples of customer data to be used in the Compulsory Tasks.
• ftec-cs-full-train.csv – the training dataset that contains 16657 samples of customer data to be
• ftec-cs-full-test.csv – the training dataset that contains 34870 samples of customer data to be used
Lastly, the Jupyter notebook ftec2101 project 2022.ipynb provided detailed descriptions and helper codes
to guide you through the tasks required by this project. Please pay attention to the comments provided inside
the code cells of the Jupyter notebook as well.
The compulsory tasks of this project is divided into two parts. You are required to
(A) answer questions related to the optimization theory and modeling related to binary classification; and
(B) implement the modeled optimization problems on computer and solve them with the GiveMeSomeCredits
dataset.
FTEC2101/ESTR2520 Project 3
Theory and Modeling Denote the training dataset withm samples as {x(i), y(i)}mi=1, where x(i) ∈ Rd
is the ith feature vector and y(i) ∈ {±1} is the associated label. Let λ > 0 be a fixed scalar. The following
optimization problem designs what is known as the hard-margin classifier [Sa, 2022]:
min
w∈Rd,b∈R
1
2
w>w
s.t. y(i)
(
(x(i))>w + b
) ≥ 1, i = 1, . . . ,m. (1.2)
It can be easily shown that (1.2) is a convex optimization problem.
(a) Explain how solving (1.2) finds a classifier (w?, b?) that can correctly distinguish the m training
samples into the +1 or −1 labels. (Hint: focus on the constraints in (1.2) and compare it to (1.1)
as well as the figure that explains it.)
(b) Derive the KKT condition for (1.2). Using the KKT condition, show that the optimal w? is a linear
combination of x(i) for the samples i such that 1 = y(i)((x(i))>w? + b?).
(c) Give an example of training dataset with d = 2 where (1.2) is infeasible. You may describe such
dataset by drawing it on a 2-D plane.
Due to the issue of infeasibility as investigated in part (c) of Task 1, the hard-margin formulation (1.2) is
seldom used in practice. Instead, we consider a soft-margin formulation, which introduces a non-negative
‘auxiliary decision variable’3 ξ(i) to each of the constraint, yielding:
y(i)
(
(x(i))>w + b
) ≥ 1− ξi, ξi ≥ 0, i = 1, . . . ,m. (1.3)
(
(x(i))>w + b
) ≥ 1 for all training samples, i.e., requiring all the training samples are
correctly classified, the above constraint with the auxiliary variable ξi allows for erroneous classification.
The auxiliary variable ξi is interpreted as the amount of error for the ith sample. Notice that ξi > 0 if and
only if the ith training sample is mis-classified for the sample, i.e., y(i)
(
(x(i))>w + b
)
1. Ideally, we should
find (w, b) such that the number of samples i with ξi > 0 is minimized.
Below, you will formulate a mixed integer program (in fact, a mixed-integer non-linear program, MI-NLP)
optimization problem for the soft constraint optimization.
Task 2: Mixed Integer Formulation (5%)
Answer the following question: Based on the soft-margin constraint (1.3) [as well as the max-margin
problem (1.2)], formulate an optimization problem with the following specification:
• The objective is to minimize the total number of auxiliary decision variables ξi with ξi > 0
(i.e., the total number of errors).
• The amount of error is bounded such that
0 ≤ ξi ≤M, i = 1, ...,m,
3Note that it bears some resemblance to the artificial variable technique we learnt in the Big-M simplex method.
FTEC2101/ESTR2520 Project 4
where M is a known/given constant.
• The soft-margin constraints (1.3) are satisfied.
• The directional parameters w ∈ Rd satisfies the following shaping constraint:
w>Σw + c>w ≤ 1,
where Σ ∈ Rd×d is a given symmetric, positive definite matrix, and c ∈ Rd is a given vector.
The MIP is difficult to solve for large-scale problems with m 1. As a remedy, we shall approximate the
objective function in Task 2 using a linear function in {ξi}mi=1. Consider the following task:
(a) Based on the soft-margin constraint (1.3) [and the max-margin problem (1.2)], formulate an opti-
mization problem with the following specifications (notice that unlike in Task 2, we do not impose
the constraint ξi ≤M):
• The objective is to minimize the weighted sum of constraint violation over all training sample:
m∑
i=1
`iξ
(i)
where `i > 0, i = 1, ...,m is a set of given weights.
• The soft-margin constraints (1.3) are satisfied.
• The directional parameters w ∈ Rd satisfies the following shaping constraint:
w>Σw + c>w ≤ 1,
where Σ ∈ Rd×d is a given symmetric, positive definite matrix and c ∈ Rd is a given vector.
(b) Show that the problem in part (a) can be reformulated as an SOCP.
Computational We next focus on solving the optimization problems formulated in Task 2 & 3 on
computers. We structure the tasks into 3 stages: data analysis, optimization, interpretation of results. The
Jupyter notebook provided descriptions and helper codes to guide you through most of the following tasks.
Please pay attention to the comments in the Jupyter notebook as well.
Data Analysis. In the compulsory tasks, we will focus on a small dataset that only consist of m = 20 samples
(stored in ftec-cs-groupi-train.csv). The training dataset contains the information of
”RevolvingUtilizationOfUnsecuredLines”, ”age”, ”DebtRatio”, ”MonthlyIncome”,
”NumberOfOpenCreditLinesAndLoans”, ”NumberRealEstateLoansOrLines”, ”NumberOfDependents”
for m = 20 customers. Notice that the dataset have been preprocessed such that the above features are
normalized with typical values ranging between 0 and 1 (there are some exceptions though).
FTEC2101/ESTR2520 Project 5
Moreover, the training dataset contains the information of whether the customer has a history of “Default”
or not. We treat the latter information as the label yi to train our classifier to decide if the customer is worth
granting the grant or not.
We shall first get ourselves familiar with the dataset. In particular, we shall visualize the dataset to observe
the behavior pattern of a customer with or without a ‘Default’ history.
(a) Inspect the dataset by making a 2-D scatter plots of the 20 samples over the features
‘RevolvingUtilizationOfUnsecuredLines’ and ‘NumberOfOpenCreditLinesAndLoans’.
Mark the data points of ‘Default’ (resp. ‘Non-default’) customers in red (resp. blue). Comment on
the pattern observed for ‘Default’ customers.
(b) Try 2-3 more combinations of pairs of features and make comments on the observations.
Remark: The program template has provided the relevant helper codes for this task, but you may have
to ‘tweak’ the template to examine other pairs of features in part (b).
For part (a), you may observe a similar output to the following:
Classifier Optimization. In the following tasks, you will implement classifier designs based on MI-NLP and
SOCP from Task 2 & 3.
As a first step, we shall design the classifier based on only two features in the data, namely,
‘RevolvingUtilizationOfUnsecuredLines’ and ‘NumberOfOpenCreditLinesAndLoans’. (1.4)
In other words, we consider the classifier design problems [cf. Task 2 & 3] with d = 2 first.
(a) Based on the selected features in (1.4), implement and solve the MI-NLP from Task 2 with the
FTEC2101/ESTR2520 Project 6
following parameters:
M = 100, Σ = 1200
(
1 0
0 1
)
, c =
(
0
0
)
, `i = 1, i = 1, . . . ,m.
You may use the solver Juniper (with Ipopt) in JuMP for tackling the MI-NLP.
(b) Based on the selected features in (1.4), implement and solve the SOCP from Task 3 with the following
parameters:
Σ = 1200
(
1 0
0 1
)
, c =
(
0
0
)
, `i = 1, i = 1, . . . ,m.
You may use the solver ECOS in JuMP for tackling the SOCP.
(c) Make a scatter plot of the training dataset for the features in (1.4) as in Task 4-(a). Then, overlay
the linear classifiers found in part (a) and part (b) on top of this scatter plot. Comment on whether
the classifiers found are reasonable.
Notice that it may take a long time to solve the MI-NLP in Task 5-(a) since an MI-NLP problem is quite
challenging to solve in general. The scatter plot in Task 5-(c) may look like
which indicate the classifiers found by the MIP and SOCP formulations.
Similar to the Computing Lab 2, the performance of a classifier can be evaluated by the error rate when
applied on a certain set of data. For instance, recalling that {x(i), y(i)}mi=1 is the training dataset, the training
error rate for a given classifier (w, b) can be estimated as
Training Error Rate =
1
m
m∑
i=1
1(y(i) 6= yˆ(i)) where yˆ(i) =
{
+1, if w>x(i) + b ≥ 0,
−1, if w>x(i) + b < 0,
i.e., it is the number of errors made by the classifier divided by the number of samples. Notice that 0 ≤
Training Error Rate ≤ 1.
As our aim is to design a classifier that makes prediction on whether a future customer (i.e., not found in the
training dataset) will ‘default’ on his/her loans, it is necessary to evaluate the error rate on a testing dataset
that is unseen by the program. Denote the testing dataset with mtest samples as {x(i)test, y(i)test}mtesti=1 , the testing
FTEC2101/ESTR2520 Project 7
error rate for a classifier (w, b) can be estimated as
Testing Error Rate =
1
mtest
mtest∑
i=1
1(y
(i)
test 6= yˆ(i)) where yˆ(i) =
{
+1, if w>x(i)test + b ≥ 0,
−1, if w>x(i)test + b < 0.
To this end, we consider the following task:
Notice that for our project, the testing dataset is stored in ftec-cs-groupi-test.csv.
(a) Write a function to evaluate the testing/training error rate as defined in the above.
(b) Evaluate and compare the error rate performances for the two classifiers you have found in Task 5.
Remark : Try to make the function for evaluating error in (a) general such that it takes dataset of any
size and features of any dimension. You will have to reuse the same function in Task 7 & 8.
Intuitively, the performance of a classifier can be improve when more features are included. As the last task
in the compulsory section, we shall use all the d = 7 features of the GiveMeSomeCredit dataset to design our
classifier:
(a) Repeat Task 5-(a), 5-(b), but design the classifiers based on all the d = 7 features with the following
parameters:
M = 100, Σ = 1200I7, c = 0, `i = 1, i = 1, . . . ,m.
where I7 is the 7× 7 identity matrix, and 0 is a 7× 1 all zero vector. By inspecting the optimized
classifier from the SOCP, describe which features will be important for deciding if a customer is
likely to default. Comment on whether this observation is reasonable or not.
(b) Evaluate and compare the error rate performances for the two classifiers you have found. Is the
training error rate better than the classifier based on only d = 2 features in Task 5? What about
the testing error rate? Comment on your observations.
(c) Change parameters such as adjusting the shaping constraint, e.g., change Σ to 110I7 and observe its
effects on the error rate, etc. Comment on your observations.
Unlike the compulsory tasks, the goal of this competitive task is to implement your own (approximate) solver
to the binary classifier problem, without relying on JuMP and its optimizers such as ECOS, Juniper, etc.. To
motivate, we observe that while optimization packages such as JuMP are convenient to use, they are often
limited by scalability to large-scale problems when the number of training samples m 1 and/or the feature
is high dimensional d 1. The task would require considerably more advanced coding skills.
Our objectives are to find the linear classifier with the best training/testing error and the lowest computation
cost using a custom-made iterative algorithm such as projected gradient descent. With the shaping constraint
defined as:
Σ = I7, c = 0.
FTEC2101/ESTR2520 Project 8
Under the above specialization, the SOCP problem in Task 3-(a) will be equivalent to:
min
w∈Rd,b∈R
f(w, b) :=
1
m
m∑
i=1
`i max
{
0, 1− y(i)((x(i))>w + b)}
s.t. w>w ≤ 1,
(1.5)
Note that the above problem admits a constraint set for (w, b) ∈ Rd+1 as
X = {w ∈ Rd, b ∈ R : w>w ≤ 1}. (1.6)
An obstacle in applying gradient-based algorithms on (1.5) lies on the non-differentiability of f(w; b). Taking
this viewpoint, we shall consider some differentiable function f̂(w, b) that approximates f(w, b) (to be specified
later) and solve the constrained optimization:
min
w∈Rd,b∈R
f̂(w, b) s.t. (w, b) ∈ X. (1.7)
It should be easy to apply iterative algorithms such as projected gradient descent (PGD), conditional gradient,
etc. (see Appendix A), to tackle the above problem.
Among other options, a potential candidate for approximating f(w; b) is to consider the logistic loss4 as we
have done in Lecture 16 / Computing Lab 2:
f̂(w, b) =
1
m
m∑
i=1
`i log
(
1 + exp
(
−y(i)((x(i))>w + b)
))
. (1.8)
Minimizing the above function leads to a solution (w, b) such that y(i)((x(i))>w+b) is large for all i = 1, ...,m,
which makes a desired feature for a good classifier.
Task 8: Customized Solver for Classifier Optimization (30%)
Using the dataset with the training data from m = 16657 samples in ftec-cs-full-train.csv. Imple-
ment an iterative algorithm to tackle (1.7). Notice that due to (1.6) the optimized classifier must satisfy
w>w ≤ 1. Moreover, you are required to initialize your algorithm by w0 = 0, b0 = 0.
Suggestion: As the first attempt, you may consider the projected gradient descent (PGD) method using
a constant step size with f̂(w, b) selected as the logistic function (1.8). You may need to tune the weights
{`i}mi=1 to obtain a better result while paying attention to the ‘size’ (i.e., ‖x(i)‖) for each of the feature
vector. Other candidates are also suggested in Appendix A.
Assessment You will receive a maximum of 10% for correctly implementing at least one numerical algo-
rithm (e.g., projected gradient), together with
1. plotting the trajectory of the algorithm is show that the objective value in (1.7) to be decreasing to a
certain value asymptotically and providing comments on the algorithm(s) implemented,
2. providing derivations and justifications on why the implemented algorithm is used.
Besides achieving a low training/testing error rates, a good binary classifier should also be efficiently com-
putable. As we are dealing with a relatively large dataset, the complexity of your algorithm will be measured
by the number of accesses to the training dataset5. Every time when you evaluate the value of the
4Notice that the logistic objective function can be interpreted alternatively as a formulation for the classifier design task with
the maximum-likelihood (ML) principle from statistics. This is beyond the scope of this project specification.
5This is sometimes known as the sample complexity.
FTEC2101/ESTR2520 Project 9
objective function f̂(w, b), or the gradient of f̂(w, b), will incur one access to the training dataset.
For instance when you are applying a projected GD method with constant step size for 1000 iterations. As
the algorithm evaluates the gradient of f̂(w, b) only once per iteration, the total number of accesses to the
training dataset will be equal to 1000.
Secondly, another desired property is that the binary classifier should not be biased towards a particular
feature. In particular, here we wish to design a classifier that does not decide on the risk of default for
customer entirely on the feature ‘RevolvingUtilizationOfUnsecuredLines’. Suppose this feature is denoted via
the first coordinate of xt, i.e., [xt]1, in your program, we can measure the biasedness towards it by:
% weight of classifier on ‘RevolvingUtilizationOfUnsecuredLines’ =
|w1|∑7
i=1 |wi|
Finally, the remaining 20% of your marks in this task will be calculated according to the following formula:
Score = 5%× exp
(
10 ·max{0.25,Lowest Training Error Rate} − 10 ·max{0.25,Your Training Error Rate}
)
+ 5%× exp
(
10 ·max{0.3,Lowest Testing Error Rate} − 10 ·max{0.3,Your Testing Rate}
)
+ 5%× max{0.25,Lowest % weight of classifier on ‘RevolvingUtilizationOfUnsecuredLines’}
max{0.25,Your % weight of classifier on ‘RevolvingUtilizationOfUnsecuredLines’}
+ 5%× max{75,Lowest total number of accesses to training dataset}
max{75,Your total number of accesses to training dataset} . (1.9)
The lowest errors (resp. lowest no. of accesses to training dataset, etc.) are the highest one (resp. lowest one)
among the class of FTEC21016. Some tips for improving the performance of your design can be found in
Appendix A.
If you have tried more than one algorithm and/or more than one type of approximation, you have to select
only one set of classifier parameters (w, b) for consideration of the competition via (1.9). Please indicate
which solution is selected in your report and include that in the submission of your program files. That said,
you are encouraged to try more of these different variants and include them in the project report. Moreover,
observe the following rules:
• The algorithms you designed are not allowed to directly optimize on the testing set data. In other
words, your iterative algorithm should not rely on any data in ftec-cs-full-test.csv as you are not
supposed to see the ‘future customer’ data while training a classifier for loan applicants. Your score in
(1.9) will be set to zero if we detect such ‘cheating’ behavior. However, you can evaluate the test error
performance of your solution as many time as you like before you found the best setting.
• Your selected algorithm for the competition must be deterministic. In other words, you may not use
stochastic algorithm such as stochastic gradient descent (SGD) for the competition. That being said,
you are still encouraged to try such algorithms as an additional task which may be counted towards the
‘innovation’ section.
If you have questions about the rules, please do not hesitate to consult the instructor at htwai@cuhk.edu.hk
or the TA or raise your questions on Piazza.
6The scores for ESTR2520 students will be calculated by taking the best error performance across both ESTR2520 and
FTEC2101 students.
FTEC2101/ESTR2520 Project 10
1.3 Report (20%)
You are required to compile a project report with answers to the questions posed in Task 1 to Task 8. You
may structure the report according to the order of the tasks, for instance:
1. Background and Introduction — In this section, you can briefly introduce the problem, e.g., explain-
ing the goal of classifier design, discussing the role of optimization methods in tackling the problem.
2. Model and Theory — In this section, you can discuss how the classifier design problem is modeled as
optimization problems. More specifically,
– You shall begin by discussing the hard-margin formulation (1.2) and then answer Task 1.
– Next, you can describe the optimization models with soft-margin and then answer Tasks 2 & 3.
3. Experiments — In this section, you describe the experiments conducted to test your formulation, i.e.,
properties regarding the dataset, e.g., the size of the dataset, the range of the values for the different
features.
– Then, you can describe the experiments for each of the 2 formulations by answering Tasks 5 & 6.
4. Competitive Task — In this section, you describe the custom solver you built to solve the large-scale
classifier design problem, i.e.,
– You shall first describe the approximation technique as laid out in the discussion of (1.5), (1.8).
– Then, you shall describe the iterative algorithm you have derived in Task 8.
– Apply the iterative algorithm on the complete training dataset and show the objective value vs. iter-
ation number. Discuss whether the algorithm converges and report on the performance of the designed
classifier.
5. Conclusions — In this section, you shall summarize the findings in the project, and discuss various
aspects that can be improved with the formulation, etc..
paper and scan it to your Word/PDF report as a figure. For Task 4 to 8, please include all the plots and
dataset of the selected solution, % weight on ‘RevolvingUtilizationOfUnsecuredLines’ (we will also run your
code to verify the values reported).
The program code has to be submitted separately. However, you are welcomed to use excerpts from the
program codes in the report if you find it helpful for explaining your solution concepts.
Lastly, you are welcomed to use online resources when preparing the project. However, you must give proper
references for sources that are not your original creation.
Assessment Here is a breakdown of the assessment metric for the report writing component.
• (10%) Report Writing : A project report shall be readable to a person with knowledge in optimization
(e.g., your classmates in FTEC2101/ESTR2520). Make sure that your report is written with clarity, and
more importantly, using your own language!
FTEC2101/ESTR2520 Project 11
• (10%) Innovation: You can get innovation marks if you include extra experiments, presentations,
etc.. that are relevant to the project (with sufficient explanations); see Appendix A for some recom-
mendations.
1.4 Submission
This is an individual project. While discussions regarding how to solve the problems is encouraged, students
should answer the problems on their own (just like your HWs). The deadline of submission is May 13 (Friday), 2022,
23:59 (HKT). Please submit with the following content to Blackboard:
• Your Project Report in PDF format.
• Your Program Codes [either in Jupyter notebook (.ipynb), or Julia code (.jl)].
In addition, the project report shall be submitted to VeriGuide for plagiarism check.
8. Of course, you are more than welcomed to propose and explore new ideas (but still, make sure that they
are mathematically feasible)!
1. The design of the weights {`i}mi=1 maybe crucial to the performance of your classifier. For instance, we
observe that the gradient of f̂(w, b) can be written as
∇f̂(w, b) = 1
m
m∑
i=1
`i · ∇ log
(
1 + e−y
(i)((x(i))>w+b)
)
You may observe that the latter term ∇ log(1 + e−y(i)((x(i))>w+b)) is dependent on x(i). Especially, try to
imagine if ‖x(i)‖ is extremely large7? To this regard, a possibly remedy is to appropriately design `i to
‘downplay’ the effect of those samples with large ‖x(i)‖ on the classifier design.
(Mind that such a pre-processing step to design `i will cost you one access to the training dataset).
2. As an alternative approach, we recall from Task 2 that the optimization’s goal is to minimize the
number of training samples that violate the constraints in (1.3). We may replace the logistic loss function
by the sigmoid loss function, given by:
f̂(w, b) =
1
m
m∑
i=1
`i
1
1 + ey
(i)((x(i))>w+b)
which puts less ‘penalty’ on solutions with y(i)((x(i))>w + b) 1 than the logistics loss. Note that the
above function is not convex in (w, b), yet the projected gradient descent method continues to work, e.g., as
shown in [Beck, 2017].
7The sample with extremely large ‖x(i)‖ is sometimes called an outlier.
FTEC2101/ESTR2520 Project 12
3. To reduce the weight of a particular feature in the classifier, you may adopt a penalization approach
(similar to the Big-M method in principle). For example, if you want to suppress the value of w1, you may
consider the following penalized objective function:
f˜(w, b) = f̂(w, b) + α |w1|2
for some α > 0 as an adjustable penalty parameter.
• Algorithm Aspect – For Task 8, with the constraint structure given, the recommended algorithm is projected
gradient descent (PGD) method, which are described as follows. For solving a general optimization:
min
w∈Rd,b∈R
f̂(w, b) s.t. (w, b) ∈ X. (1.10)
With a slight abuse of notation, we denote x ≡ (w, b) and the PGD method can be described as
Input: x(0) ∈ X, constant step size γ > 0, max. it-
eration number Kmax.
For k = 0, ...,Kmax
x(k+1) = ProjX
{
x(k) − γ∇f̂(x(k))}
End For
PGD Method
For any (w, b) ∈ Rd × R, we have
(wP , bP ) = ProjX(w, b)
with
wP =
1
max{1, ‖w‖} w, bP = b.
Besides, you are more than welcomed to explore the use of other iterative algorithms, e.g., conditional
gradient, back tracking line search, etc., for solving the optimization at hand.
• Presentation Aspect – Here are some suggestions for discussions which you may include to enrich the content
1. To understand and compare between the hinge loss, the logistics loss, and the sigmoid loss, it may be
helpful to plot the following 1-D functions:
Hinge(x) = max{0, 1− x}, Logistics(x) = log(1 + e−x), Sigmoid(x) = 1
1 + ex
.
and discuss on how they put penalty on situation when x is large.
2. The performance of binary classification can be measured more precisely by considering the precision rate
and recall rate. For a given labeled dataset (e.g., the testing dataset), they are defined respectively as:
Precision =
No. of samples with y
(i)
test = yˆ
(i) = 1
No. of samples with yˆ(i) = 1
, Recall =
No. of samples with y
(i)
test = yˆ
(i) = 1
No. of samples with y
(i)
test = 1
Note that y
(i)
test = yˆ
(i) = 1 means that the ith sample is a customer who defaulted and the classifier correctly
identifies him/her as a customer who will default.
Lastly, tips for implementing the MI-NLP, SOCP, etc.. in the compulsory tasks have been included with the
the template program. Be reminded that it is not necessary to follow all the tips therein.
References
A. Beck. First-order methods in optimization. SIAM, 2017.
C. D. Sa. Support Vector Machine, Cornell University CS4780, 2022. URL https://www.cs.cornell.edu/
courses/cs4780/2022sp/notes/LectureNotes13.html.