
FTEC2101/ESTR2520 Optimization Methods Spring 2022

Project Specification – Binary Classification

Last Updated: April 25, 2022, Deadline: May 13, 2022, 23:59 (HKT)

Thus far we have learnt a number of optimization methods, ranging from the simplex method for linear programming, modeling techniques for integer programming, and gradient/Newton methods for unconstrained optimization, to KKT conditions, SOCP formulations, etc. Besides the theory, please remember that optimization methods are practical tools for solving real-world problems. The aim of this project is to practice our skills on the latter aspect. We will make use of the Julia¹ environment.

The project’s theme is practical solutions for binary classification. It can be considered an extension of what we experimented with in Computing Lab 2. We will explore additional aspects of binary classification and, more importantly, implement and compare these ideas on a real-world dataset.

Background. Let us recall that the binary classification task aims at learning a linear classifier that distinguishes a feature vector as positive (+1) or negative (−1). Let $d \in \mathbb{N}$ be the dimension of the feature vector. The (linear) classifier is described by the tuple $(w, b)$, where $w \in \mathbb{R}^d$ is the direction parameter and $b \in \mathbb{R}$ is the bias parameter², both specifying the linear model. Given $(w, b)$, a feature vector $x \in \mathbb{R}^d$ is classified into a label $y \in \{\pm 1\}$ such that

$$
y = \begin{cases} +1, & \text{if } w^\top x + b = w_1 x_1 + \cdots + w_d x_d + b \ge 0, \\ -1, & \text{if } w^\top x + b = w_1 x_1 + \cdots + w_d x_d + b < 0. \end{cases} \tag{1.1}
$$

As an illustration, the following figure shows a case with d = 2 such that the classifier is described by

w = (w1, w2) and the scalar b:

[Figure: a 2-D illustration (d = 2) of the linear classifier. The dashed line $w_1 x_1 + w_2 x_2 + b = 0$ separates the plane; points above it (‘+’, blue) are classified as +1 and points below it (‘✗’, red) are classified as −1.]

According to the rule in (1.1), all the feature vectors that lie above the dashed line shall be classified as +1

(blue points); those that lie below the dashed line shall be classified as −1 (red points).
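As a quick illustration, rule (1.1) can be sketched in a few lines of Python (which the spec permits as an alternative environment); the function name below is ours, not part of the project template:

```python
def classify(w, b, x):
    """Apply rule (1.1): return the sign of w'x + b, mapping the boundary to +1."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Toy example with d = 2: the classifier w = (1, -1), b = 0.5.
w, b = [1.0, -1.0], 0.5
print(classify(w, b, [2.0, 0.0]))   # w'x + b = 2.5 >= 0, so +1
print(classify(w, b, [0.0, 2.0]))   # w'x + b = -1.5 < 0, so -1
```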

¹As an alternative, you are welcome to use Python with optimization modeling packages supporting SOCP and MI-NLP such as cvxpy. However, in that case you have to notify the instructor about your choice and your plan for accomplishing the project on or before May 1, 2022, i.e., two weeks before the deadline. In this project, you are not allowed to use any pre-built solver for binary classification such as MLJ.jl, ScikitLearn.jl, Flux.jl, or scikit-learn in your final submission (though you are encouraged to try out these packages as extra work to support your solution).

²Note that this differs slightly from Lecture 16 as we include the bias parameter in the classifier design.


Dataset & Folder Setting. We have $m \in \mathbb{N}$ samples of training data described by $\{x^{(i)}, y^{(i)}\}_{i=1}^m$, where $x^{(i)} \in \mathbb{R}^d$ is the $i$th feature vector and $y^{(i)} \in \{\pm 1\}$ is the associated label. We will use a curated version of the GiveMeSomeCredit dataset taken from https://www.kaggle.com/c/GiveMeSomeCredit. It includes customer data such as debt ratios, credit utilization, age, etc., and the history of defaulting. With this information, our goal is to learn a classifier that predicts whether a customer is at risk of defaulting before the bank approves a loan for him/her.

To prepare for the project, please retrieve the Jupyter notebook template ftec2101-project-template-2022.ipynb

and the zip archive ftec project files.zip from the Blackboard. Place the *.ipynb files in a working direc-

tory, extract the *.zip file and move the *.csv files inside the same working directory. Your working directory

should have the following content:

Notice that:

• ftec-cs-groupi-train.csv – the training dataset for students in group i, i = 1, 2, 3, 4, 5. This is a csv

file that contains 20 samples of customer data to be used in the Compulsory Tasks.

• ftec-cs-groupi-test.csv – the testing dataset for students in group i, i = 1, 2, 3, 4, 5. This is a csv

file that contains 30 samples of customer data to be used in the Compulsory Tasks.

• ftec-cs-full-train.csv – the training dataset that contains 16657 samples of customer data to be

used in the Competitive Tasks.

• ftec-cs-full-test.csv – the testing dataset that contains 34870 samples of customer data to be used in the Competitive Tasks.

Lastly, the Jupyter notebook ftec2101 project 2022.ipynb provides detailed descriptions and helper codes

to guide you through the tasks required by this project. Please pay attention to the comments provided inside

the code cells of the Jupyter notebook as well.

1.1 Compulsory Tasks (50%)

The compulsory tasks of this project are divided into two parts. You are required to

(A) answer questions related to the optimization theory and modeling related to binary classification; and

(B) implement the modeled optimization problems on a computer and solve them with the GiveMeSomeCredit dataset.


Theory and Modeling. Denote the training dataset with $m$ samples as $\{x^{(i)}, y^{(i)}\}_{i=1}^m$, where $x^{(i)} \in \mathbb{R}^d$ is the $i$th feature vector and $y^{(i)} \in \{\pm 1\}$ is the associated label. Let $\lambda > 0$ be a fixed scalar. The following optimization problem designs what is known as the hard-margin classifier [Sa, 2022]:

$$
\min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \ \frac{1}{2} w^\top w \quad \text{s.t.} \quad y^{(i)} \big( (x^{(i)})^\top w + b \big) \ge 1, \quad i = 1, \dots, m. \tag{1.2}
$$

It can be easily shown that (1.2) is a convex optimization problem.

Task 1: Hard-margin Formulation (7.5%)

Answer the following questions:

(a) Explain how solving (1.2) finds a classifier $(w^\star, b^\star)$ that correctly distinguishes the $m$ training samples into the +1 or −1 labels. (Hint: focus on the constraints in (1.2) and compare them to (1.1) as well as the figure that explains it.)

(b) Derive the KKT conditions for (1.2). Using the KKT conditions, show that the optimal $w^\star$ is a linear combination of the $x^{(i)}$ for the samples $i$ such that $y^{(i)}\big((x^{(i)})^\top w^\star + b^\star\big) = 1$.

(c) Give an example of a training dataset with $d = 2$ for which (1.2) is infeasible. You may describe such a dataset by drawing it on a 2-D plane.

Due to the issue of infeasibility investigated in part (c) of Task 1, the hard-margin formulation (1.2) is seldom used in practice. Instead, we consider a soft-margin formulation, which introduces a non-negative ‘auxiliary decision variable’³ $\xi_i$ for each of the constraints, yielding:

$$
y^{(i)} \big( (x^{(i)})^\top w + b \big) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, m. \tag{1.3}
$$

Instead of forcing $y^{(i)}\big((x^{(i)})^\top w + b\big) \ge 1$ for all training samples, i.e., requiring that all the training samples are correctly classified, the above constraint with the auxiliary variable $\xi_i$ allows for erroneous classification. The auxiliary variable $\xi_i$ is interpreted as the amount of error for the $i$th sample. Notice that $\xi_i > 0$ is needed for the $i$th training sample if and only if $y^{(i)}\big((x^{(i)})^\top w + b\big) < 1$. Ideally, we should find $(w, b)$ such that the number of samples $i$ with $\xi_i > 0$ is minimized.

Below, you will formulate a mixed-integer program (in fact, a mixed-integer non-linear program, MI-NLP) for the soft-margin formulation.

Task 2: Mixed Integer Formulation (5%)

Answer the following question: Based on the soft-margin constraint (1.3) [as well as the max-margin

problem (1.2)], formulate an optimization problem with the following specification:

• The objective is to minimize the total number of auxiliary decision variables ξi with ξi > 0

(i.e., the total number of errors).

• The amount of error is bounded such that

0 ≤ ξi ≤M, i = 1, ...,m,

³Note that it bears some resemblance to the artificial variable technique we learnt in the Big-M simplex method.


where M is a known/given constant.

• The soft-margin constraints (1.3) are satisfied.

• The direction parameter $w \in \mathbb{R}^d$ satisfies the following shaping constraint:

$$ w^\top \Sigma w + c^\top w \le 1, $$

where $\Sigma \in \mathbb{R}^{d \times d}$ is a given symmetric, positive definite matrix, and $c \in \mathbb{R}^d$ is a given vector.

The MIP is difficult to solve for large-scale problems with $m \gg 1$. As a remedy, we shall approximate the objective function in Task 2 by a linear function in $\{\xi_i\}_{i=1}^m$. Consider the following task:

Task 3: SOCP Formulation (5%)

Answer the following question:

(a) Based on the soft-margin constraint (1.3) [and the max-margin problem (1.2)], formulate an opti-

mization problem with the following specifications (notice that unlike in Task 2, we do not impose

the constraint ξi ≤M):

• The objective is to minimize the weighted sum of constraint violations over all training samples:

$$ \sum_{i=1}^m \ell_i \xi_i, $$

where $\ell_i > 0$, $i = 1, \dots, m$, is a set of given weights.

• The soft-margin constraints (1.3) are satisfied.

• The direction parameter $w \in \mathbb{R}^d$ satisfies the following shaping constraint:

$$ w^\top \Sigma w + c^\top w \le 1, $$

where $\Sigma \in \mathbb{R}^{d \times d}$ is a given symmetric, positive definite matrix and $c \in \mathbb{R}^d$ is a given vector.

(b) Show that the problem in part (a) can be reformulated as an SOCP.

Computational. We next focus on solving the optimization problems formulated in Tasks 2 & 3 on computers. We structure the tasks into three stages: data analysis, optimization, and interpretation of results. The Jupyter notebook provides descriptions and helper codes to guide you through most of the following tasks. Please pay attention to the comments in the Jupyter notebook as well.

Data Analysis. In the compulsory tasks, we will focus on a small dataset that consists of only m = 20 samples (stored in ftec-cs-groupi-train.csv). The training dataset contains the information of

”RevolvingUtilizationOfUnsecuredLines”, ”age”, ”DebtRatio”, ”MonthlyIncome”,

”NumberOfOpenCreditLinesAndLoans”, ”NumberRealEstateLoansOrLines”, ”NumberOfDependents”

for m = 20 customers. Notice that the dataset has been preprocessed such that the above features are normalized, with typical values ranging between 0 and 1 (there are some exceptions though).


Moreover, the training dataset contains the information of whether the customer has a history of “Default” or not. We treat the latter information as the label $y^{(i)}$ to train our classifier to decide whether the customer is worth granting a loan or not.

We shall first get ourselves familiar with the dataset. In particular, we shall visualize the dataset to observe

the behavior pattern of a customer with or without a ‘Default’ history.

Task 4: Warmup Exercise (5%)

(a) Inspect the dataset by making a 2-D scatter plot of the 20 samples over the features ‘RevolvingUtilizationOfUnsecuredLines’ and ‘NumberOfOpenCreditLinesAndLoans’.

Mark the data points of ‘Default’ (resp. ‘Non-default’) customers in red (resp. blue). Comment on

the pattern observed for ‘Default’ customers.

(b) Try 2-3 more combinations of pairs of features and make comments on the observations.

Remark: The program template has provided the relevant helper codes for this task, but you may have

to ‘tweak’ the template to examine other pairs of features in part (b).

For part (a), you may observe a similar output to the following:

Classifier Optimization. In the following tasks, you will implement classifier designs based on MI-NLP and

SOCP from Task 2 & 3.

As a first step, we shall design the classifier based on only two features in the data, namely,

‘RevolvingUtilizationOfUnsecuredLines’ and ‘NumberOfOpenCreditLinesAndLoans’. (1.4)

In other words, we consider the classifier design problems [cf. Task 2 & 3] with d = 2 first.

Task 5: Optimization-based Formulation (17.5%)

(a) Based on the selected features in (1.4), implement and solve the MI-NLP from Task 2 with the following parameters:

$$ M = 100, \quad \Sigma = \frac{1}{200} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad c = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \ell_i = 1, \ i = 1, \dots, m. $$

You may use the solver Juniper (with Ipopt) in JuMP for tackling the MI-NLP.

(b) Based on the selected features in (1.4), implement and solve the SOCP from Task 3 with the following parameters:

$$ \Sigma = \frac{1}{200} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad c = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \ell_i = 1, \ i = 1, \dots, m. $$

You may use the solver ECOS in JuMP for tackling the SOCP.

(c) Make a scatter plot of the training dataset for the features in (1.4) as in Task 4-(a). Then, overlay

the linear classifiers found in part (a) and part (b) on top of this scatter plot. Comment on whether

the classifiers found are reasonable.

Notice that it may take a long time to solve the MI-NLP in Task 5-(a), since an MI-NLP problem is quite challenging to solve in general. The scatter plot in Task 5-(c) should indicate the classifiers found by the MIP and SOCP formulations.

Similar to Computing Lab 2, the performance of a classifier can be evaluated by its error rate when applied on a certain set of data. For instance, recalling that $\{x^{(i)}, y^{(i)}\}_{i=1}^m$ is the training dataset, the training error rate for a given classifier $(w, b)$ can be estimated as

$$ \text{Training Error Rate} = \frac{1}{m} \sum_{i=1}^m \mathbb{1}\big( y^{(i)} \ne \hat{y}^{(i)} \big) \quad \text{where} \quad \hat{y}^{(i)} = \begin{cases} +1, & \text{if } w^\top x^{(i)} + b \ge 0, \\ -1, & \text{if } w^\top x^{(i)} + b < 0, \end{cases} $$

i.e., it is the number of errors made by the classifier divided by the number of samples. Notice that $0 \le \text{Training Error Rate} \le 1$.
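The error rate above can be sketched in Python (which the spec permits as an alternative environment); the function below and its data layout are illustrative, not part of the template:

```python
def error_rate(w, b, X, y):
    """Fraction of samples where sign(w'x + b) disagrees with the label y in {+1, -1}."""
    errors = 0
    for x_i, y_i in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, x_i)) + b
        y_hat = 1 if score >= 0 else -1
        errors += (y_hat != y_i)
    return errors / len(y)

# Toy dataset: four points separated by the line x1 + x2 = 0.
X = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
y = [1, 1, -1, -1]
print(error_rate([1.0, 1.0], 0.0, X, y))   # 0.0: all four points are classified correctly
```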

As our aim is to design a classifier that makes predictions on whether a future customer (i.e., one not found in the training dataset) will ‘default’ on his/her loans, it is necessary to evaluate the error rate on a testing dataset that is unseen by the program. Denoting the testing dataset with $m_{\text{test}}$ samples as $\{x_{\text{test}}^{(i)}, y_{\text{test}}^{(i)}\}_{i=1}^{m_{\text{test}}}$, the testing error rate for a classifier $(w, b)$ can be estimated as

$$ \text{Testing Error Rate} = \frac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} \mathbb{1}\big( y_{\text{test}}^{(i)} \ne \hat{y}^{(i)} \big) \quad \text{where} \quad \hat{y}^{(i)} = \begin{cases} +1, & \text{if } w^\top x_{\text{test}}^{(i)} + b \ge 0, \\ -1, & \text{if } w^\top x_{\text{test}}^{(i)} + b < 0. \end{cases} $$

To this end, we consider the following task:

Task 6: Error Performance (5%)

Notice that for our project, the testing dataset is stored in ftec-cs-groupi-test.csv.

(a) Write a function to evaluate the testing/training error rate as defined above.

(b) Evaluate and compare the error rate performances for the two classifiers you have found in Task 5.

Remark: Try to make the error-evaluation function in (a) general, such that it takes a dataset of any size and features of any dimension. You will have to reuse the same function in Tasks 7 & 8.

Intuitively, the performance of a classifier can improve when more features are included. As the last task in the compulsory section, we shall use all d = 7 features of the GiveMeSomeCredit dataset to design our classifier:

Task 7: Full Classifier (5%)

(a) Repeat Task 5-(a) and 5-(b), but design the classifiers based on all d = 7 features with the following parameters:

$$ M = 100, \quad \Sigma = \frac{1}{200} I_7, \quad c = \mathbf{0}, \quad \ell_i = 1, \ i = 1, \dots, m, $$

where $I_7$ is the $7 \times 7$ identity matrix and $\mathbf{0}$ is the $7 \times 1$ all-zero vector. By inspecting the optimized classifier from the SOCP, describe which features are important for deciding if a customer is likely to default. Comment on whether this observation is reasonable or not.

(b) Evaluate and compare the error rate performances for the two classifiers you have found. Is the training error rate better than that of the classifier based on only d = 2 features in Task 5? What about the testing error rate? Comment on your observations.

(c) Change the parameters, e.g., adjust the shaping constraint by changing $\Sigma$ to $\frac{1}{10} I_7$, and observe the effects on the error rates. Comment on your observations.

1.2 Competitive Tasks (30%)

Unlike the compulsory tasks, the goal of this competitive task is to implement your own (approximate) solver for the binary classifier problem, without relying on JuMP and its optimizers such as ECOS, Juniper, etc. To motivate this, we observe that while optimization packages such as JuMP are convenient to use, they are often limited in scalability to large-scale problems when the number of training samples $m \gg 1$ and/or the feature dimension $d \gg 1$. This task requires considerably more advanced coding skills.

Our objectives are to find the linear classifier with the best training/testing error and the lowest computational cost using a custom-made iterative algorithm such as projected gradient descent. The shaping constraint is specialized to:

$$ \Sigma = I_7, \quad c = \mathbf{0}. $$


Under the above specialization, the SOCP problem in Task 3-(a) is equivalent to:

$$ \min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \ f(w, b) := \frac{1}{m} \sum_{i=1}^m \ell_i \max\big\{ 0,\ 1 - y^{(i)} \big( (x^{(i)})^\top w + b \big) \big\} \quad \text{s.t.} \quad w^\top w \le 1. \tag{1.5} $$

Note that the above problem admits a constraint set for $(w, b) \in \mathbb{R}^{d+1}$ given by

$$ X = \{ w \in \mathbb{R}^d, \ b \in \mathbb{R} : w^\top w \le 1 \}. \tag{1.6} $$
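The objective f(w, b) in (1.5) is simply a weighted average of hinge losses; here is a Python sketch (illustrative names, not from the template):

```python
def hinge_objective(w, b, X, y, ell=None):
    """Evaluate f(w, b) in (1.5): the weighted average hinge loss over the samples."""
    m = len(y)
    ell = ell if ell is not None else [1.0] * m
    total = 0.0
    for l_i, x_i, y_i in zip(ell, X, y):
        margin = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b)
        total += l_i * max(0.0, 1.0 - margin)
    return total / m

X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]
print(hinge_objective([1.0, 0.0], 0.0, X, y))   # 0.0: both margins equal exactly 1
print(hinge_objective([0.0, 0.0], 0.0, X, y))   # 1.0: both margins are 0
```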

An obstacle in applying gradient-based algorithms to (1.5) lies in the non-differentiability of $f(w, b)$. Taking this viewpoint, we shall consider a differentiable function $\hat{f}(w, b)$ that approximates $f(w, b)$ (to be specified later) and solve the constrained optimization:

$$ \min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \ \hat{f}(w, b) \quad \text{s.t.} \quad (w, b) \in X. \tag{1.7} $$

It is then easy to apply iterative algorithms such as projected gradient descent (PGD), conditional gradient, etc. (see Appendix A) to tackle the above problem.

Among other options, a potential candidate for approximating $f(w, b)$ is the logistic loss⁴, as we have done in Lecture 16 / Computing Lab 2:

$$ \hat{f}(w, b) = \frac{1}{m} \sum_{i=1}^m \ell_i \log\Big( 1 + \exp\big( -y^{(i)} \big( (x^{(i)})^\top w + b \big) \big) \Big). \tag{1.8} $$

Minimizing the above function leads to a solution $(w, b)$ such that $y^{(i)}\big((x^{(i)})^\top w + b\big)$ is large for all $i = 1, \dots, m$, which is a desired feature for a good classifier.
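A Python sketch of (1.8) and its gradient follows (illustrative helper names, with a numerically stable evaluation of log(1 + e^{-t})):

```python
import math

def logistic_loss(w, b, X, y, ell=None):
    """Evaluate the weighted logistic loss (1.8)."""
    m = len(y)
    ell = ell if ell is not None else [1.0] * m
    total = 0.0
    for l_i, x_i, y_i in zip(ell, X, y):
        t = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b)
        # log(1 + exp(-t)) rewritten as log1p(exp(-|t|)) + max(-t, 0) to avoid overflow
        total += l_i * (math.log1p(math.exp(-abs(t))) + max(-t, 0.0))
    return total / m

def logistic_grad(w, b, X, y, ell=None):
    """Gradient of (1.8) w.r.t. (w, b); each sample is weighted by sigma(-t) = 1/(1+e^t)."""
    m, d = len(y), len(w)
    ell = ell if ell is not None else [1.0] * m
    gw, gb = [0.0] * d, 0.0
    for l_i, x_i, y_i in zip(ell, X, y):
        t = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b)
        s = 1.0 / (1.0 + math.exp(t)) if t > -30.0 else 1.0   # sigma(-t), clipped
        for j in range(d):
            gw[j] -= l_i * y_i * s * x_i[j]
        gb -= l_i * y_i * s
    return [g / m for g in gw], gb / m

X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
print(logistic_loss([0.0, 0.0], 0.0, X, y))   # log(2) ~ 0.6931: every margin is 0
```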

Task 8: Customized Solver for Classifier Optimization (30%)

Using the training data with m = 16657 samples in ftec-cs-full-train.csv, implement an iterative algorithm to tackle (1.7). Notice that due to (1.6), the optimized classifier must satisfy $w^\top w \le 1$. Moreover, you are required to initialize your algorithm with $w_0 = \mathbf{0}$, $b_0 = 0$.

Suggestion: As a first attempt, you may consider the projected gradient descent (PGD) method using a constant step size with $\hat{f}(w, b)$ selected as the logistic loss (1.8). You may need to tune the weights $\{\ell_i\}_{i=1}^m$ to obtain a better result while paying attention to the ‘size’ (i.e., $\|x^{(i)}\|$) of each feature vector. Other candidates are also suggested in Appendix A.

Assessment. You will receive a maximum of 10% for correctly implementing at least one numerical algorithm (e.g., projected gradient), together with

1. plotting the trajectory of the algorithm to show that the objective value in (1.7) decreases to a certain value asymptotically, and providing comments on the algorithm(s) implemented;

2. providing derivations and justifications on why the implemented algorithm is used.

Besides achieving low training/testing error rates, a good binary classifier should also be efficiently computable. As we are dealing with a relatively large dataset, the complexity of your algorithm will be measured by the number of accesses to the training dataset⁵. Every time you evaluate the value of the

⁴Notice that the logistic objective function can be interpreted alternatively as a formulation for the classifier design task with the maximum-likelihood (ML) principle from statistics. This is beyond the scope of this project specification.

⁵This is sometimes known as the sample complexity.


objective function $\hat{f}(w, b)$, or the gradient of $\hat{f}(w, b)$, you incur one access to the training dataset. For instance, suppose you apply a projected GD method with a constant step size for 1000 iterations. As the algorithm evaluates the gradient of $\hat{f}(w, b)$ only once per iteration, the total number of accesses to the training dataset will be equal to 1000.

Secondly, another desired property is that the binary classifier should not be biased towards a particular feature. In particular, we wish to design a classifier that does not decide on a customer’s risk of default entirely based on the feature ‘RevolvingUtilizationOfUnsecuredLines’. Supposing this feature is denoted by the first coordinate of $x^{(i)}$, i.e., $[x^{(i)}]_1$, in your program, we can measure the biasedness towards it by:

$$ \text{\% weight of classifier on ‘RevolvingUtilizationOfUnsecuredLines’} = \frac{|w_1|}{\sum_{i=1}^7 |w_i|}. $$
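This percentage can be computed directly from the optimized w; a minimal illustrative sketch (the function name is ours):

```python
def weight_fraction(w, j=0):
    """Fraction of the classifier's total absolute weight placed on feature j."""
    return abs(w[j]) / sum(abs(wi) for wi in w)

print(weight_fraction([3.0, 1.0, 1.0]))   # 0.6: the first feature carries 60% of the weight
```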

Finally, the remaining 20% of your marks in this task will be calculated according to the following formula:

$$ \begin{aligned} \text{Score} = {} & 5\% \times \exp\big( 10 \cdot \max\{0.25, \text{Lowest Training Error Rate}\} - 10 \cdot \max\{0.25, \text{Your Training Error Rate}\} \big) \\ & + 5\% \times \exp\big( 10 \cdot \max\{0.3, \text{Lowest Testing Error Rate}\} - 10 \cdot \max\{0.3, \text{Your Testing Error Rate}\} \big) \\ & + 5\% \times \frac{\max\{0.25, \text{Lowest \% weight of classifier on ‘RevolvingUtilizationOfUnsecuredLines’}\}}{\max\{0.25, \text{Your \% weight of classifier on ‘RevolvingUtilizationOfUnsecuredLines’}\}} \\ & + 5\% \times \frac{\max\{75, \text{Lowest total number of accesses to training dataset}\}}{\max\{75, \text{Your total number of accesses to training dataset}\}}. \end{aligned} \tag{1.9} $$

The lowest error rates (resp. the lowest number of accesses to the training dataset, etc.) are the lowest ones achieved among the FTEC2101 class⁶. Some tips for improving the performance of your design can be found in Appendix A.

If you have tried more than one algorithm and/or more than one type of approximation, you have to select only one set of classifier parameters (w, b) for consideration in the competition via (1.9). Please indicate which solution is selected in your report and include it in the submission of your program files. That said,

you are encouraged to try more of these different variants and include them in the project report. Moreover,

observe the following rules:

• The algorithms you designed are not allowed to directly optimize on the testing set data. In other

words, your iterative algorithm should not rely on any data in ftec-cs-full-test.csv as you are not

supposed to see the ‘future customer’ data while training a classifier for loan applicants. Your score in

(1.9) will be set to zero if we detect such ‘cheating’ behavior. However, you can evaluate the test error performance of your solution as many times as you like before you find the best setting.

• Your selected algorithm for the competition must be deterministic. In other words, you may not use stochastic algorithms such as stochastic gradient descent (SGD) for the competition. That being said,

you are still encouraged to try such algorithms as an additional task which may be counted towards the

‘innovation’ section.

If you have questions about the rules, please do not hesitate to consult the instructor at htwai@cuhk.edu.hk

or the TA or raise your questions on Piazza.

⁶The scores for ESTR2520 students will be calculated by taking the best error performance across both ESTR2520 and FTEC2101 students.


1.3 Report (20%)

You are required to compile a project report with answers to the questions posed in Task 1 to Task 8. You

may structure the report according to the order of the tasks, for instance:

1. Background and Introduction — In this section, you can briefly introduce the problem, e.g., explain-

ing the goal of classifier design, discussing the role of optimization methods in tackling the problem.

2. Model and Theory — In this section, you can discuss how the classifier design problem is modeled as

optimization problems. More specifically,

– You shall begin by discussing the hard-margin formulation (1.2) and then answer Task 1.

– Next, you can describe the optimization models with soft-margin and then answer Tasks 2 & 3.

3. Experiments — In this section, you describe the experiments conducted to test your formulation, i.e.,

– You shall first describe the dataset by answering Task 4. In addition, it is helpful to describe a few

properties regarding the dataset, e.g., the size of the dataset, the range of the values for the different

features.

– Then, you can describe the experiments for each of the 2 formulations by answering Tasks 5 & 6.

– Finally, you can compare the formulations by answering Task 7.

4. Competitive Task — In this section, you describe the custom solver you built to solve the large-scale

classifier design problem, i.e.,

– You shall first describe the approximation technique as laid out in the discussion of (1.5), (1.8).

– Then, you shall describe the iterative algorithm you have derived in Task 8.

– Apply the iterative algorithm on the complete training dataset and show the objective value vs. iter-

ation number. Discuss whether the algorithm converges and report on the performance of the designed

classifier.

5. Conclusions — In this section, you shall summarize the findings of the project and discuss various aspects of the formulation that could be improved, etc.

Throughout the report, feel free to write answers that involve equations (e.g., Tasks 1–3) on paper and scan them into your Word/PDF report as figures. For Tasks 4 to 8, please include all the plots and comments as requested. For Task 8, please indicate the training error, testing error, number of accesses to the training dataset of the selected solution, and the % weight on ‘RevolvingUtilizationOfUnsecuredLines’ (we will also run your code to verify the reported values).

The program code has to be submitted separately. However, you are welcome to use excerpts from the program code in the report if you find them helpful for explaining your solution concepts.

Lastly, you are welcome to use online resources when preparing the project. However, you must give proper references for sources that are not your original creation.

Assessment. Here is a breakdown of the assessment metrics for the report-writing component.

• (10%) Report Writing : A project report shall be readable to a person with knowledge in optimization

(e.g., your classmates in FTEC2101/ESTR2520). Make sure that your report is written with clarity, and

more importantly, using your own language!


• (10%) Innovation: You can earn innovation marks if you include extra experiments, presentations, etc., that are relevant to the project (with sufficient explanations); see Appendix A for some recommendations.

1.4 Submission

This is an individual project. While discussions regarding how to solve the problems are encouraged, students should answer the problems on their own (just like your HWs). The deadline of submission is May 13 (Friday), 2022, 23:59 (HKT). Please submit the following content to Blackboard:

• Your Project Report in PDF format.

• Your Program Codes [either in Jupyter notebook (.ipynb), or Julia code (.jl)].

In addition, the project report shall be submitted to VeriGuide for plagiarism check.

A Additional Information

Suggestions — The following are only suggestions for improving the performance of your classifier design in Task 8. Of course, you are more than welcome to propose and explore new ideas (but still, make sure that they are mathematically feasible)!

• Formulation Aspect – Here are some tricks to tweak the performance of your classifier design in Task 8.

1. The design of the weights $\{\ell_i\}_{i=1}^m$ may be crucial to the performance of your classifier. For instance, we observe that the gradient of $\hat{f}(w, b)$ can be written as

$$ \nabla \hat{f}(w, b) = \frac{1}{m} \sum_{i=1}^m \ell_i \cdot \nabla \log\Big( 1 + e^{-y^{(i)} ( (x^{(i)})^\top w + b )} \Big). $$

You may observe that the term $\nabla \log\big(1 + e^{-y^{(i)} ((x^{(i)})^\top w + b)}\big)$ depends on $x^{(i)}$; in particular, imagine what happens if $\|x^{(i)}\|$ is extremely large⁷. In this regard, a possible remedy is to appropriately design $\ell_i$ to ‘downplay’ the effect of those samples with large $\|x^{(i)}\|$ on the classifier design.

(Mind that such a pre-processing step to design $\ell_i$ will cost you one access to the training dataset.)
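One illustrative way to carry this out in Python (the downweighting rule and the constant c below are our own hypothetical choices, not prescribed by the spec):

```python
import math

def design_weights(X, c=1.0):
    """Hypothetical heuristic: ell_i = min(1, c / ||x_i||), so large-norm samples get small weight."""
    weights = []
    for x_i in X:
        nrm = math.sqrt(sum(xj * xj for xj in x_i))
        weights.append(min(1.0, c / nrm) if nrm > 0.0 else 1.0)
    return weights

X = [[0.5, 0.5], [30.0, 40.0]]   # the second sample is an outlier
print(design_weights(X))         # [1.0, 0.02]: the outlier is heavily downweighted
```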

2. As an alternative approach, recall from Task 2 that the optimization’s goal is to minimize the number of training samples that violate the constraints in (1.3). We may replace the logistic loss function by the sigmoid loss function, given by:

$$ \hat{f}(w, b) = \frac{1}{m} \sum_{i=1}^m \ell_i \, \frac{1}{1 + e^{y^{(i)} ( (x^{(i)})^\top w + b )}}, $$

which puts less ‘penalty’ than the logistic loss on solutions with $y^{(i)}\big((x^{(i)})^\top w + b\big) \ll 0$. Note that the above function is not convex in $(w, b)$, yet the projected gradient descent method continues to work, e.g., as shown in [Beck, 2017].

⁷A sample with an extremely large $\|x^{(i)}\|$ is sometimes called an outlier.


3. To reduce the weight of a particular feature in the classifier, you may adopt a penalization approach (similar in principle to the Big-M method). For example, if you want to suppress the value of $w_1$, you may consider the following penalized objective function:

$$ \tilde{f}(w, b) = \hat{f}(w, b) + \alpha |w_1|^2 $$

for some $\alpha > 0$ as an adjustable penalty parameter.

• Algorithm Aspect – For Task 8, with the constraint structure given, the recommended algorithm is the projected gradient descent (PGD) method, described as follows. Consider solving a general optimization problem:

$$ \min_{w \in \mathbb{R}^d,\, b \in \mathbb{R}} \ \hat{f}(w, b) \quad \text{s.t.} \quad (w, b) \in X. \tag{1.10} $$

With a slight abuse of notation, we denote $x \equiv (w, b)$, and the PGD method can be described as:

PGD Method
Input: $x^{(0)} \in X$, constant step size $\gamma > 0$, maximum iteration number $K_{\max}$.
For $k = 0, \dots, K_{\max}$:
  $x^{(k+1)} = \text{Proj}_X\big\{ x^{(k)} - \gamma \nabla \hat{f}(x^{(k)}) \big\}$
End For

For any $(w, b) \in \mathbb{R}^d \times \mathbb{R}$, the projection onto $X$ in (1.6) is given by $(w_P, b_P) = \text{Proj}_X(w, b)$ with

$$ w_P = \frac{1}{\max\{1, \|w\|\}} \, w, \qquad b_P = b. $$

Besides, you are more than welcome to explore other iterative algorithms, e.g., conditional gradient, backtracking line search, etc., for solving the optimization at hand.
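A compact PGD sketch in Python (which the spec permits as an alternative environment; the objective in the usage example is a made-up quadratic, used only to show the mechanics of the projection):

```python
import math

def project(w, b):
    """Projection onto X in (1.6): rescale w onto the unit ball, leave b untouched."""
    nrm = math.sqrt(sum(wi * wi for wi in w))
    return ([wi / nrm for wi in w] if nrm > 1.0 else list(w)), b

def pgd(grad, w0, b0, step, iters):
    """Projected gradient descent with a constant step size; grad returns (grad_w, grad_b)."""
    w, b = list(w0), b0
    for _ in range(iters):
        gw, gb = grad(w, b)
        w = [wi - step * gi for wi, gi in zip(w, gw)]
        b -= step * gb
        w, b = project(w, b)
    return w, b

# Usage: minimize (w1 - 2)^2 + w2^2 + b^2 subject to ||w|| <= 1.
grad = lambda w, b: ([2.0 * (w[0] - 2.0), 2.0 * w[1]], 2.0 * b)
w, b = pgd(grad, [0.0, 0.0], 0.0, step=0.4, iters=100)
print(w, b)   # w approaches [1.0, 0.0], the unit-ball point closest to (2, 0)
```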

• Presentation Aspect – Here are some suggestions for discussions which you may include to enrich the content of your report.

1. To understand and compare the hinge loss, the logistic loss, and the sigmoid loss, it may be helpful to plot the following 1-D functions:

$$ \text{Hinge}(x) = \max\{0, 1 - x\}, \quad \text{Logistic}(x) = \log(1 + e^{-x}), \quad \text{Sigmoid}(x) = \frac{1}{1 + e^x}, $$

and discuss how they put a penalty on the situation when x is large.
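A small illustrative script that tabulates the three losses (the function names are ours):

```python
import math

def hinge(x):
    return max(0.0, 1.0 - x)

def logistic(x):
    # log(1 + e^{-x}), computed stably for large |x|
    return math.log1p(math.exp(-abs(x))) + max(-x, 0.0)

def sigmoid(x):
    # 1 / (1 + e^{x}); note it saturates at 1 as x -> -infinity
    return 1.0 / (1.0 + math.exp(x)) if x < 50.0 else 0.0

# For very negative x, hinge and logistic grow linearly while sigmoid flattens out at 1.
for x in (-2.0, 0.0, 2.0):
    print(f"x={x:5.1f}  hinge={hinge(x):.4f}  logistic={logistic(x):.4f}  sigmoid={sigmoid(x):.4f}")
```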

2. The performance of binary classification can be measured more precisely by considering the precision rate and recall rate. For a given labeled dataset (e.g., the testing dataset), they are defined respectively as:

$$ \text{Precision} = \frac{\text{No. of samples with } y_{\text{test}}^{(i)} = \hat{y}^{(i)} = 1}{\text{No. of samples with } \hat{y}^{(i)} = 1}, \qquad \text{Recall} = \frac{\text{No. of samples with } y_{\text{test}}^{(i)} = \hat{y}^{(i)} = 1}{\text{No. of samples with } y_{\text{test}}^{(i)} = 1}. $$

Note that $y_{\text{test}}^{(i)} = \hat{y}^{(i)} = 1$ means that the $i$th sample is a customer who defaulted and the classifier correctly identifies him/her as a customer who will default.
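These two rates can be computed with a few lines of Python (an illustrative function, assuming labels in {+1, -1}):

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (+1, 'default') class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(1 for p in y_pred if p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    precision = tp / pred_pos if pred_pos else float('nan')
    recall = tp / actual_pos if actual_pos else float('nan')
    return precision, recall

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1]
print(precision_recall(y_true, y_pred))   # 2 true positives out of 3 predicted / 3 actual positives
```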

Lastly, tips for implementing the MI-NLP, SOCP, etc., in the compulsory tasks have been included with the template program. Be reminded that it is not necessary to follow all the tips therein.

References

A. Beck. First-Order Methods in Optimization. SIAM, 2017.

C. D. Sa. Support Vector Machine, Cornell University CS4780, 2022. URL https://www.cs.cornell.edu/courses/cs4780/2022sp/notes/LectureNotes13.html.

Project Specification – Binary Classification

Last Updated: April 25, 2022, Deadline: May 13, 2022, 23:59 (HKT)

Thus far we have learnt a number of optimization methods, ranging from the simplex method for linear

programming, modeling techniques for integer programming, gradient/Newton methods for unconstrained

optimization, KKT conditions, SOCP formulation, etc.. Besides theories, please remember that optimization

methods are practical tools for solving real world problems. The aim of this project is to practice our skill on

the latter aspect. We will make use of the Julia1 environment.

The project’s theme is about practical solutions for binary classification. It can be considered as an extension to

what we have experimented in Computing Lab 2. We will explore additional aspects about binary classification

and more importantly, implement and compare these ideas on a real-world dataset.

Background. Let us recall that the binary classification task aims at learning a linear classifier that distinguishes a feature vector as positive (+1) or negative (−1). Let d ∈ N be the dimension of the feature vector; the (linear) classifier is described by the tuple (w, b), where w ∈ R^d is the direction parameter and b ∈ R is the bias parameter2, both specifying the linear model. Given (w, b), a feature vector x ∈ R^d is classified into a label y ∈ {±1} such that

y = +1, if w⊤x + b = w1x1 + · · · + wdxd + b ≥ 0,

y = −1, if w⊤x + b = w1x1 + · · · + wdxd + b < 0.   (1.1)
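In code, the rule in (1.1) amounts to taking the sign of the score w⊤x + b. Here is a minimal sketch in Python (an alternative language permitted by footnote 1; the function name is illustrative, not part of the template):

```python
def classify(w, b, x):
    """Label a feature vector by the rule in (1.1): +1 if w'x + b >= 0, else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

For example, classify([1.0, -1.0], 0.5, [0.2, 0.1]) returns +1, since the score 0.6 is non-negative.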

As an illustration, the following figure shows a case with d = 2 such that the classifier is described by

w = (w1, w2) and the scalar b:

[Figure: a hand-drawn 2-D illustration showing ‘+’ points on one side and ‘✗’ points on the other side of the dashed separating line w1x1 + w2x2 + b = 0.]

According to the rule in (1.1), all the feature vectors that lie above the dashed line shall be classified as +1

(blue points); those that lie below the dashed line shall be classified as −1 (red points).

1As an alternative, you are welcome to use Python with optimization modeling packages supporting SOCP and MI-NLP, such as cvxpy. However, in that case you have to notify the instructor about this choice and your plan for accomplishing the project on or before May 1, 2022, i.e., two weeks before the deadline. In this project, you are not allowed to use any pre-built solver for binary classification, such as MLJ.jl, ScikitLearn.jl, Flux.jl, or scikit-learn, in your final submission (though you are encouraged to try out these packages as extra work to support your solution).

2Note that this differs slightly from Lecture 16 as we include the bias parameter in the classifier design.


FTEC2101/ESTR2520 Project 2

Dataset & Folder Setting We have m ∈ N samples of training data described by {x^(i), y^(i)}_{i=1}^m, where x^(i) ∈ R^d is the ith feature vector and y^(i) ∈ {±1} is the associated label. We will use a curated version of the GiveMeSomeCredits dataset taken from https://www.kaggle.com/c/GiveMeSomeCredit. It includes customer data such as their debt ratios, credit utilization, age, etc., and their history of defaulting. With this information, our goal is to learn a classifier that predicts whether a customer has any risk of defaulting before the bank approves loan(s) for him/her.

To prepare for the project, please retrieve the Jupyter notebook template ftec2101-project-template-2022.ipynb and the zip archive ftec project files.zip from the Blackboard. Place the *.ipynb file in a working directory, extract the *.zip file, and move the *.csv files inside into the same working directory. Your working directory should have the following content:

Notice that:

• ftec-cs-groupi-train.csv – the training dataset for students in group i, i = 1, 2, 3, 4, 5. This is a csv file that contains 20 samples of customer data to be used in the Compulsory Tasks.

• ftec-cs-groupi-test.csv – the testing dataset for students in group i, i = 1, 2, 3, 4, 5. This is a csv file that contains 30 samples of customer data to be used in the Compulsory Tasks.

• ftec-cs-full-train.csv – the training dataset that contains 16657 samples of customer data to be used in the Competitive Tasks.

• ftec-cs-full-test.csv – the testing dataset that contains 34870 samples of customer data to be used in the Competitive Tasks.

Lastly, the Jupyter notebook ftec2101 project 2022.ipynb provides detailed descriptions and helper codes to guide you through the tasks required by this project. Please pay attention to the comments provided inside the code cells of the Jupyter notebook as well.

1.1 Compulsory Tasks (50%)

The compulsory tasks of this project are divided into two parts. You are required to

(A) answer questions related to the optimization theory and modeling related to binary classification; and

(B) implement the modeled optimization problems on computer and solve them with the GiveMeSomeCredits

dataset.


Theory and Modeling Denote the training dataset with m samples as {x^(i), y^(i)}_{i=1}^m, where x^(i) ∈ R^d is the ith feature vector and y^(i) ∈ {±1} is the associated label. Let λ > 0 be a fixed scalar. The following optimization problem designs what is known as the hard-margin classifier [Sa, 2022]:

min_{w ∈ R^d, b ∈ R}  (1/2) w⊤w   s.t.  y^(i)((x^(i))⊤w + b) ≥ 1,  i = 1, . . . , m.   (1.2)

It can be easily shown that (1.2) is a convex optimization problem.

Task 1: Hard-margin Formulation (7.5%)

Answer the following question:

(a) Explain how solving (1.2) finds a classifier (w⋆, b⋆) that can correctly distinguish the m training samples into the +1 or −1 labels. (Hint: focus on the constraints in (1.2) and compare them to (1.1) as well as the figure that explains it.)

(b) Derive the KKT conditions for (1.2). Using the KKT conditions, show that the optimal w⋆ is a linear combination of the x^(i) for the samples i such that y^(i)((x^(i))⊤w⋆ + b⋆) = 1.

(c) Give an example of a training dataset with d = 2 for which (1.2) is infeasible. You may describe such a dataset by drawing it on a 2-D plane.

Due to the issue of infeasibility investigated in part (c) of Task 1, the hard-margin formulation (1.2) is seldom used in practice. Instead, we consider a soft-margin formulation, which introduces a non-negative ‘auxiliary decision variable’3 ξ_i into each of the constraints, yielding:

y^(i)((x^(i))⊤w + b) ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, . . . , m.   (1.3)

Instead of forcing y^(i)((x^(i))⊤w + b) ≥ 1 for all training samples, i.e., requiring that all the training samples are correctly classified, the above constraint with the auxiliary variable ξ_i allows for erroneous classification. The auxiliary variable ξ_i is interpreted as the amount of error for the ith sample. Notice that ξ_i > 0 is needed if and only if the ith training sample violates the margin constraint, i.e., y^(i)((x^(i))⊤w + b) < 1. Ideally, we should find (w, b) such that the number of samples i with ξ_i > 0 is minimized.
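To make the role of ξ_i concrete, the smallest ξ_i satisfying (1.3) for a given (w, b) is max{0, 1 − y^(i)((x^(i))⊤w + b)}. A Python sketch (the helper name is illustrative):

```python
def min_slack(w, b, x_i, y_i):
    """Smallest auxiliary variable xi_i satisfying (1.3): max{0, 1 - margin}."""
    margin = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b)
    return max(0.0, 1.0 - margin)
```

A correctly classified sample with margin at least 1 needs no slack, while a misclassified sample needs ξ_i > 1.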

Below, you will formulate a mixed integer program (in fact, a mixed-integer non-linear program, MI-NLP) for the soft-constraint formulation.

Task 2: Mixed Integer Formulation (5%)

Answer the following question: Based on the soft-margin constraint (1.3) [as well as the max-margin

problem (1.2)], formulate an optimization problem with the following specification:

• The objective is to minimize the total number of auxiliary decision variables ξi with ξi > 0

(i.e., the total number of errors).

• The amount of error is bounded such that

0 ≤ ξ_i ≤ M,  i = 1, . . . , m,

where M is a known/given constant.

3Note that it bears some resemblance to the artificial variable technique we learnt in the Big-M simplex method.

• The soft-margin constraints (1.3) are satisfied.

• The direction parameter w ∈ R^d satisfies the following shaping constraint:

w⊤Σw + c⊤w ≤ 1,

where Σ ∈ R^{d×d} is a given symmetric, positive definite matrix, and c ∈ R^d is a given vector.

The MIP is difficult to solve for large-scale problems with m ≫ 1. As a remedy, we shall approximate the objective function in Task 2 by a linear function in {ξ_i}_{i=1}^m. Consider the following task:

Task 3: SOCP Formulation (5%)

Answer the following question:

(a) Based on the soft-margin constraint (1.3) [and the max-margin problem (1.2)], formulate an optimization problem with the following specifications (notice that unlike in Task 2, we do not impose the constraint ξ_i ≤ M):

• The objective is to minimize the weighted sum of constraint violations over all training samples:

∑_{i=1}^m ℓ_i ξ_i,

where ℓ_i > 0, i = 1, . . . , m, is a set of given weights.

• The soft-margin constraints (1.3) are satisfied.

• The direction parameter w ∈ R^d satisfies the following shaping constraint:

w⊤Σw + c⊤w ≤ 1,

where Σ ∈ R^{d×d} is a given symmetric, positive definite matrix and c ∈ R^d is a given vector.

(b) Show that the problem in part (a) can be reformulated as an SOCP.

Computational We next focus on solving the optimization problems formulated in Tasks 2 & 3 on computers. We structure the tasks into 3 stages: data analysis, optimization, and interpretation of results. The Jupyter notebook provides descriptions and helper codes to guide you through most of the following tasks. Please pay attention to the comments in the Jupyter notebook as well.

Data Analysis. In the compulsory tasks, we will focus on a small dataset that only consists of m = 20 samples (stored in ftec-cs-groupi-train.csv). The training dataset contains the information of

“RevolvingUtilizationOfUnsecuredLines”, “age”, “DebtRatio”, “MonthlyIncome”, “NumberOfOpenCreditLinesAndLoans”, “NumberRealEstateLoansOrLines”, “NumberOfDependents”

for m = 20 customers. Notice that the dataset has been preprocessed such that the above features are normalized, with typical values ranging between 0 and 1 (there are some exceptions though).


Moreover, the training dataset contains the information of whether the customer has a history of “Default” or not. We treat the latter information as the label y^(i) to train our classifier to decide whether the customer is worth granting a loan or not.

We shall first get ourselves familiar with the dataset. In particular, we shall visualize the dataset to observe the behavioral patterns of customers with or without a ‘Default’ history.

Task 4: Warmup Exercise (5%)

(a) Inspect the dataset by making a 2-D scatter plot of the 20 samples over the features ‘RevolvingUtilizationOfUnsecuredLines’ and ‘NumberOfOpenCreditLinesAndLoans’. Mark the data points of ‘Default’ (resp. ‘Non-default’) customers in red (resp. blue). Comment on the pattern observed for ‘Default’ customers.

(b) Try 2-3 more combinations of pairs of features and make comments on the observations.

Remark: The program template has provided the relevant helper codes for this task, but you may have

to ‘tweak’ the template to examine other pairs of features in part (b).

For part (a), you may observe an output similar to the following:

Classifier Optimization. In the following tasks, you will implement classifier designs based on MI-NLP and

SOCP from Task 2 & 3.

As a first step, we shall design the classifier based on only two features in the data, namely,

‘RevolvingUtilizationOfUnsecuredLines’ and ‘NumberOfOpenCreditLinesAndLoans’. (1.4)

In other words, we consider the classifier design problems [cf. Task 2 & 3] with d = 2 first.

Task 5: Optimization-based Formulation (17.5%)

(a) Based on the selected features in (1.4), implement and solve the MI-NLP from Task 2 with the following parameters:

M = 100,  Σ = (1/200) [1 0; 0 1],  c = (0, 0)⊤,  ℓ_i = 1, i = 1, . . . , m.

You may use the solver Juniper (with Ipopt) in JuMP for tackling the MI-NLP.

(b) Based on the selected features in (1.4), implement and solve the SOCP from Task 3 with the following parameters:

Σ = (1/200) [1 0; 0 1],  c = (0, 0)⊤,  ℓ_i = 1, i = 1, . . . , m.

You may use the solver ECOS in JuMP for tackling the SOCP.

(c) Make a scatter plot of the training dataset for the features in (1.4) as in Task 4-(a). Then, overlay

the linear classifiers found in part (a) and part (b) on top of this scatter plot. Comment on whether

the classifiers found are reasonable.

Notice that it may take a long time to solve the MI-NLP in Task 5-(a), since an MI-NLP problem is quite challenging to solve in general. The scatter plot in Task 5-(c) should indicate the classifiers found by the MIP and SOCP formulations.

Similar to Computing Lab 2, the performance of a classifier can be evaluated by its error rate when applied to a certain set of data. For instance, recalling that {x^(i), y^(i)}_{i=1}^m is the training dataset, the training error rate for a given classifier (w, b) can be estimated as

Training Error Rate = (1/m) ∑_{i=1}^m 1(y^(i) ≠ ŷ^(i)),  where  ŷ^(i) = +1 if w⊤x^(i) + b ≥ 0, and ŷ^(i) = −1 if w⊤x^(i) + b < 0,

i.e., it is the number of errors made by the classifier divided by the number of samples. Notice that 0 ≤ Training Error Rate ≤ 1.
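As a sketch of how such an error rate could be computed (this is not the template's code; Python is used per footnote 1, and the function accepts any sample size and feature dimension):

```python
def error_rate(w, b, X, y):
    """Fraction of samples whose predicted label sign(w'x + b) disagrees with y."""
    errors = 0
    for x_i, y_i in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, x_i)) + b
        y_hat = 1 if score >= 0 else -1
        errors += (y_hat != y_i)
    return errors / len(y)
```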

As our aim is to design a classifier that makes predictions on whether a future customer (i.e., one not found in the training dataset) will ‘default’ on his/her loans, it is necessary to evaluate the error rate on a testing dataset that is unseen by the program. Denoting the testing dataset with m_test samples as {x_test^(i), y_test^(i)}_{i=1}^{m_test}, the testing error rate for a classifier (w, b) can be estimated as

Testing Error Rate = (1/m_test) ∑_{i=1}^{m_test} 1(y_test^(i) ≠ ŷ^(i)),  where  ŷ^(i) = +1 if w⊤x_test^(i) + b ≥ 0, and ŷ^(i) = −1 if w⊤x_test^(i) + b < 0.

To this end, we consider the following task:

Task 6: Error Performance (5%)

Notice that for our project, the testing dataset is stored in ftec-cs-groupi-test.csv.

(a) Write a function to evaluate the testing/training error rate as defined in the above.

(b) Evaluate and compare the error rate performances for the two classifiers you have found in Task 5.

Remark: Try to make the function for evaluating the error in (a) general, such that it takes a dataset of any size and features of any dimension. You will have to reuse the same function in Tasks 7 & 8.

Intuitively, the performance of a classifier can be improved when more features are included. As the last task in the compulsory section, we shall use all d = 7 features of the GiveMeSomeCredit dataset to design our classifier:

Task 7: Full Classifier (5%)

(a) Repeat Tasks 5-(a), 5-(b), but design the classifiers based on all d = 7 features with the following parameters:

M = 100,  Σ = (1/200) I₇,  c = 0,  ℓ_i = 1, i = 1, . . . , m,

where I₇ is the 7 × 7 identity matrix and 0 is a 7 × 1 all-zero vector. By inspecting the optimized classifier from the SOCP, describe which features are important for deciding whether a customer is likely to default. Comment on whether this observation is reasonable.

(b) Evaluate and compare the error rate performances of the two classifiers you have found. Is the training error rate better than that of the classifier based on only d = 2 features in Task 5? What about the testing error rate? Comment on your observations.

(c) Change the parameters, e.g., adjust the shaping constraint by changing Σ to (1/10) I₇, and observe the effects on the error rate. Comment on your observations.

1.2 Competitive Tasks (30%)

Unlike the compulsory tasks, the goal of this competitive task is to implement your own (approximate) solver for the binary classifier problem, without relying on JuMP and its optimizers such as ECOS, Juniper, etc. To motivate this, we observe that while optimization packages such as JuMP are convenient to use, they are often limited in scalability for large-scale problems, when the number of training samples m ≫ 1 and/or the feature dimension d ≫ 1. The task requires considerably more advanced coding skills.

Our objective is to find the linear classifier with the best training/testing error and the lowest computation cost using a custom-made iterative algorithm such as projected gradient descent, with the shaping constraint specialized as:

Σ = I₇,  c = 0.


Under the above specialization, the SOCP problem in Task 3-(a) will be equivalent to:

min_{w ∈ R^d, b ∈ R}  f(w, b) := (1/m) ∑_{i=1}^m ℓ_i max{0, 1 − y^(i)((x^(i))⊤w + b)}   s.t.  w⊤w ≤ 1.   (1.5)

Note that the above problem admits a constraint set for (w, b) ∈ R^{d+1} given by

X = {(w, b) : w ∈ R^d, b ∈ R, w⊤w ≤ 1}.   (1.6)

An obstacle in applying gradient-based algorithms to (1.5) lies in the non-differentiability of f(w, b). Taking this viewpoint, we shall consider a differentiable function f̂(w, b) that approximates f(w, b) (to be specified later) and solve the constrained optimization:

min_{w ∈ R^d, b ∈ R}  f̂(w, b)   s.t.  (w, b) ∈ X.   (1.7)

It should be easy to apply iterative algorithms such as projected gradient descent (PGD), conditional gradient,

etc. (see Appendix A), to tackle the above problem.

Among other options, a potential candidate for approximating f(w, b) is the logistic loss4, as we have done in Lecture 16 / Computing Lab 2:

f̂(w, b) = (1/m) ∑_{i=1}^m ℓ_i log(1 + exp(−y^(i)((x^(i))⊤w + b))).   (1.8)

Minimizing the above function leads to a solution (w, b) such that y^(i)((x^(i))⊤w + b) is large for all i = 1, . . . , m, which is a desired feature of a good classifier.
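Both the value of (1.8) and its gradient can be evaluated in a single pass over the data. A possible Python sketch (names and interface are illustrative, not from the template; for simplicity there is no guard against overflow for very large negative margins):

```python
import math

def logistic_loss_and_grad(w, b, X, y, ell=None):
    """Value of f_hat(w, b) in (1.8) and its gradient w.r.t. (w, b).

    Uses d/dt log(1 + e^{-t}) = -1 / (1 + e^t), with t = y * (w'x + b)."""
    m, d = len(X), len(w)
    ell = ell if ell is not None else [1.0] * m
    f, gw, gb = 0.0, [0.0] * d, 0.0
    for x_i, y_i, l_i in zip(X, y, ell):
        t = y_i * (sum(wj * xj for wj, xj in zip(w, x_i)) + b)  # margin
        f += l_i * math.log(1.0 + math.exp(-t))
        s = -y_i / (1.0 + math.exp(t))  # chain-rule factor for this sample
        gw = [g + l_i * s * xj for g, xj in zip(gw, x_i)]
        gb += l_i * s
    return f / m, [g / m for g in gw], gb / m
```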

Task 8: Customized Solver for Classifier Optimization (30%)

Using the training dataset with m = 16657 samples in ftec-cs-full-train.csv, implement an iterative algorithm to tackle (1.7). Notice that due to (1.6), the optimized classifier must satisfy w⊤w ≤ 1. Moreover, you are required to initialize your algorithm with w₀ = 0, b₀ = 0.

Suggestion: As a first attempt, you may consider the projected gradient descent (PGD) method using a constant step size, with f̂(w, b) selected as the logistic loss (1.8). You may need to tune the weights {ℓ_i}_{i=1}^m to obtain a better result, while paying attention to the ‘size’ (i.e., ‖x^(i)‖) of each feature vector. Other candidates are also suggested in Appendix A.

Assessment You will receive a maximum of 10% for correctly implementing at least one numerical algorithm (e.g., projected gradient), together with

1. plotting the trajectory of the algorithm to show that the objective value in (1.7) decreases to a certain value asymptotically, and providing comments on the algorithm(s) implemented,

2. providing derivations and justifications for why the implemented algorithm is used.

Besides achieving low training/testing error rates, a good binary classifier should also be efficiently computable. As we are dealing with a relatively large dataset, the complexity of your algorithm will be measured by the number of accesses to the training dataset5. Every time you evaluate the value of the objective function f̂(w, b), or the gradient of f̂(w, b), you incur one access to the training dataset.

For instance, suppose you apply a projected GD method with a constant step size for 1000 iterations. As the algorithm evaluates the gradient of f̂(w, b) only once per iteration, the total number of accesses to the training dataset will be equal to 1000.

4Notice that the logistic objective function can alternatively be interpreted as a formulation of the classifier design task under the maximum-likelihood (ML) principle from statistics. This is beyond the scope of this project specification.

5This is sometimes known as the sample complexity.

Secondly, another desired property is that the binary classifier should not be biased towards a particular feature. In particular, here we wish to design a classifier that does not decide on the risk of default for a customer entirely based on the feature ‘RevolvingUtilizationOfUnsecuredLines’. Supposing this feature is denoted by the first coordinate of the feature vector in your program, we can measure the biasedness towards it by:

% weight of classifier on ‘RevolvingUtilizationOfUnsecuredLines’ = |w₁| / ∑_{i=1}^7 |w_i|.
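This percentage is straightforward to compute from the optimized w. A Python one-liner sketch (function name illustrative):

```python
def weight_fraction(w, j=0):
    """Fraction of the total absolute weight that the classifier puts on feature j."""
    return abs(w[j]) / sum(abs(wi) for wi in w)
```

For example, weight_fraction([3.0, 1.0, 0.0]) returns 0.75.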

Finally, the remaining 20% of your marks in this task will be calculated according to the following formula:

Score = 5% × exp(10 · max{0.25, Lowest Training Error Rate} − 10 · max{0.25, Your Training Error Rate})

+ 5% × exp(10 · max{0.3, Lowest Testing Error Rate} − 10 · max{0.3, Your Testing Error Rate})

+ 5% × (max{0.25, Lowest % weight of classifier on ‘RevolvingUtilizationOfUnsecuredLines’} / max{0.25, Your % weight of classifier on ‘RevolvingUtilizationOfUnsecuredLines’})

+ 5% × (max{75, Lowest total number of accesses to training dataset} / max{75, Your total number of accesses to training dataset}).   (1.9)

The ‘Lowest’ error rates (resp. the lowest number of accesses to the training dataset, etc.) in (1.9) refer to the best values achieved among the FTEC2101 class6. Some tips for improving the performance of your design can be found in Appendix A.

If you have tried more than one algorithm and/or more than one type of approximation, you have to select only one set of classifier parameters (w, b) for consideration in the competition via (1.9). Please indicate which solution is selected in your report and include it in the submission of your program files. That said, you are encouraged to try more of these different variants and include them in the project report. Moreover, observe the following rules:

• The algorithms you design are not allowed to directly optimize on the testing set data. In other words, your iterative algorithm should not rely on any data in ftec-cs-full-test.csv, as you are not supposed to see the ‘future customer’ data while training a classifier for loan applicants. Your score in (1.9) will be set to zero if we detect such ‘cheating’ behavior. However, you can evaluate the test error performance of your solution as many times as you like before you find the best setting.

• Your selected algorithm for the competition must be deterministic. In other words, you may not use stochastic algorithms such as stochastic gradient descent (SGD) for the competition. That being said, you are still encouraged to try such algorithms as an additional task, which may be counted towards the ‘innovation’ section.

If you have questions about the rules, please do not hesitate to consult the instructor at htwai@cuhk.edu.hk or the TA, or raise your questions on Piazza.

6The scores for ESTR2520 students will be calculated by taking the best error performance across both ESTR2520 and

FTEC2101 students.


1.3 Report (20%)

You are required to compile a project report with answers to the questions posed in Task 1 to Task 8. You

may structure the report according to the order of the tasks, for instance:

1. Background and Introduction — In this section, you can briefly introduce the problem, e.g., explain-

ing the goal of classifier design, discussing the role of optimization methods in tackling the problem.

2. Model and Theory — In this section, you can discuss how the classifier design problem is modeled as

optimization problems. More specifically,

– You shall begin by discussing the hard-margin formulation (1.2) and then answer Task 1.

– Next, you can describe the optimization models with soft-margin and then answer Tasks 2 & 3.

3. Experiments — In this section, you describe the experiments conducted to test your formulation, i.e.,

– You shall first describe the dataset by answering Task 4. In addition, it is helpful to describe a few

properties regarding the dataset, e.g., the size of the dataset, the range of the values for the different

features.

– Then, you can describe the experiments for each of the 2 formulations by answering Tasks 5 & 6.

– Finally, you can compare the formulations by answering Task 7.

4. Competitive Task — In this section, you describe the custom solver you built to solve the large-scale

classifier design problem, i.e.,

– You shall first describe the approximation technique as laid out in the discussion of (1.5), (1.8).

– Then, you shall describe the iterative algorithm you have derived in Task 8.

– Apply the iterative algorithm on the complete training dataset and show the objective value vs. iter-

ation number. Discuss whether the algorithm converges and report on the performance of the designed

classifier.

5. Conclusions — In this section, you shall summarize the findings in the project, and discuss various

aspects that can be improved with the formulation, etc..

Throughout the report, please feel free to write answers that involve equations (e.g., Tasks 1–3) on paper and scan them into your Word/PDF report as figures. For Tasks 4 to 8, please include all the plots and comments as requested. For Task 8, please indicate the training error, testing error, number of accesses to the training dataset of the selected solution, and the % weight on ‘RevolvingUtilizationOfUnsecuredLines’ (we will also run your code to verify the reported values).

The program code has to be submitted separately. However, you are welcome to use excerpts from the program code in the report if you find it helpful for explaining your solution concepts.

Lastly, you are welcome to use online resources when preparing the project. However, you must give proper references for sources that are not your original creation.

Assessment Here is a breakdown of the assessment metric for the report writing component.

• (10%) Report Writing: A project report shall be readable by a person with knowledge in optimization (e.g., your classmates in FTEC2101/ESTR2520). Make sure that your report is written with clarity and, more importantly, in your own language!


• (10%) Innovation: You can earn innovation marks if you include extra experiments, presentations, etc. that are relevant to the project (with sufficient explanations); see Appendix A for some recommendations.

1.4 Submission

This is an individual project. While discussions regarding how to solve the problems are encouraged, students should answer the problems on their own (just like your HWs). The deadline of submission is May 13 (Friday), 2022, 23:59 (HKT). Please submit the following content to Blackboard:

• Your Project Report in PDF format.

• Your Program Codes [either in Jupyter notebook (.ipynb), or Julia code (.jl)].

In addition, the project report shall be submitted to VeriGuide for plagiarism check.

A Additional Information

Suggestions — The below are only suggestions for improving the performance of your classifier design in Task 8. Of course, you are more than welcome to propose and explore new ideas (but still, make sure that they are mathematically feasible)!

• Formulation Aspect – Here are some tricks to tweak the performance of your classifier design in Task 8.

1. The design of the weights {ℓ_i}_{i=1}^m may be crucial to the performance of your classifier. For instance, we observe that the gradient of f̂(w, b) can be written as

∇f̂(w, b) = (1/m) ∑_{i=1}^m ℓ_i · ∇ log(1 + e^{−y^(i)((x^(i))⊤w + b)}).

You may observe that the latter term ∇ log(1 + e^{−y^(i)((x^(i))⊤w + b)}) depends on x^(i). In particular, try to imagine what happens when ‖x^(i)‖ is extremely large7. In this regard, a possible remedy is to appropriately design ℓ_i to ‘downplay’ the effect of those samples with large ‖x^(i)‖ on the classifier design.

(Mind that such a pre-processing step to design ℓ_i will cost you one access to the training dataset.)
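One concrete weight design along this line (a hypothetical choice for illustration, not prescribed by the project) is to set ℓ_i inversely proportional to ‖x^(i)‖:

```python
import math

def downweight(X):
    """Illustrative weight design: samples with large norm ('outliers') get
    weights below 1, so they contribute less to the gradient of f_hat."""
    return [1.0 / max(1.0, math.sqrt(sum(xj * xj for xj in x_i))) for x_i in X]
```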

2. As an alternative approach, we recall from Task 2 that the optimization’s goal is to minimize the

number of training samples that violate the constraints in (1.3). We may replace the logistic loss function

by the sigmoid loss function, given by:

f̂(w, b) = (1/m) ∑_{i=1}^m ℓ_i · 1 / (1 + e^{y^(i)((x^(i))⊤w + b)}),

which puts less ‘penalty’ on solutions with y^(i)((x^(i))⊤w + b) ≪ −1 than the logistic loss. Note that the above function is not convex in (w, b), yet the projected gradient descent method continues to work, e.g., as shown in [Beck, 2017].

7The sample with extremely large ‖x(i)‖ is sometimes called an outlier.


3. To reduce the weight of a particular feature in the classifier, you may adopt a penalization approach (similar in principle to the Big-M method). For example, if you want to suppress the value of w₁, you may consider the following penalized objective function:

f̃(w, b) = f̂(w, b) + α |w₁|²,

for some α > 0 as an adjustable penalty parameter.

• Algorithm Aspect – For Task 8, with the constraint structure given, the recommended algorithm is the projected gradient descent (PGD) method, described as follows. Consider solving a general optimization problem:

min_{w ∈ R^d, b ∈ R}  f̂(w, b)   s.t.  (w, b) ∈ X.   (1.10)

With a slight abuse of notation, we denote x ≡ (w, b), and the PGD method can be described as

PGD Method

Input: x^(0) ∈ X, constant step size γ > 0, max. iteration number K_max.

For k = 0, . . . , K_max:  x^(k+1) = Proj_X{x^(k) − γ ∇f̂(x^(k))}

End For

For any (w, b) ∈ R^d × R, the projection onto X is given by (w_P, b_P) = Proj_X(w, b) with

w_P = (1 / max{1, ‖w‖}) w,  b_P = b.
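Putting the update and the projection together, here is a minimal Python sketch of PGD for (1.10) (the grad callback is assumed to return the gradient with respect to (w, b); all names are illustrative):

```python
import math

def project(w, b):
    """Projection onto X in (1.6): shrink w onto the unit ball, leave b free."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    scale = 1.0 / max(1.0, norm_w)
    return [wi * scale for wi in w], b

def pgd(grad, w, b, step, iters):
    """Projected gradient descent with a constant step size."""
    for _ in range(iters):
        gw, gb = grad(w, b)
        w = [wi - step * gi for wi, gi in zip(w, gw)]
        b = b - step * gb
        w, b = project(w, b)
    return w, b
```

Note that each iteration performs exactly one gradient evaluation, hence one access to the training dataset under the counting rule of Task 8.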

Besides, you are more than welcome to explore the use of other iterative algorithms, e.g., conditional gradient, backtracking line search, etc., for solving the optimization at hand.

• Presentation Aspect – Here are some suggestions for discussions which you may include to enrich the content of your report:

1. To understand and compare the hinge loss, the logistic loss, and the sigmoid loss, it may be helpful to plot the following 1-D functions:

Hinge(x) = max{0, 1 − x},  Logistic(x) = log(1 + e^{−x}),  Sigmoid(x) = 1 / (1 + e^{x}),

and discuss how each of them puts penalty on different values of x.
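The three 1-D losses can be coded directly for such a plot; a Python sketch:

```python
import math

def hinge(x):
    return max(0.0, 1.0 - x)

def logistic(x):
    return math.log(1.0 + math.exp(-x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(x))

# For x << 0 (badly violated margins), hinge and logistic grow roughly
# linearly in -x, while sigmoid saturates at 1.
```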

2. The performance of binary classification can be measured more precisely by considering the precision rate and recall rate. For a given labeled dataset (e.g., the testing dataset), they are defined respectively as:

Precision = (No. of samples with y_test^(i) = ŷ^(i) = 1) / (No. of samples with ŷ^(i) = 1),

Recall = (No. of samples with y_test^(i) = ŷ^(i) = 1) / (No. of samples with y_test^(i) = 1).

Note that y_test^(i) = ŷ^(i) = 1 means that the ith sample is a customer who defaulted and the classifier correctly identifies him/her as a customer who will default.
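Both rates can be computed with a few counters; a Python sketch (function name illustrative):

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (+1, i.e., 'default') class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    pred_pos = sum(1 for p in y_pred if p == 1)
    actual_pos = sum(1 for t in y_true if t == 1)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall
```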

Lastly, tips for implementing the MI-NLP, SOCP, etc. in the compulsory tasks have been included with the template program. Be reminded that it is not necessary to follow all the tips therein.

References

A. Beck. First-Order Methods in Optimization. SIAM, 2017.

C. De Sa. Support Vector Machine, Cornell University CS4780, 2022. URL https://www.cs.cornell.edu/courses/cs4780/2022sp/notes/LectureNotes13.html.