Python代写-FIN7790-Assignment 1
时间:2022-01-19
FIN7790 – Assignment 1 – Predicting Loan Default Page 1 of 6
12/23/202
20% of your final grade
Predicting Loan Default
DUE Saturday 22nd January, 2022 9:00am
(Please submit via Moodle. A penalty of 5 marks per day will be imposed on late submissions.)
What is the Goal
This assignment is a group assignment with an aim to get familiar with the process of developing a machine
learning model for a financial application. You are expected to make use of the provided loan data, explore the
data, pre-process the data for model training, train various machine learning models using the pre-processed
data, evaluate the performance of the models, and use the selected model to make prediction. The model will
be developed using Python and the sci-kit learn & other relevant libraries.
Lending
Lending is one of the most important activities of the finance industry. Lenders provide loans to borrowers in
exchange for the promise of repayment with interest. That means the lender makes a profit only if the
borrower pays off the loan. Hence, the two most critical questions in the lending industry are:
◦ How risky is the borrower?
◦ Given the borrower’s risk, should a loan be approved?
Default prediction could be described as a perfect job for machine learning, as the algorithms can be trained on
millions of examples of consumer data. Algorithms can perform automated tasks such as matching data
records, identifying exceptions, and calculating whether an applicant qualifies for a loan. The underlying trends
can be assessed with algorithms and continuously analysed to detect trends that might influence lending and
underwriting risk in the future.
The goal of this assignment is to build machine learning models to predict the probability that a loan will default.
In most real-life cases, including loan default modelling, we are unable to work with clean, complete data.
Some of the potential problems we are bound to encounter are missing values, incomplete categorical data,
and irrelevant features. Although data cleaning may not be mentioned often, it is critical for the success of
machine learning applications. The algorithms that we use can be powerful, but without the relevant or
appropriate data, the system may fail to yield ideal results. One focus of this assignment is data preparation
and cleaning. Various techniques and concepts of data processing, feature selection, and exploratory data
analysis are used for data cleaning and organizing the feature space.
Loan Dataset
This assignment will be based on loan data from LendingClub Bank (https://www.lendingclub.com/). The bank
is a US peer-to-peer lending company. It is the first peer-to-peer lender to register its offerings as securities with
the US Securities and Exchange Commission (SEC), and to offer loan trading on a secondary market. It is the
world's largest peer-to-peer lending platform that enables borrowers to obtain a loan and investors to purchase
Notes (a product no longer available) backed by payments made on these loans.
The dataset from 2007 to 2018 contains 100K observations each with 150+ variables. Given the historical
data on loans given out with information on whether or not the borrower defaulted (charge-off), you are
expected to build a number of machine learning models (in Python) that can predict if a borrower will pay back
FIN7790 – Assignment 1 – Predicting Loan Default Page 2 of 6
12/23/2021
the loan. This way in the future when the company gets a new potential customer, it can assess if the
potential customer is likely to pay back the loan.
In the loan dataset, the target variable is “loan_status”. The target variable can actually carry different values.
You are going to try to learn differences in the features between completed loans that have been fully paid or
charged off. You won't consider loans that are current, don't meet the credit policy, defaulted, or have a
missing status. So you only keep the loans with status "Fully Paid" or "Charged Off". A charged off debt is a
debt that a creditor has given up trying to collect on after a borrower has missed payments for several months.
The table below lists the variables in the dataset.
# Variable Name Description
1 acc_now_delinq Number of accounts on which the borrower is now delinquent.
2 acc_open_past_24mths Number of accounts opened in past 24 months.
3 addr_state The US state provided by the borrower.
4 all_util Balance to credit limit on all accounts.
5 annual_inc Annual income provided by the borrower.
6 annual_inc_joint The combined self-reported annual income.
7 application_type Whether the loan is an individual application or a joint application with two co-borrowers.
8 avg_cur_bal Average current balance of all accounts.
9 bc_open_to_buy Total open to buy on revolving bankcards.
10 bc_util Ratio of total current balance to high credit/credit limit for all bankcard accounts.
11 chargeoff_within_12_mths Number of charge-offs within 12 months.
12 collection_recovery_fee Post-charge off collection fee.
13 collections_12_mths_ex_med Number of collections in 12 months excluding medical collections.
14 delinq_2yrs The number of 30+ days past-due incidences of delinquency in the borrower's credit file for the
past 2 years.
15 delinq_amnt The past-due amount owed for the accounts on which the borrower is now delinquent.
16 desc Loan application description provided by the borrower.
17 dti A ratio calculated using the borrower’s total monthly debt payments on the total debt
obligations (excluding mortgage and the requested loan), divided by the borrower’s self-
reported monthly income.
18 dti_joint A ratio calculated using the co-borrowers' total monthly payments on the total debt obligations
(excluding mortgages and the requested loan), divided by the co-borrowers' combined self-
reported monthly income.
19 earliest_cr_line The month the borrower's earliest reported credit line was opened.
20 emp_length Employment length in years (from < 1 year to 10+ years).
21 emp_title Job title provided by the borrower.
22 fico_range_high The upper boundary of the borrower’s FICO score (https://www.myfico.com/credit-
education/what-is-a-fico-score).
23 fico_range_low The lower boundary of the borrower’s FICO score (https://www.myfico.com/credit-
education/what-is-a-fico-score).
24 funded_amnt The amount committed to the loan.
25 funded_amnt_inv The amount committed by investors for that loan.
26 grade Grade assigned to the loan.
27 home_ownership Home ownership status either provided by the borrower or obtained from the credit report.
28 id A unique ID for the loan.
29 il_util Ratio of total current balance to high credit/credit limit on all install accounts.
30 initial_list_status The initial listing status of the loan.
31 inq_fi Number of personal finance inquiries.
32 inq_last_12m Number of credit inquiries in past 12 months.
33 inq_last_6mths Number of inquiries in past 6 months (excluding auto and mortgage inquiries).
34 installment Monthly payment owed by the borrower upon loan approval.
35 int_rate Interest rate on the loan.
36 issue_d The month in which the loan was funded.
37 last_credit_pull_d The most recent month the lender pulled credit for this loan.
38 last_fico_range_high The upper boundary of the borrower’s last FICO score.
39 last_fico_range_low The lower boundary of the borrower’s last FICO score.
40 last_pymnt_amnt Amount received in the last payment.
41 last_pymnt_d The month the last payment was received.
42 loan_amnt The loan amount applied for by the borrower. If the credit department reduces the loan amount,
it will be reflected in this value.
43 loan_status Loan status (including if it is a charged off loan).
44 max_bal_bc Maximum current balance owed on all revolving accounts.
45 member_id A unique member Id for the borrower.
46 mo_sin_old_il_acct Number of months since oldest bank installment account opened.
47 mo_sin_old_rev_tl_op Number of months since oldest revolving account opened.
FIN7790 – Assignment 1 – Predicting Loan Default Page 3 of 6
12/23/2021
48 mo_sin_rcnt_rev_tl_op Number of months since most recent revolving account opened.
49 mo_sin_rcnt_tl Number of months since most recent account opened.
50 mort_acc Number of mortgage accounts.
51 mths_since_last_delinq Number of months since the borrower's last delinquency.
52 mths_since_last_major_derog Number of months since most recent 90-day or worse rating.
53 mths_since_last_record Number of months since the last public record.
54 mths_since_rcnt_il Months since most recent installment accounts opened.
55 mths_since_recent_bc Number of months since most recent bankcard account opened.
56 mths_since_recent_bc_dlq Number of months since most recent bankcard delinquency.
57 mths_since_recent_inq Number of months since most recent inquiry.
58 mths_since_recent_revol_delinq Number of months since most recent revolving delinquency.
59 next_pymnt_d The month by which the next scheduled payment should be received.
60 num_accts_ever_120_pd Number of accounts ever 120 or more days past due.
61 num_actv_bc_tl Number of currently active bankcard accounts.
62 num_actv_rev_tl Number of currently active revolving accounts.
63 num_bc_sats Number of satisfactory bankcard accounts.
64 num_bc_tl Number of bankcard accounts.
65 num_il_tl Number of installment accounts.
66 num_op_rev_tl Number of open revolving accounts.
67 num_rev_accts Number of revolving accounts.
68 num_rev_tl_bal_gt_0 Number of revolving accounts with balance >0.
69 num_sats Number of satisfactory accounts.
70 num_tl_120dpd_2m Number of accounts currently 120 days past due (updated in past 2 months).
71 num_tl_30dpd Number of accounts currently 30 days past due (updated in past 2 months).
72 num_tl_90g_dpd_24m Number of accounts 90 or more days past due in last 24 months.
73 num_tl_op_past_12m Number of accounts opened in past 12 months.
74 open_acc Number of open credit lines in the borrower's credit file.
75 open_acc_6m Number of open accounts in last 6 months.
76 open_il_12m Number of installment accounts opened in past 12 months.
77 open_il_24m Number of installment accounts opened in past 24 months.
78 open_act_il Number of currently active installment accounts.
79 open_rv_12m Number of revolving accounts opened in past 12 months.
80 open_rv_24m Number of revolving accounts opened in past 24 months.
81 out_prncp Remaining outstanding principal for total amount funded.
82 out_prncp_inv Remaining outstanding principal for the portion funded by investors.
83 pct_tl_nvr_dlq Percent of accounts never delinquent.
84 percent_bc_gt_75 Percentage of all bankcard accounts > 75% of limit.
85 policy_code “1” for publicly available products. “2” for new products not publicly available.
86 pub_rec Number of derogatory public records.
87 pub_rec_bankruptcies Number of bankruptcy public records.
88 purpose The category of the loan’s intended usage chosen by the borrower in the loan request.
89 pymnt_plan Whether a payment plan has been put in place for the loan.
90 recoveries Post-charge off gross recovery.
91 revol_bal Total credit revolving balance.
92 revol_util Revolving line utilization rate, or the amount of credit the borrower is using relative to all
available revolving credit.
93 sub_grade Subgrade assigned to the loan.
94 tax_liens Number of tax liens.
95 term Number of monthly payments on the loan (can be either 36 or 60).
96 title The loan title provided by the borrower.
97 tot_coll_amt Total collection amounts ever owed.
98 tot_cur_bal Total current balance of all accounts.
99 tot_hi_cred_lim Total high credit/credit limit.
100 total_acc Total number of credit lines currently in the borrower's credit file.
101 total_bal_ex_mort Total credit balance excluding mortgage.
102 total_bal_il Total current balance of all installment accounts.
103 total_bc_limit Total bankcard high credit/credit limit.
104 total_cu_tl Number of finance accounts.
105 total_il_high_credit_limit Total installment high credit/credit limit.
106 total_pymnt Payments received to date for total amount funded.
107 total_pymnt_inv Payments received to date for the portion funded by investors.
108 total_rec_int Interest received to date.
109 total_rec_late_fee Late fees received to date.
110 total_rec_prncp Principal received to date.
111 total_rev_hi_lim Total revolving high credit/credit limit.
112 url URL for the page with the loan information.
113 verification_status Income verification status.
114 verified_status_joint Co-borrowers' joint income verification status.
115 zip_code The first 3 digits of the zip code provided by the borrower.
116 revol_bal_joint Sum of revolving credit balance of the co-borrowers, net of duplicate balances.
FIN7790 – Assignment 1 – Predicting Loan Default Page 4 of 6
12/23/2021
117 sec_app_fico_range_low FICO range (high) for the secondary applicant.
118 sec_app_fico_range_high FICO range (low) for the secondary applicant.
119 sec_app_earliest_cr_line Earliest credit line at time of application for the secondary applicant.
120 sec_app_inq_last_6mths Credit inquiries in the last 6 months at time of application for the secondary applicant.
121 sec_app_mort_acc Number of mortgage accounts at time of application for the secondary applicant.
122 sec_app_open_acc Number of open accounts at time of application for the secondary applicant.
123 sec_app_revol_util Ratio of total current balance to high credit/credit limit for all revolving accounts.
124 sec_app_open_act_il Number of currently active installment accounts at time of application for the secondary
applicant.
125 sec_app_num_rev_accts Number of revolving accounts at time of application for the secondary applicant.
126
sec_app_chargeoff_within_12_mths Number of charge-offs within last 12
months at time of application for the secondary applicant.
127
sec_app_collections_12_mths_ex_med Number of collections within last 12
months excluding medical collections at time of application
for the secondary applicant.
128
sec_app_mths_since_last_major_derog Number of months since most recent
90-day or worse rating at time of application for the
secondary applicant.
129 hardship_flag Whether or not the borrower is on a hardship plan.
130 hardship_type Describes the hardship plan offering.
131 hardship_reason Reason behind the need of a hardship plan.
132 hardship_status Whether the hardship plan is active, pending, canceled, completed, or broken.
133 deferral_term Number of months that the borrower is expected to pay less than the contractual monthly
payment amount due to a hardship plan.
134 hardship_amount Interest payment that the borrower has committed to make each month while on a hardship
plan.
135 hardship_start_date The month in which the hardship plan starts.
136 hardship_end_date The month in which the hardship plan ends.
137 payment_plan_start_date The month in which the payment plan starts.
For example, if a borrower has a hardship plan period of 3 months, the start date is the start of
the three-month period in which the borrower is allowed to make interest-only payments.
138 hardship_length Number of months the borrower will make smaller payments than normally obligated due to a
hardship plan.
139 hardship_dpd Account days past due as of the hardship plan start date.
140 hardship_loan_status Loan status as of the hardship plan start date.
141
orig_projected_additional_accrued_interest Original projected
additional interest amount that will accrue for the given hardship plan
as of
the hardship start date. This field will be null if the borrower has broken their hardship payment
plan.
142 hardship_payoff_balance_amount Payoff balance amount as of the hardship plan start date.
143 hardship_last_payment_amount Last payment amount as of the hardship plan start date.
144 disbursement_method Method by which the borrower receives their loan.
145 debt_settlement_flag Whether the borrower, who has charged-off, is working with a debt-settlement company.
146 debt_settlement_flag_date The most recent date that the debt_settlement_flag was set.
147 settlement_status The status of the borrower’s settlement plan.
148 settlement_date The month in which the borrower agreed to the settlement plan.
149 settlement_amount Loan amount that the borrower agreed to settle for.
150 settlement_percentage Settlement amount as a percentage of the payoff balance amount on the loan.
151 settlement_term Number of months that the borrower will be on the settlement plan.
You are expected to examine the data and perform feature engineering over the data. For text features,
please pay attention to the possibility of having trailing spaces. For numerical features, please also pay
attention to the existence of outliers and inliers. Relevance of the data to the prediction variable also needs to
be assessed.
Key Considerations
The following questions should be considered during the model building process:
◦ How many samples are there in the dataset?
◦ Which features are available in the dataset?
◦ What is the distribution of numerical feature values across the samples?
◦ What is the total count of missing values per feature?
◦ How many unique values are there for each feature?
◦ What does the number of unique values tell us about a feature?
FIN7790 – Assignment 1 – Predicting Loan Default Page 5 of 6
12/23/2021
◦ Are features assigned the right data types?
◦ Any encoding required for categorical features?
◦ Any feature needs to be scaled and normalized?
◦ Any data cleansing required?
◦ How are the features correlated with one another?
◦ How are the features correlated to the target?
◦ Are all features required to train the model?
◦ Which features are you going to drop and which to keep?
◦ What training dataset to use and what validation dataset to use?
◦ What hyperparameters to use for each model?
◦ How to determine the hyperparameter values to use?
◦ How to evaluate the performance of each model?
◦ Which model turns out to be the best in terms of prediction performance?
How you will be Assessed
In addition to the Python, each group is expected to submit a report with the following structure:
1. Summary of the provided loan dataset (5%)
2. Exploratory data analysis (15%)
3. Data pre-processing (25%)
4. Model training (KNN) (5%)
5. Model training (Naïve Bayes) (5%)
6. Model training (Decision Tree) (5%)
7. Model training (SVM) (5%)
8. Model evaluation (15%)
9. Making prediction (5%)
Other assessment criteria are:
10. Organisation (5%)
11. Grammar (5%)
12. Formatting (5%)
Stick to the Formatting Requirements
The report should stick to the following formatting requirements.
◦ The report should be written in Microsoft PowerPoint format. (No PDF please)
◦ The report is expected to have: header (optional), footer (group number), page numbers, and
description (project/case name) on each slide.
◦ Use an Appendix (at the end of the report) for information about any supporting information and
documents.
Before the due date and time, each group should upload one zip file with the file name Group_Number.zip
(e.g. “Group_1.zip”) to Moodle. The zip file should contain all relevant files including the report (e.g.
FIN7790 – Assignment 1 – Predicting Loan Default Page 6 of 6
12/23/2
“Group_1.ppt”) and other reference articles or materials. All files in the package should follow the same
naming convention above plus a unique 2 digits numeric extension to prevent duplicated file names (e.g.
“Group_1_01_Loan_Default_Assessment.pdf”). Files not named in the specified fashion will cause marks
to be deducted.