Date: 2024-05-02
Project Proposal for DS-UA 202 Responsible Data Science
Title: Technical Audit of the Home Credit Default Risk ADS
ADS Selection: Home Credit Default Risk from Kaggle
Model: https://www.kaggle.com/code/burcakaydn/home-credit-default-risk
1. Background
Automated Decision System (ADS) Overview:
The Home Credit Default Risk ADS, hosted on Kaggle, is a predictive model that estimates
clients' ability to repay loans. Its primary purpose is to assist financial institutions in
making informed lending decisions, thereby reducing the risk of credit defaults. The ADS
analyzes a wide range of financial and personal data to estimate the likelihood that a client
repays a loan.
Goals and Trade-offs:
The main goal of the ADS is to improve loan approval accuracy and minimize financial risk
by predicting default probabilities. Pursuing this objective involves trade-offs:
• Fairness vs. Accuracy: Increasing accuracy might involve complex models that could
inadvertently discriminate against certain demographic groups if not carefully managed.
• Risk Mitigation vs. Accessibility: While reducing default risk is crucial for financial stability,
overly cautious algorithms could deny loans to potentially reliable clients, particularly those
from underrepresented or financially disadvantaged backgrounds.
2. Input and Output
Data Description:
The Home Credit Default Risk competition on Kaggle provides data for predicting a client's
ability to repay loans. The data was collected from several sources related to the clients'
credit history, including credit bureau data, previous applications, previous credit amounts,
and other socio-demographic information. Combining these sources gives a multifaceted view
of the borrower's financial behavior and personal circumstances, which is useful for
predicting default risk.
Input Features:
• Data Collection and Selection: The data consists of several tables such as application
records, previous applications, credit balances, and more. It represents a mixture of
continuous, categorical, and ordinal data types, providing a holistic view of a customer's
creditworthiness.
• Feature Details:
    • Credit History (mostly numerical): Data on previous credits from credit bureaus,
    including the amount of the loan, duration, and repayment status.
    • Socio-demographic data (mostly categorical): Includes gender, education, family
    status, and the number of children.
    • Economic features (numerical and categorical): Employment duration, income
    type, and ownership of assets like cars or property.
• Missing Values and Distribution: Each table features a mix of complete and incomplete
records. Missing values are prevalent, particularly in socioeconomic indicators and external
database records.
• Profiling and Correlations: Profiling includes examining distributions, identifying outliers,
and understanding feature correlations. Pairwise correlations between numerical features
can reveal dependencies that are crucial for feature engineering and model accuracy.
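The profiling steps above can be sketched as follows. This is a minimal illustration, assuming pandas is available; the small DataFrame is a toy stand-in with invented values, and the three column names are used only as examples of application-table fields.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few application-table columns; all values are invented.
df = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, 150000.0, 200000.0, 250000.0, 300000.0],
    "AMT_CREDIT": [200000.0, 300000.0, 400000.0, 500000.0, 600000.0],
    "EXT_SOURCE_1": [0.5, np.nan, 0.3, np.nan, 0.7],
})

# Share of missing values per column, largest first.
missing_share = df.isna().mean().sort_values(ascending=False)

# Pairwise Pearson correlations between numerical columns.
# pandas skips missing values pair by pair when computing them.
corr = df.corr()

print(missing_share)
print(corr.round(2))
```

In this toy frame, income and credit amount are perfectly correlated by construction, which is the kind of dependency that would prompt a closer look before feature engineering.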
Output of the System:
• Type of Output: The ADS outputs a probability score that reflects the likelihood of a
client defaulting on a loan.
• Interpretation: This score is typically interpreted as a risk measure where higher values
indicate higher risk. Decisions such as loan approval or denial are made based on this
probability, with thresholds set to balance risk and opportunity.
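To make the role of the threshold concrete, here is a small sketch of how a probability score could be turned into a decision. The function name `decide` and the threshold values are hypothetical; the ADS itself only outputs a score, and any real cut-off would be a business choice.

```python
def decide(default_probability: float, threshold: float = 0.5) -> str:
    """Map a predicted default probability to a lending decision."""
    if not 0.0 <= default_probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return "deny" if default_probability >= threshold else "approve"


scores = [0.05, 0.30, 0.80]

# A lower threshold is more cautious: it denies more applicants.
strict = [decide(p, threshold=0.2) for p in scores]
lenient = [decide(p, threshold=0.5) for p in scores]

print(strict)   # ['approve', 'deny', 'deny']
print(lenient)  # ['approve', 'approve', 'deny']
```

The applicant scored 0.30 is approved under one threshold and denied under the other, which is where the trade-off between risk and access to credit becomes visible.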
3. Implementation and Validation
Data Cleaning and Pre-processing:
• Handling Missing Values: Given the prevalence of missing data across multiple features in
the Home Credit dataset, discuss methods such as imputation (mean, median, mode, etc.)
or algorithms that handle missing values natively.
• Encoding and Normalization: Detail the encoding of categorical variables using methods
like one-hot encoding or label encoding, and the normalization/standardization of numerical
variables to ensure consistent data scales.
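One way these cleaning steps could be combined is sketched below, assuming scikit-learn and pandas. The column names `income` and `income_type` and their values are hypothetical, and the choice of median imputation, most-frequent imputation, one-hot encoding, and standardization is one option among those listed above, not necessarily what the audited notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value in each column.
X = pd.DataFrame({
    "income": [100.0, np.nan, 300.0, 200.0],
    "income_type": pd.Series(
        ["Working", "Pensioner", np.nan, "Working"], dtype=object
    ),
})

numeric_cols = ["income"]
categorical_cols = ["income_type"]

preprocess = ColumnTransformer(
    transformers=[
        # Numerical: fill gaps with the median, then standardize.
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), numeric_cols),
        # Categorical: fill gaps with the mode, then one-hot encode.
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X_prepared = preprocess.fit_transform(X)
print(X_prepared.shape)  # one scaled column plus two one-hot columns
```

Fitting the imputers and scaler inside a pipeline keeps their statistics tied to the training data, which matters once cross-validation is introduced.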
High-level Implementation Overview:
• Model Framework: Highlight the use of ensemble learning techniques, particularly
Gradient Boosting Machines (GBM), which are popular for their effectiveness in similar risk
assessment tasks. Explain the rationale for model selection, including how well such models
cope with imbalanced targets such as loan defaults.
• Training Process: Describe the training process, emphasizing cross-validation techniques
such as StratifiedKFold, which preserves the percentage of samples for each class, and the
tuning of hyperparameters to optimize model performance.
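The training setup described above might look roughly like this. The sketch assumes scikit-learn, uses synthetic imbalanced data in place of the Home Credit tables, and uses `GradientBoostingClassifier` as a generic GBM; the audited notebook may rely on a different library and different hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the loan data (roughly 10% positives).
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Each test fold keeps roughly the overall share of the positive class.
fold_positive_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]

model = GradientBoostingClassifier(n_estimators=50, random_state=0)
auc_per_fold = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print("positive rate per fold:", [round(r, 3) for r in fold_positive_rates])
print("ROC-AUC per fold:", auc_per_fold.round(3))
```

Hyperparameter tuning would wrap the same `cv` object in a search such as a grid or randomized search, so that every candidate is scored on identically stratified folds.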
Validation of the ADS:
• Testing Methodology: Explain the use of out-of-time validation, where the model is
tested on a different time period than it was trained on, to simulate real-world
performance and avoid temporal biases.
• Performance Metrics: Discuss how ROC-AUC, Precision-Recall AUC, and other
relevant metrics are used to evaluate the model's ability to predict defaults accurately,
ensuring it meets its predictive goals.
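As a small worked example of the two metrics, the sketch below scores four hand-written predictions with scikit-learn. The labels and scores are invented, and average precision is used here as one common summary of the precision-recall curve.

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# 1 = default, 0 = repaid; scores are predicted default probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)

print(round(roc_auc, 4))  # 0.75
print(round(pr_auc, 4))   # 0.8333
```

Three of the four (defaulter, non-defaulter) pairs are ranked correctly, which gives a ROC-AUC of 0.75. Average precision weights precision at each recall step, here 1.0 and 2/3, giving about 0.83.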
4. Outcomes
Accuracy Analysis:
• Metrics Justification: Justify the use of ROC-AUC as a threshold-independent measure of
ranking quality, noting that it can look optimistic when defaults are rare, and pair it with
Precision and Recall to capture both the correctness and the completeness of the positive
(default) class predictions.
• Subpopulation Analysis: Plan an analysis that looks at performance metrics across various
groups (e.g., income level, geographical region) to uncover any discrepancies in model
predictions.
Fairness Analysis:
• Fairness Metrics: Select metrics like Equal Opportunity (similar true positive rates, and
hence similar false negative rates, across groups) or Predictive Equality (similar false
positive rates across groups) to assess whether the model unduly favors one group over
another.
• Analysis Plan: Outline how these fairness assessments will be conducted, using
disaggregated performance data to identify and address potential biases.
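A disaggregated error-rate check of this kind can be sketched in plain Python. The helper `error_rates_by_group`, the group labels, and the records are all hypothetical. The positive class is taken to be "default", so a false positive is a reliable applicant flagged as a defaulter.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Return {group: (FPR, FNR)} from (group, y_true, y_pred) triples."""
    counts = defaultdict(lambda: {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
    for group, y_true, y_pred in records:
        if y_true == 1:
            key = "tp" if y_pred == 1 else "fn"
        else:
            key = "fp" if y_pred == 1 else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]
        positives = c["fn"] + c["tp"]
        # Rates are undefined for a group with no negatives or no positives.
        fpr = c["fp"] / negatives if negatives else float("nan")
        fnr = c["fn"] / positives if positives else float("nan")
        rates[group] = (fpr, fnr)
    return rates


# Invented predictions for two groups, A and B.
records = [
    ("A", 1, 1), ("A", 1, 0), ("A", 0, 0), ("A", 0, 1),
    ("B", 1, 1), ("B", 1, 1), ("B", 0, 0), ("B", 0, 0),
]
rates = error_rates_by_group(records)

# Predictive Equality compares FPRs; Equal Opportunity compares FNRs (or TPRs).
fpr_gap = abs(rates["A"][0] - rates["B"][0])
fnr_gap = abs(rates["A"][1] - rates["B"][1])
print(rates, fpr_gap, fnr_gap)
```

In this toy data both gaps are 0.5, meaning group A bears all of the model's errors. What size of gap counts as acceptable is a judgment the audit would need to argue for.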
Additional Performance Analysis:
• Stability and Robustness: Propose methods such as stress testing the model under
various economic scenarios to assess how changes in input data affect predictions.
• Important Examples: Discuss the importance of correctly predicting outcomes for
borderline cases, which are often the most challenging and impactful for financial decisions.
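One simple form of stress test is to shift an input and observe how predictions move. The sketch below assumes scikit-learn and NumPy and is purely illustrative: the data is synthetic, the single `income` feature is hypothetical, and a logistic regression stands in for the audited model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Hypothetical standardized income; lower income raises default probability.
income = rng.normal(size=n)
p_default = 1.0 / (1.0 + np.exp(1.0 + 1.5 * income))
y = (rng.random(n) < p_default).astype(int)

model = LogisticRegression().fit(income.reshape(-1, 1), y)


def mean_predicted_default(feature: np.ndarray) -> float:
    """Average predicted default probability for a given income vector."""
    return float(model.predict_proba(feature.reshape(-1, 1))[:, 1].mean())


baseline = mean_predicted_default(income)
# Stress scenario: every applicant's income drops by half a standard deviation.
stressed = mean_predicted_default(income - 0.5)
shift = stressed - baseline

print(f"baseline={baseline:.3f} stressed={stressed:.3f} shift={shift:+.3f}")
```

The same pattern extends to borderline cases: restricting the comparison to applicants whose baseline score sits near the decision threshold shows how many decisions would flip under the scenario.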
5. Summary
Reflections on Data and Implementation:
• Appropriateness of Data: Reflect on the comprehensiveness and diversity of the dataset,
considering if it accurately reflects the broader applicant pool.
• Robustness, Accuracy, and Fairness: Critique the implementation's robustness and
fairness, discussing the implications of the chosen metrics for different stakeholders.
Deployment Considerations:
• Public Sector vs. Industry: Evaluate the ethical and practical implications of deploying
this ADS in settings like public finance versus commercial banking, considering the
potential impacts on different demographics.
Recommendations for Improvement:
• Data Collection: Suggest more inclusive data collection practices to capture
underrepresented groups more accurately.
• Processing and Analysis: Recommend interpretability tools, such as feature-importance or
attribution methods, to enhance transparency and trust in the ADS.
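As one example of an interpretability tool, permutation importance measures how much a model's score drops when a feature's values are shuffled. The sketch assumes scikit-learn; the data is synthetic and the feature names `signal` and `noise` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 2))
y = (X[:, 0] > 0).astype(int)  # only the first feature drives the label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature on held-out data and record the drop in accuracy.
result = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=0
)
importances = dict(zip(["signal", "noise"], result.importances_mean))
print(importances)
```

Applied to the ADS, a large importance for a protected attribute or a close proxy would be a finding worth reporting alongside the fairness metrics.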