In this coursework we will continue our study of mortgages in the US, but now we will analyze results
at a zipcode level. The question that we want to answer is “can satellite images help our modelling
process?”. For this, you are given two datasets:
A. Aggregated variable information at a zipcode level of the origination of the mortgages. This
dataset has a BinaryDefault variable, which represents those zipcodes deemed to be of high
risk and those deemed to be of low risk. The variables that are available are:
a. Fico: Average FICO score of the area.
b. mi_pct cnt_units: Percentage of mortgages with insurance.
c. Cltv, ltv: Average LTV and CLTV of area.
d. cnt_borr: Number of borrowers in area.
e. occpy_sts_S: Percentage of users with occupancy status Single Home (S) in area.
f. channel_C: Percentage of cases with channel C in area.
g. channel_T: Percentage of cases with channel T in area.
h. prop_type_MH: Percentage of cases with property type MH in area.
i. prop_type_PU: Percentage of cases with property type PU in area.
j. loan_purpose_N: Percentage of cases with purpose of the loan type N in area.
k. Area_Number: Zipcode, coded to a meaningless area number. Non-predictive.
l. BinaryDefault: Whether the area is high risk (in the highest 30% of default rates
nationwide) or not. Target variable.
B. Samples of satellite images¹ under different conditions for the different zipcodes. They
amount to approximately 2GB of data.
In this coursework, you will develop a multimodal deep learning model for this problem, and
compare it against other alternative models, using what you have learned in the lectures. With this
information, the datasets, and your knowledge from the course, answer the following questions:
1. (10%) Identify the train / test sample at area level (i.e. some areas for test and some for
train that are in the Data folder with images) and create a logistic regression model that,
using only the structured data, can predict whether an area is high risk or low risk. Discuss
the performance of the model, the most important variables of your model, and the
rationale of your decisions and outputs.
2. (30%) Choose from TensorFlow Hub a model that’s adequate for your problem². Explain the
model in detail by researching the literature. What layers does it use? Why those layers?
Discuss your choice. Fine-tune a deep learning model able to predict high and low risk zones.
Explain what parameters you used to train it (optimizer, trainable layers, learning rate, etc.),
and your choice of architecture for the dense and output layers. Is it able to find meaningful
patterns using only the images? Why do you think this is?
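For question 1, the area-level split and the structured-data baseline could be sketched as follows. This is only an illustration under stated assumptions: the data here is synthetic, the five predictors are placeholders for the real variables, and `areas_with_images` is a hypothetical set standing in for the area numbers that actually have images in the Data folder.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the structured table: one row per area (assumption).
n_areas = 400
area_number = np.arange(n_areas)
X = rng.normal(size=(n_areas, 5))  # placeholder predictors, not the real ones
logits = 1.2 * X[:, 0] - 0.8 * X[:, 1] - 0.85
y = (rng.random(n_areas) < 1 / (1 + np.exp(-logits))).astype(int)

# Hypothetical: pretend that only some areas have an image folder.
areas_with_images = set(area_number[rng.random(n_areas) < 0.9].tolist())
keep = np.array([a in areas_with_images for a in area_number])

# Split the area identifiers, so every area is entirely in train or in test.
train_areas, test_areas = train_test_split(
    area_number[keep], test_size=0.3, stratify=y[keep], random_state=0
)
train_mask = np.isin(area_number, train_areas)
test_mask = np.isin(area_number, test_areas)

# Area_Number is only used for splitting; it is never passed as a predictor.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X[train_mask], y[train_mask])

test_scores = model.predict_proba(X[test_mask])[:, 1]
test_auc = roc_auc_score(y[test_mask], test_scores)

# With standardized inputs, coefficient magnitudes give a rough importance ranking.
coefficients = model.named_steps["logisticregression"].coef_.ravel()
print(f"test AUC: {test_auc:.3f}")
```

Keeping the same `train_areas` and `test_areas` for the image and multimodal models makes the AUC scores of all models directly comparable.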
1 The data is available for direct download from Google Colab (using gdown). The link is
https://drive.google.com/uc?id=1k7QmTjzk4hFrAnO_x8YvndoWyZ2H_LZw
I advise you to download it to your Google Drive folder and mount it from there, so that you do
not have to download it every time you need to work on it.
2 Any StateModel model that gives you feature vectors or feature classification without the classifier head.
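As a rough illustration of what "without the classifier head" implies for question 2, the sketch below attaches new dense and output layers to a frozen feature-vector backbone. Everything named here is an assumption: `build_image_classifier` is a hypothetical helper, the layer sizes, dropout rate and learning rate are arbitrary starting points, and `toy_backbone` is a tiny stand-in used only so that the example runs without downloading a pretrained model.

```python
import numpy as np
import tensorflow as tf


def build_image_classifier(backbone, image_shape=(224, 224, 3), dropout=0.3):
    """Attach a small binary classification head to a feature-vector backbone."""
    inputs = tf.keras.Input(shape=image_shape, name="image")
    features = backbone(inputs)  # expected to return one feature vector per image
    x = tf.keras.layers.Dropout(dropout)(features)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid", name="high_risk")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model


# Stand-in backbone (assumption): replace with the pretrained model you choose.
toy_backbone = tf.keras.Sequential(
    [
        tf.keras.layers.Conv2D(8, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
    ],
    name="toy_backbone",
)
toy_backbone.trainable = False  # frozen: only the new head is trained

image_model = build_image_classifier(toy_backbone, image_shape=(32, 32, 3))

# Random images, used only to check that the model runs end to end.
images = np.random.default_rng(0).random((4, 32, 32, 3)).astype("float32")
predictions = image_model.predict(images, verbose=0)
print(predictions.shape)
```

In the coursework itself the stand-in would be replaced by the chosen pretrained feature-vector model; how it is wrapped as a Keras layer depends on the installed TensorFlow and Keras versions. Unfreezing some backbone layers with a low learning rate is one common way to fine-tune further.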
FM 9528 - Banking Analytics Coursework 3
3. (30%) Combine the structured input and the images, plus the pretrained model, to create
a multimodal deep learning model that takes into account both inputs into a single neural
network. Use the Keras Model API to create this model. Discuss the reasons for your choice
of architecture and parameter decisions, report the AUC scores of all models and compare
the performance. Do the satellite images help? How does the performance of the model
compare against the models you previously trained? Why do you think this happens?
4. (20%) Discuss the ethical, legal, and other challenges of using satellite images in the context
of credit risk. Discuss with sources the following questions: What are the potential sources
of bias in satellite images? Is this reflected in your model? What are the ethical implications
of your findings? What potential legal ramifications can exist? Finalize by giving your opinion
on the use of satellite images for the purposes of credit risk analytics.
The remaining 10% corresponds to formatting and presentation according to the rubric.
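For question 3, a two-input network built with the Keras Model (functional) API could look roughly like the sketch below. It is one possible layout, not a required architecture: `build_multimodal_model` is a hypothetical helper, the branch widths are arbitrary, `n_features=10` is a placeholder for however many structured predictors you keep, and `toy_backbone` again stands in for the pretrained image model.

```python
import numpy as np
import tensorflow as tf


def build_multimodal_model(backbone, n_features, image_shape=(224, 224, 3)):
    """Join an image branch and a structured-data branch into one network."""
    image_in = tf.keras.Input(shape=image_shape, name="image")
    tabular_in = tf.keras.Input(shape=(n_features,), name="structured")

    # Image branch: pretrained features compressed to a small vector.
    img = backbone(image_in)
    img = tf.keras.layers.Dense(32, activation="relu")(img)

    # Structured branch: assumes inputs were standardized beforehand.
    tab = tf.keras.layers.Dense(16, activation="relu")(tabular_in)

    # Late fusion: concatenate both representations, then classify.
    merged = tf.keras.layers.Concatenate()([img, tab])
    x = tf.keras.layers.Dense(16, activation="relu")(merged)
    output = tf.keras.layers.Dense(1, activation="sigmoid", name="high_risk")(x)

    model = tf.keras.Model(inputs=[image_in, tabular_in], outputs=output)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model


# Stand-in backbone (assumption): replace with the pretrained model you choose.
toy_backbone = tf.keras.Sequential(
    [
        tf.keras.layers.Conv2D(8, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
    ],
    name="toy_backbone",
)
toy_backbone.trainable = False

multimodal = build_multimodal_model(toy_backbone, n_features=10, image_shape=(32, 32, 3))

# Random inputs, used only to check shapes; each image is paired with the
# structured row of the same area.
rng = np.random.default_rng(0)
images = rng.random((4, 32, 32, 3)).astype("float32")
structured = rng.normal(size=(4, 10)).astype("float32")
scores = multimodal.predict([images, structured], verbose=0)
print(scores.shape)
```

Because an area can have several images while the structured table has one row per area, the structured row would need to be repeated for each image of that area, and image-level scores aggregated per area before computing the test AUC.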
Conditions of the coursework
Software: You must use Python to run the numerical calculations over your portfolio. A copy of your
Jupyter notebook must be attached to the coursework as an appendix in readable format, and a link
to the notebook (either Colab or direct download from a cloud location) must also be included.
Instructions on how to export to PDF can be found here:
The notebook text MUST be machine readable (so no screenshots of the notebook, please);
otherwise a 25% discount will apply.
Word Limit: 2000 words. A deviation of +/-10% either side of the word count is deemed to be
acceptable. Any text that exceeds the additional 10% will not attract any marks. The relevant word count includes items
such as cover page, executive summary, title page, table of contents, tables, figures, in-text citations
and section headings, if used. The relevant word count excludes your list of references and any
appendices at the end of your coursework submission (including the code).
You should always include the word count (from Microsoft Word, not Turnitin), at the end of your
coursework submission, before your list of references.
Title/Cover Page: You must include a title/cover page that includes your Student ID, Course Code,
Assignment Title, and Word Count. This assignment will be marked anonymously, so please ensure that
your name does not appear on any part of your assignment; otherwise a discount will be applied.
Submission Deadline: December 18th, 23:59. This deadline is final and cannot be modified!
Turnitin Submission: The assignment MUST be submitted electronically via OWL. All required
papers may be subject to submission for textual similarity review to the commercial plagiarism
detection software under license to the University for the detection of plagiarism. All papers
submitted for such checking will be included as source documents in the reference database for the
purpose of detecting plagiarism of papers subsequently submitted to the system. Use of the service
is subject to the licensing agreement, currently between The University of Western Ontario and
Late Submission: Late submissions are possible up to two days after the deadline. There is a linear
10% penalty per day of late submission (Final mark = Original mark – 10% * day) subtracted directly
from the final mark. Submissions after the two days are not accepted and will be considered a non-