UNIVERSITY OF CARDIFF
MAT012 Credit Risk Scoring
Assignment 2019/20

This forms your assessment (100%) of this module. There are two parts to this assessment.
Part A contains THREE short essay-based questions and counts for 50% of the final mark.
Part B contains FIVE tasks to establish a scorecard using the given dataset and counts for 50% of the final mark.
You may use Excel, SAS, R or Python to assist in the scorecard preparation. You must answer ALL questions.

Submission must be made by 3pm on Friday 20th March via Learning Central, and instructions will follow shortly on how to do this. You will need to submit a single file containing answers to all questions; any spreadsheet analysis, workings or coding necessary can be shown in an Appendix in that file. Only the submitted file will be marked.

PART A

1. Critically examine what needs to be considered when developing a credit risk scoring model. [20 marks]

2. Explain how, in theory, Cox’s proportional hazards model for survival analysis can be used for constructing a scorecard. Comment on the relative popularity of Cox’s PH model versus logistic regression in scorecard construction. [15 marks]

3. Provide a brief literature review on the use of Markov models in credit risk modelling, with a particular focus on those used in credit risk scoring. [15 marks]

PART B

The dataset underpinning the analysis here is that used in the lab sessions during lectures. It has been uploaded as a spreadsheet named ‘German’ together with the data dictionary ‘German data dictionary’ describing each attribute. You will recall that the dataset consists of data for 1000 applicants along with a variable that says whether they were subsequently Good or Bad from a credit perspective.

1. Split the dataset into two subsets as follows:
   Subset 1: the applicants with Duration <= 12 months
   Subset 2: the applicants where Duration > 12 months
   Clean the subsets if necessary. [5 marks]
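
As an illustration of the split described in task 1, the sketch below uses plain Python on a handful of invented records. The field names `id` and `Good`, the toy values, and the cleaning rule (rows with a missing or non-positive Duration are set aside) are assumptions for the example only; the real ‘German’ spreadsheet may need different checks.

```python
# Toy records standing in for rows of the 'German' spreadsheet (values invented).
applicants = [
    {"id": 1, "Duration": 6, "Good": 1},
    {"id": 2, "Duration": 12, "Good": 0},
    {"id": 3, "Duration": 24, "Good": 1},
    {"id": 4, "Duration": None, "Good": 1},  # missing duration: needs cleaning
    {"id": 5, "Duration": 48, "Good": 0},
]


def split_by_duration(rows, threshold=12):
    """Return (short, long, rejected).

    short    : rows with Duration <= threshold
    long     : rows with Duration > threshold
    rejected : rows whose Duration is missing or not a positive number
               (an assumed cleaning rule, not one given in the brief)
    """
    short, long_, rejected = [], [], []
    for row in rows:
        duration = row.get("Duration")
        if not isinstance(duration, (int, float)) or duration <= 0:
            rejected.append(row)
        elif duration <= threshold:
            short.append(row)
        else:
            long_.append(row)
    return short, long_, rejected


subset1, subset2, rejected = split_by_duration(applicants)
print(len(subset1), len(subset2), len(rejected))
```

Keeping the rejected rows in a separate list, rather than silently dropping them, makes it easier to report what cleaning was done.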

2. For each subset, establish a training set and validation set. Explain:
   a. what principle you have used to decide on these;
   b. why both training and validation sets are needed;
   c. any issues encountered during the splitting exercise.
   [5 marks]
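
One possible principle for task 2 is a stratified random split, which keeps the Good/Bad proportion roughly equal in the training and validation sets. The sketch below is a minimal illustration under assumptions: the 70/30 ratio, the fixed seed, the `Good` field name and the synthetic rows are all choices made for the example, not requirements of the brief.

```python
import random


def stratified_split(rows, label_key="Good", train_fraction=0.7, seed=0):
    """Split rows into (train, valid), sampling each class separately
    so both sets keep approximately the same class proportions."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    train, valid = [], []
    for label in sorted(by_class):
        group = by_class[label][:]  # copy so the caller's list is untouched
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        valid.extend(group[cut:])
    return train, valid


# Synthetic example: 100 applicants, 70 Good and 30 Bad (invented).
rows = [{"id": i, "Good": 1 if i % 10 < 7 else 0} for i in range(100)]
train, valid = stratified_split(rows)
print(len(train), len(valid))
```

With a small subset, a class may end up with very few cases in the validation set, which is one of the issues part (c) invites you to discuss.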

3. For each training set, choose four variables which are suitable for building a scorecard. For each training set, the chosen variables must include:
   (i) at least one continuous variable before binning;
   (ii) at least one categorical variable with more than two categories, so you can see whether categories can be combined.
   Explain the rationale behind your choice of variables (using supporting statistics, e.g. chi-square). Should you be unable to choose variables satisfying the above criteria, explain the problem you have encountered and the compromise you have made in the variable selection. [10 marks]
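
The chi-square statistic mentioned in task 3 can be computed directly from a table of Good and Bad counts per category. The following is a sketch only: the category names and counts are invented, and the function assumes every category is non-empty and that both Goods and Bads occur in the data (otherwise an expected count is zero and the division fails).

```python
def chi_square(table):
    """Pearson chi-square statistic for a table {category: (n_good, n_bad)}.

    Returns (statistic, degrees_of_freedom). With two outcome classes the
    degrees of freedom are (number of categories - 1).
    """
    total_good = sum(good for good, _ in table.values())
    total_bad = sum(bad for _, bad in table.values())
    total = total_good + total_bad

    statistic = 0.0
    for good, bad in table.values():
        row_total = good + bad
        # Expected counts if the attribute were independent of Good/Bad.
        expected_good = row_total * total_good / total
        expected_bad = row_total * total_bad / total
        statistic += (good - expected_good) ** 2 / expected_good
        statistic += (bad - expected_bad) ** 2 / expected_bad
    return statistic, len(table) - 1


# Invented counts for a three-category attribute.
counts = {"A": (60, 40), "B": (80, 20), "C": (70, 30)}
statistic, dof = chi_square(counts)
print(round(statistic, 3), dof)
```

For these invented counts the statistic is about 9.52 on 2 degrees of freedom, above the 5% critical value of 5.99, so the attribute would look informative. The same function can be rerun after merging two categories to see how much discriminatory power is lost by combining them.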

4. Using the binary variables obtained from the coarse classification in the above exercise, build two scorecards for each training set (so, two scorecards for those applicants with Duration <= 12 months; another two for those with Duration > 12 months), one using linear regression and one using logistic regression. Note that the file you submit should include, in the Appendix, a table that gives the binary variables you used, together with the coefficients for those variables calculated in each regression. [15 marks]
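
To show how the two regressions in task 4 relate, the sketch below fits both with the same batch gradient descent loop, changing only the link function (identity for linear regression, sigmoid for logistic regression). This is an illustration under assumptions, not a recommended production method: the indicators `x1` and `x2`, the toy counts, the step count and the learning rate are all invented, and in practice a statistics package would normally do the fitting.

```python
import math


def fit(rows, labels, link, steps=2000, rate=0.5):
    """Fit y ~ link(b0 + b1*x1 + ...) by batch gradient descent.

    For the identity link this minimises squared error (linear regression);
    for the sigmoid link it minimises log-loss (logistic regression).
    Both gradients have the same form: mean of (prediction - y) * x.
    """
    n, k = len(rows), len(rows[0])
    beta = [0.0] * (k + 1)  # beta[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for x, y in zip(rows, labels):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
            error = link(z) - y
            grad[0] += error
            for j, v in enumerate(x):
                grad[j + 1] += error * v
        beta = [b - rate * g / n for b, g in zip(beta, grad)]
    return beta


# Invented training data: two binary indicators, label 1 = Good, 0 = Bad.
# Each cell maps (x1, x2) -> (number of Goods, number of Bads).
cells = {(0, 0): (2, 8), (1, 0): (5, 5), (0, 1): (5, 5), (1, 1): (8, 2)}
rows, labels = [], []
for x, (goods, bads) in cells.items():
    rows += [x] * (goods + bads)
    labels += [1] * goods + [0] * bads

linear = fit(rows, labels, link=lambda z: z)
logistic = fit(rows, labels, link=lambda z: 1.0 / (1.0 + math.exp(-z)))

# Coefficient table of the kind requested for the Appendix.
print(f"{'variable':<10}{'linear':>10}{'logistic':>10}")
for name, b_lin, b_log in zip(["intercept", "x1", "x2"], linear, logistic):
    print(f"{name:<10}{b_lin:>10.4f}{b_log:>10.4f}")
```

The toy counts were chosen so that an additive model reproduces the Good rate in every cell exactly. The linear coefficients therefore approach 0.2, 0.3 and 0.3, and the logistic coefficients approach -ln 4, ln 4 and ln 4, which gives a quick check that the loop has converged.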

5. Derive ROC curves for all scorecards using the validation set applicable to each, showing in detail how sensitivity and specificity have been calculated. Estimate the Gini coefficient and KS values for each. Explain and comment on your results. [15 marks]
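
The calculations in task 5 can be laid out explicitly as in the sketch below. It assumes a convention that should be checked against the lecture notes: an applicant is predicted Bad when the score is at or below the cutoff, sensitivity is the fraction of Bads correctly flagged, and specificity is the fraction of Goods correctly passed. The scores and labels are invented.

```python
def roc_points(scores, is_bad):
    """Return ROC points (1 - specificity, sensitivity), one per cutoff.

    Assumed convention: predict Bad when score <= cutoff.
    """
    n_bad = sum(is_bad)
    n_good = len(is_bad) - n_bad
    points = [(0.0, 0.0)]  # cutoff below every score: nobody flagged
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, b in zip(scores, is_bad) if b and s <= cutoff)
        false_pos = sum(1 for s, b in zip(scores, is_bad) if not b and s <= cutoff)
        sensitivity = true_pos / n_bad       # Bads correctly flagged
        false_pos_rate = false_pos / n_good  # equals 1 - specificity
        points.append((false_pos_rate, sensitivity))
    return points


def gini_and_ks(points):
    """Gini = 2 * AUC - 1 (AUC by the trapezium rule); KS = largest gap
    between the cumulative Bad and cumulative Good proportions."""
    auc = sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
    ks = max(abs(y - x) for x, y in points)
    return 2 * auc - 1, ks


# Invented validation scores; 1 marks a Bad applicant.
scores = [200, 250, 300, 350, 400, 450, 500, 550]
is_bad = [1, 1, 0, 1, 0, 0, 0, 0]

points = roc_points(scores, is_bad)
gini, ks = gini_and_ks(points)
print(round(gini, 3), round(ks, 3))
```

For these invented scores the Gini coefficient is 13/15 (about 0.867) and the KS value is 0.8, reached at a cutoff of 350. Printing `points` gives the table of sensitivity and specificity values behind the curve.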