ST404: Applied Statistical Modelling
Assignment 3: Iranian Churn Data
3.0 ASSIGNMENT WEIGHTING
Assignment 3 counts for 35% of the module mark.
3.1 DEADLINE:
1:00pm Tuesday 2nd May 2023,
to be submitted electronically via Moodle in PDF format.
3.2 PROBLEM OUTLINE
The aim of this assignment is to analyse a subset of a dataset available on the UCI website
maintained by (Dua and Graff 2019). The dataset concerns the CHURN of telecom
customers from Iranian companies and was donated by (Jafari-Marandi 2020). An individual
CHURNS if they are a paying customer who fails to renew their contract (typically because
they switch to another provider, in this case another mobile company).
The link to the data on the UCI website seems to no longer exist, but the data are also
available at (Jafari 2020). There are a number of papers that are listed in these sources that
have used these data including (Jafari-Marandi, Denton, et al. 2020), (Keramati, et al. 2014),
and (Keramati and Ardabili 2011). You may wish to scrutinise these papers to give you
further background.
(Jafari-Marandi 2020) states that:
“The dataset is randomly collected from an Iranian telecom company’s database over a
period of 12 months. A total of 3150 rows of data, each representing a customer, bear
information for 13 columns. The attributes that are in this dataset are call failures, frequency
of SMS, number of complaints, number of distinct calls, subscription length, age group, the
charge amount, type of service, seconds of use, status, frequency of use, and Customer
Value.”
Main question: is it possible to predict customers who will
CHURN and explain why they do?
3.3 DATA AVAILABILITY
The data are available on Moodle as an R data frame called TeleChurn.Rdata.
Details of the variables and their coding are as follows:
Variable Name           Type                  Detail
CallFailure             numeric               number of call failures
Complains               binary                0 = No complaint, 1 = Complaint
SubscriptionLength      numeric               total months of subscription
ChargeAmount            ordinal attribute     0 lowest amount, 9 highest amount
SecondsOfUse            numeric               total seconds of calls
FrequencyOfUse          numeric               total number of calls
FrequencyOfSMS          numeric               total number of text messages
DistinctCalledNumbers   numeric               total number of distinct phone calls
AgeGroup*               ordinal attribute     1 younger age, 5 older age
TariffPlan              binary                1 = Pay as you go, 2 = contractual
Status                  binary                1 = active, 2 = non-active
CustomerValue           numeric               the calculated value of the customer
Churn                   binary (class label)  1 = churn, 0 = non-churn
*AgeGroup: the original data set also had age with only five ages and the following
mapping: 1 = 15, 2 = 25, 3 = 30, 4 = 45 and 5 = 55, whilst (Keramati and Ardabili 2011) suggest the
age groups are <15, 15-30, 30-45, 45-60 and 60-75 respectively.
This variable is to be used to assess the model and should not be used as an explanatory
variable.
Table 1 : Details of the Variables
3.4 ANALYSIS REQUIRED
You are to conduct an analysis of this dataset in R. An outline of the steps you should take
in your analysis is given below.
1) Given the size of the data it is reasonable to divide your observations into a training and
validation set. The proportion in which you do so should be justified and you should
ensure you do so in a random fashion. It will be useful to set a random number seed
(set.seed(xxx), where xxx is some integer value) so that you always produce the
same sub-samples should you need to re-run any code.
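The random split described above might look like the following minimal sketch. It uses a simulated stand-in data frame (`demo`, a hypothetical name) rather than the real TeleChurn data, and the 70/30 proportion and the seed value are illustrative assumptions only, not recommendations.

```r
# Minimal sketch of a reproducible training/validation split (base R only).
# 'demo' is a simulated stand-in, NOT the real TeleChurn data.
set.seed(404)                      # any fixed integer makes the split repeatable

n    <- 3150                       # number of rows quoted in the data description
demo <- data.frame(id    = seq_len(n),
                   Churn = rbinom(n, size = 1, prob = 0.15))

train_prop <- 0.7                  # assumed proportion; justify your own choice
train_idx  <- sample(seq_len(n), size = round(train_prop * n))

train <- demo[train_idx, ]         # rows used to fit the models
valid <- demo[-train_idx, ]        # held-out rows used to assess the models

# Compare the churn rate in the two subsets as a quick sanity check
c(train = mean(train$Churn), valid = mean(valid$Churn))
```

Because the seed is fixed before `sample()` is called, re-running the script reproduces the same two subsets.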
2) Begin with an exploratory analysis of the data. Using appropriate numerical, tabular or
graphical summaries, describe the distribution of the variables and investigate potential
relationships. You should start with the initial basic plots. However, you should then
make use of empirical logistic function and conditional density plots as discussed in
lectures. You are also advised to read (Sheather 2009) which is available from the
library both as a hard copy and electronically. (It is the first book on the Book list for this
module, and the section referred to is only four pages.) Code was provided in lectures
that should allow you to complete this quite quickly. Remember you have limited time
and limited room in your report so whilst you should be thorough in your EDA “behind
the scenes”, carefully pick only a few relevant examples to include in your report.
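As a rough illustration of the two plot types mentioned above, the sketch below computes a binned empirical logit and draws a conditional density plot using base R only. It is not the code provided in lectures: the data are simulated, the helper `empirical_logit()` is a hypothetical name, and the choice of ten quantile bins with a 0.5 continuity adjustment is an assumption.

```r
# Sketch: binned empirical logit and a conditional density plot (base R only).
# 'x' and 'churn' are simulated stand-ins for a numeric predictor and the response.
set.seed(1)
n     <- 2000
x     <- rexp(n, rate = 1 / 50)                 # right-skewed, like a usage count
churn <- rbinom(n, 1, plogis(-1 - 0.03 * x))    # assumed true relationship

# Empirical logit: bin x, then compute log((y + 0.5) / (m - y + 0.5)) per bin,
# where y = number of churners and m = number of customers in the bin.
# Adding 0.5 avoids taking the log of zero in sparse bins.
empirical_logit <- function(x, y, n_bins = 10) {
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1)))
  bin    <- cut(x, breaks = breaks, include.lowest = TRUE)
  m      <- tapply(y, bin, length)
  events <- tapply(y, bin, sum)
  data.frame(mid   = as.numeric(tapply(x, bin, mean)),
             logit = as.numeric(log((events + 0.5) / (m - events + 0.5))))
}

el <- empirical_logit(x, churn)
plot(el$mid, el$logit, type = "b",
     xlab = "bin mean of x", ylab = "empirical logit of churn")

# Conditional density plot: estimated distribution of churn given x
churn_f <- factor(churn, labels = c("no", "yes"))
cdplot(churn_f ~ x, xlab = "x", ylab = "churn")
```

A roughly straight empirical logit suggests the predictor can enter the model untransformed; clear curvature is one hint that a transformation may be worth trying.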
3) Use logistic regression to investigate the relationship between the dependent variable,
Churn, and the explanatory variables. You should attempt this in a number of ways:
a) Find an initial model:
i) Fit a model that contains as a minimum all main effects. You should aim to find a
model that will then be the starting point to reduce the model further in parts 3)b)
and 3)c) below. Below are listed some ideas that you might consider in
developing this model.
(1) You may wish to experiment with transformations of the variables as a result
of your findings in 2) above and to improve this initial model.
(2) You may wish to perform some residual and influential analyses to improve
your model and determine which version of the explanatory variables to use.
You do not need to present details of this model validation, but state if any of
this altered your choice of how each variable should be used in the model.
You will need to present some of these analyses for one of your models in
question 6) below.
(3) You may wish to produce additional variables (e.g. the ratios of one to
another, x and x², interactions etc.) or manipulate the variables in some other
appropriate way. Any such additional variables should be devised based on
your findings in 2) above and your understanding of the variables. However,
you have limited time, so please only try a few ideas and leave the rest to any
discussion of further work in your conclusion.
(4) Your final model at this stage should contain each of the explanatory
variables in some form or other.
(e.g. for a variable x, you might include x or log(x) or √x or x + x² … etc.).
Again you have limited time, so you should aim to try one or two ideas and
later in your discussion indicate what you would do given more time. As this
is an individual assignment, your aim is to demonstrate you have thought
about the issues of the variables you have, and illustrated how you might
tackle these issues.
ii) Once you have found your initial model, briefly comment on the parameter
estimate table (obtained using summary() in R) for this model, and/or an
ANOVA table as appropriate. Is there any evidence that the model should be
reduced?
iii) Investigate if there are any issues of multicollinearity in your model.
b) Use an automated approach to reduce the model (e.g. the step function in R, or a
suitable subset method). Again, present some detail of this, appropriate for the
technique you applied.
c) Use a suitable shrinkage approach or a Bayesian approach to reduce the model.
Justify your choice of method and present and discuss suitable graphical output.
Clearly explain how you picked your model using the method you selected.
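The first two modelling steps above (a full main-effects fit with a multicollinearity check, then an automated reduction) could be sketched as follows. Everything here is illustrative: the data are simulated, the variable names and the helper `vif_one()` are hypothetical, and the variance inflation factor is computed by hand from a linear regression of one predictor on the others rather than with any particular package. The shrinkage or Bayesian step in c) needs an add-on package and is not shown.

```r
# Sketch: full main-effects logistic model, a hand-rolled VIF check, and
# AIC-based backward reduction with step(). Simulated data, base R only.
set.seed(2)
n   <- 1500
dat <- data.frame(calls  = rpois(n, 60),
                  fails  = rpois(n, 6),
                  tariff = factor(sample(c("payg", "contract"), n, replace = TRUE)))
dat$seconds <- 60 * dat$calls + rnorm(n, sd = 100)  # deliberately collinear with calls
dat$churn   <- rbinom(n, 1, plogis(-1 - 0.02 * dat$calls + 0.15 * dat$fails))

# Full model containing all main effects
full <- glm(churn ~ calls + seconds + fails + tariff,
            family = binomial, data = dat)
summary(full)$coefficients          # parameter estimate table

# Variance inflation factor for one numeric predictor: 1 / (1 - R^2), where
# R^2 comes from regressing that predictor on the other explanatory variables.
vif_one <- function(var, others, data) {
  r2 <- summary(lm(reformulate(others, response = var), data = data))$r.squared
  1 / (1 - r2)
}
vif_calls <- vif_one("calls", c("seconds", "fails", "tariff"), dat)
vif_calls                           # large values flag multicollinearity

# Automated reduction: backward elimination by AIC, starting from the full model
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)
```

Setting `trace = 0` suppresses the step-by-step printout; leaving it at the default shows which term is dropped at each stage, which is the kind of detail the brief asks you to present.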
4) Evaluate and compare your three models found in 3)a), 3)b) and 3)c), and hence derive a final
model. You should do this in a number of ways.
a) You may wish to perform some interim checks of residuals, influence and model
specification to help you pick the most appropriate model. You do not need to
present details of such checks, but state if any of this altered your choice of final
model. You will need to present more detailed checks of the residuals, influence etc.
in your findings in question 6) below.
b) The company does not wish to lose customers. If a customer CHURNs the company
will lose out on the potential profit they may have had. To prevent this, customers
who might be about to CHURN could be contacted and offered deals, upgrades etc.
For this fictitious scenario, suppose an intervention to prevent the customer from
CHURNing would cost the company 70 IRR (Iranian rial) on average, and suppose that
such interventions are successful 60% of the time. For each customer we also
have the customer value; let us treat that as the profit the company will have from
that individual if the customer were to stay with them (i.e. not CHURN). If the
customer were to CHURN we would lose this profit from the customer (so we would
have 0 profit from them).
We can then form a cost matrix or a profit matrix. For example a profit matrix might
be:
Profit Matrix                         Actual Result
                                      CHURN                            NOT CHURN
Predicted From Model   CHURN          0.6 * customer value - 70 IRR    customer value - 70 IRR
                       NOT CHURN      0 IRR                            customer value
Table 2: Possible Profit matrix
Write a function to evaluate this (or if you prefer a similar function to evaluate
cost/loss). Evaluate the profit (or cost/loss) that each of your candidate models would
suggest.
c) Another way to evaluate a binary classifier is via a ROC chart (Wikipedia 2022). A
number of R packages produce a ROC chart, for example the pROC package
(Robin, et al. 2021) can produce a bootstrapped estimate of the confidence interval
as well as the main plot.
d) Think about the different models you found in question 3) and compare the variables that
remain or do not remain in each of the models from parts b) and c). Are there some
variables that are in both models? Are there some that are always excluded? Are
there any that cause multicollinearity?
e) Think about which data set to use in these comparisons (training, validation or
both).
Using these results, present one final preferred model. Fully justify why you prefer this
model.
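One possible shape for the profit function in 4)b), together with a simple base R stand-in for the area under the ROC curve in 4)c), is sketched below. The function names `profit()` and `auc_rank()` and the 0.5 classification threshold are assumptions made for illustration. The AUC is computed from ranks (the Mann-Whitney form) and is not the pROC implementation, and the four customers are made up so that each cell of the profit matrix is used once.

```r
# Sketch: profit under the matrix in Table 2, plus a rank-based AUC (base R only).
# predicted, actual: 0/1 vectors (1 = CHURN); value: customer value in IRR.
profit <- function(predicted, actual, value, cost = 70, success = 0.6) {
  stopifnot(length(predicted) == length(actual),
            length(actual) == length(value))
  per_customer <- ifelse(predicted == 1 & actual == 1, success * value - cost,
                  ifelse(predicted == 1 & actual == 0, value - cost,
                  ifelse(predicted == 0 & actual == 1, 0,
                         value)))
  sum(per_customer)
}

# AUC as the proportion of (churner, non-churner) pairs in which the churner
# receives the higher score; ties count as one half through average ranks.
auc_rank <- function(score, actual) {
  n1 <- sum(actual == 1)
  n0 <- sum(actual == 0)
  (sum(rank(score)[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy example: four made-up customers, one per cell of the profit matrix
p_hat     <- c(0.9, 0.8, 0.3, 0.1)          # fitted churn probabilities
actual    <- c(1, 0, 1, 0)
value     <- c(200, 300, 400, 500)
predicted <- as.integer(p_hat > 0.5)        # assumed threshold of 0.5

profit(predicted, actual, value)
# (0.6 * 200 - 70) + (300 - 70) + 0 + 500 = 780
auc_rank(p_hat, actual)
# 3 of the 4 (churner, non-churner) pairs are ordered correctly: 0.75
```

Evaluating `profit()` over a grid of thresholds, rather than only at 0.5, is one way to see how sensitive each candidate model's profit is to the classification rule.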
5) Using the final model selected in question 4):
a) Illustrate how to interpret the parameter estimates:
i) You should interpret fully a parameter for at least one continuous explanatory
variable.
ii) You should interpret fully a parameter associated with one level of a factor
variable.
(If your chosen model only contains continuous variables or factor variables, interpret
a parameter from one of your earlier models.)
b) Illustrate how the model may be used to predict if a customer will CHURN for the
following observation. Please note this should not just be R code, but should be an
illustration using the appropriate formulae.
Variable                Value
CallFailure             10
Complains               yes
SubscriptionLength      20
ChargeAmount            2
SecondsOfUse            4000
FrequencyOfUse          80
FrequencyOfSMS          70
DistinctCalledNumbers   10
AgeGroup*               2
TariffPlan              pay as you go
Status                  active
Table 3: Values of a customer for which we need a prediction.
6) Select one of the models from those fitted in Question 3), i.e. 3)b) or 3)c), or your final
model from 4) above. This does not necessarily have to be the final model you created
in Question 4). Illustrate how you may further validate this model by producing two
additional diagnostic plots (of different types). Fully discuss your plots and discuss what
they show. (You might use this to illustrate, for example, why you rejected an earlier
model, or you might wish to produce these for your final model to illustrate either that
it fits well or that it does not.)
7) Briefly (no more than one paragraph) discuss the limitations of your model and what
further analysis you could attempt given more time.
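For the hand prediction asked for in 5)b), the general form of the calculation is sketched below. The symbols are generic assumptions: the estimates \hat{\beta}_j come from whichever model you finally select, the x_j are that model's terms evaluated at the Table 3 values (after any transformations or factor coding you used), and the cut-off c is your own choice.

```latex
% Linear predictor for the new customer
\hat{\eta} = \hat{\beta}_0 + \sum_{j=1}^{p} \hat{\beta}_j x_j

% Predicted probability of churn (inverse logit of the linear predictor)
\hat{\pi} = \Pr(\text{Churn} = 1 \mid x)
          = \frac{e^{\hat{\eta}}}{1 + e^{\hat{\eta}}}
          = \frac{1}{1 + e^{-\hat{\eta}}}

% Classification rule for a chosen cut-off c
\text{predict CHURN if } \hat{\pi} > c, \quad \text{otherwise predict NOT CHURN}

% Interpretation used in 5)a): a one-unit increase in x_j, with all other
% terms held fixed, multiplies the odds of churn by
\frac{\hat{\pi}/(1 - \hat{\pi}) \text{ at } x_j + 1}{\hat{\pi}/(1 - \hat{\pi}) \text{ at } x_j} = e^{\hat{\beta}_j}
```

For a factor variable, x_j is a 0/1 indicator for the level in question, so e^{\hat{\beta}_j} is the odds ratio of that level relative to the reference level.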
3.5 REPORT
3.5.1 Main Report
Write a report on your findings, describing the data, your analyses and your conclusions.
The report, in this fictitious scenario, is aimed at someone in your company who has a
reasonable understanding of the techniques you have used, e.g. other members of your
statistical analysis team. (i.e. they have a similar understanding to that of a fellow student on
this module.)
Note: This is different to assignments 1 and 2 and is deliberately designed to help you think
about how you might approach this for your dissertation.
3.5.2 Abstract
Your report should contain a short abstract on the first page. This should be about 100-200
words.
3.5.3 Guidance: exemplar
A good way to understand how to structure such a report is to look at published journals.
For example (Jenkins and Rios-Avila 2023) is a recent paper from the more applied section
of the Royal Statistical Society’s journals (and is available electronically via our library).
Obviously this reference is for a paper rather than a report or a dissertation. It presents the
work done by two individuals over a longer time-span than your assignment. In looking at it,
imagine you had carried out the work. What analyses would have gone on behind the
scenes? What has been presented here? Compare the abstract to the sections of the
paper. Notice how final results and tables are presented and notice the final discussion.
You are unlikely to need a theoretical section, and the results you are presenting need to
match the questions posed above, but this should nonetheless provide you with some further
guidance.
3.5.4 Avoiding Plagiarism
Keep in mind the advice on academic writing and the rules about referencing, plagiarism and
proof-reading. Make sure that all sources used, whether online or paper-based are
appropriately referenced. The assignment will be submitted to TurnItIn and any cases of
potential plagiarism forwarded to the departmental academic conduct panel.
3.5.5 Page and Word Limit
• The report should be of no more than 8 pages (excluding the front page, abstract,
contents, bibliography and the appendix).
• The maximum word count for the main report is 3000 and does not include the
abstract, figure labels or headings or the bibliography.
• You should state your word count at the end of the report.
• If no word count is provided, you will have to accept the decision of the marker on
whether you have exceeded the word count (which might then include your headings
and figure labels).
• If you exceed the word count by more than 5% but less than or equal to 10%, your
mark will be reduced by 5%.
• If you exceed the word count by more than 10% but less than or equal to 15%, your
mark will be reduced by 20%.
• If you exceed the word count by more than 15%, your mark will be reduced by 50%.
• If you exceed the page limit, 5% will be deducted for every additional page.
3.5.6 Additional Penalties:
• Late submission (-5% per working day)
• Not using appropriate layout (-5%) (see below)
3.5.7 Report Layout
The front cover of your report should give your student ID, but not your name (to allow for
anonymous marking).
The report should be typeset in a professional manner with appropriate margins (at least one
inch), font size 11 or higher and 1.5 spacing. All figures and tables should be numbered and
have captions. These should be referenced in the report. Pages should be numbered. There
should be numbered headings, sub-headings and sub-sub-headings (as in this assignment
brief). You should use appropriate terminology and language (which will partly depend on
which section you are writing).
3.5.8 Appendix
You should provide an appendix of a suitable selection of your code which should contain
suitable comments. No graphics should be in this appendix. Where code is repeated (e.g.
plotting a histogram of each continuous variable) you need only include one example with a
brief comment to say how this continued for other variables.
3.5.9 Individuality
This is an individual assignment. Collaboration between students is not permitted
(other than questions/answers posted on the discussion forum) and will be treated as
cheating.
3.6 ASSESSMENT CRITERIA
3.6.1 Mark Scheme:
This assessment is worth 35% of your final mark on ST404. The assessment will be based
on your understanding of the problem, the competence of your analysis and the presentation
of your report. The report will be marked out of 100 and then weighted with your other
marks.
The mark allocation for this assignment is approximately:
Question 1): 2 marks
Question 2): 15 marks
Question 3): 30 marks
Question 4): 15 marks
Question 5): 8 marks
Question 6): 7 marks
Question 7): 8 marks
Report structure and presentation (including quality of tables and figures,
professionalism, use of numbered headings, page numbers, contents page, figure labels
etc.): 7 marks
Appropriate use of English language (including spelling and grammar, clarity, avoidance
of statistical terms in their colloquial sense (e.g. significant)): 8 marks
However, exceptional discussion in one section may gain bonus credit to compensate for poor
discussion elsewhere.
3.7 BIBLIOGRAPHY
Dua, D, and C Graff. 2019. UCI Machine Learning Repository. Irvine: School of Information
and Computer Sciences, University of California. http://archive.ics.uci.edu/ml.
Haddadi, Seyed Jamal, Mohammad Ostad Mohammadi, Mojtaba Bahrami, Elham Khoeini,
Mehdi Beygi, and Mehrdad Haddad Khoshkar. 2022. “Customer Churn Prediction in
the Iranian Banking Sector.” International Conference on Applied Artificial
Intelligence. Halden, Norway: IEEE (Institute of Electrical and Electronics Engineers).
1-6. doi:10.1109/ICAPAI55158.2022.9801574.
Jafari, Roy. 2020. Customer Churn. Kaggle.
https://www.kaggle.com/datasets/royjafari/customer-churn.
Jafari-Marandi, Ruholla. 2020. Iranian Churn Dataset Data Set. UCI Machine Learning
Repository. https://archive.ics.uci.edu/ml/datasets/Iranian+Churn+Dataset.
Jafari-Marandi, Ruholla, Joshua Denton, Adnan Idris, Brian K. Smith, and Abbas Keramati.
2020. “Optimum profit-driven churn decision making: innovative artificial neural
networks in telecom industry.” Neural Computing and Applications (Springer-Verlag
London Ltd.) 32 (18): 14929-14962. https://doi.org/10.1007/s00521-020-04850-6.
Jenkins, Stephen P, and Fernando Rios-Avila. 2023. “Reconciling reports: modelling
employment earnings and measurement errors using linked survey and
administrative data.” Journal of the Royal Statistical Society Series A: Statistics in
Society 186: 110-136. https://doi.org/10.1093/jrsssa/qnac003.
Keramati, Abbas, and Seyed M.S. Ardabili. 2011. “Churn analysis for an Iranian mobile
operator.” Telecommunications Policy 35 (4): 344-356.
https://doi.org/10.1016/j.telpol.2011.02.009.
Keramati, A., R. Jafari-Marandi, M. Aliannejadi, I. Ahmadian, M. Mozaffari, and U. Abbasi.
2014. “Improved churn prediction in telecommunication industry using data mining
techniques.” Applied Soft Computing (Elsevier) 24: 994-1012.
https://doi.org/10.1016/j.asoc.2014.08.041.
Robin, Xavier, Natacha Turck, Alexandre Hainard, Natalia Tiberti, Frédérique Lisacek,
Jean-Charles Sanchez, Markus Müller, Stefan Siegert, Matthias Doering, and
Zane Billings. 2021. Package ‘pROC’. CRAN.
https://cran.r-project.org/web/packages/pROC/pROC.pdf.
Sheather, Simon J. 2009. “Transforming Predictors in Logistic Regression for Binary Data.”
Chap. 8.2.3 in A Modern Approach to Regression with R, 282-286. Springer.
https://0-link-springer-com.pugwash.lib.warwick.ac.uk/book/10.1007/978-0-387-09608-7
(Warwick library logon required).
Wikipedia. 2022. Receiver operating characteristic. October 24. Accessed November 20, 2022.
https://en.wikipedia.org/wiki/Receiver_operating_characteristic.