xuebaunion@vip.163.com

3551 Trousdale Rkwy, University Park, Los Angeles, CA

留学生论文指导和课程辅导

无忧GPA：https://www.essaygpa.com

工作时间：全年无休-早上8点到凌晨3点

微信客服：xiaoxionga100

微信客服：ITCS521

R代写-MAS8383

时间：2020-12-14

Project

Submit your project report and the video of your oral presentation via NESS by

11:45pm on Monday 11th January 2021.

Please note that:

• The project comprises two parts: an applied part a theoretical part. The two parts of the

project are disjoint.

• For the applied part, you must write a report which should not exceed seven pages, written

in Word or Latex. Project reports exceeding this limit will be penalised. Note that you

are advised to include an Appendix, which does not count towards the page limit, detailing

enough R code to allow the reader to reproduce your analysis. You may also like to include

supplementary tabular and graphical output.

• There is no page restriction for the theoretical part.

• You should submit the applied and theoretical parts of your project as a single electronic file

in PDF or Word format.

• The video of your oral presentation should not exceed three minutes. The format must be

mp4 and the file must be zipped up as a Zip or Tar Archive file before uploading it to NESS.

You are advised to create and zip up the video using the guidance on Canvas under Modules

−→ Assessment Materials −→ Useful Practical Information.

1 Project brief

In this project, you will analyse the BreastCancer data set which concerns characteristics of breast

tissue samples collected from 699 women in Wisconsin using fine needle aspiration cytology (FNAC).

This is a type of biopsy procedure in which a thin needle is inserted into an area of abnormal-

appearing breast tissue. Nine easily-assessed cytological characteristics, such as uniformity of cell

size and shape, were measured for each tissue sample on a one to ten scale. Smaller numbers

indicate cells that looked healthier in terms of that characteristic. Further histological examination

established whether each of the samples was benign or malignant. The objective of the clinical

experiment was to determine the extent to which a tissue sample could be classified as benign or

malignant using only the nine cytological characteristics.

For the purposes of this project, you may assume that the patients can be regarded as a random

sample from the population of women experiencing symptoms of breast cancer.

The data set is part of the mlbench package. The package can be installed by typing into the console

install.packages("mlbench")

It can then be loaded into R and inspected as follows:

1

MAS8383 Project

## Load mlbench package

library(mlbench)

## Load the data

data(BreastCancer)

## Check size

dim(BreastCancer)

## [1] 699 11

## Print first few rows

head(BreastCancer)

## Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei

## 1 1000025 5 1 1 1 2 1

## 2 1002945 5 4 4 5 7 10

## 3 1015425 3 1 1 1 2 2

## 4 1016277 6 8 8 1 3 4

## 5 1017023 4 1 1 3 2 1

## 6 1017122 8 10 10 8 7 10

## Bl.cromatin Normal.nucleoli Mitoses Class

## 1 3 1 1 benign

## 2 3 2 1 benign

## 3 3 1 1 benign

## 4 3 7 1 benign

## 5 3 1 1 benign

## 6 9 7 1 malignant

More information on the variables can be found by typing ?BreastCancer in the console.

The project is worth 60% of the overall mark for the module and comprises two parts: an applied

part (worth 80% of the marks) and a theoretical part (worth 15% of the marks). The remaining

5% is reserved for presentation.

The oral presentation is pass / fail. This means it does not carry a mark, but must be passed in

order to pass the module. Its main purpose is to encourage you to focus on explaining statistical

ideas in your own words.

1.1 Applied part

Your goal is to build a classifier for the Class – benign or malignant – of a tissue sample based

on (at least some of) the nine cytological characteristics. It should be stressed that this is a real

data set and there is no “correct” answer. Instead, what is required is evidence of an understanding

of the main statistical ideas, sound interpretation of results, sensible and reasoned comparisons of

classifiers, and demonstration of competence in the use of R as a tool for data analysis.

This part of the project should be written up as a coherent report, giving consideration to the

points detailed in Section 1.1.1 below. You may like to include R code in your report. Alternatively,

you can simply place the code in an Appendix and refer to it as appropriate. You do not need

to comprehensively describe everything you have done to explore and model the data. However,

you should provide a narrative which details and justifies the salient features of your approach, in

addition to reporting and interpreting your results.

2

MAS8383 Project

1.1.1 Points to consider

• You should begin by cleaning the data:

– Technically, the nine cytological characteristics are ordinal variables on a 1 – 10 scale. In

the BreastCancer data, they are encoded as factors. For the purposes of this project,

we will treat them as quantitative variables. You should carefully convert the factors to

quantitative variables and explain why this is a reasonable thing to do.

– This data set contains some missing observations on predictors, encoded as NA. For the

purposes of this project, you should remove all of the rows where there are missing values

before carrying out any further analysis. To do this, you may find the is.na function

helpful. For instance

## Print 24th row of Breast Cancer data and note there is a NA in the

## Bare.nuclei column:

BreastCancer[24,]

## Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei

## 24 1057013 8 4 5 1 2

## Bl.cromatin Normal.nucleoli Mitoses Class

## 24 7 3 1 malignant

## Test whether each element on the 24th row is a NA:

is.na(BreastCancer[24,])

## Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size Bare.nuclei

## 24 FALSE FALSE FALSE FALSE FALSE FALSE TRUE

## Bl.cromatin Normal.nucleoli Mitoses Class

## 24 FALSE FALSE FALSE FALSE

• Consider some exploratory data analysis. For example, how might you summarise the data

graphically and numerically? What does this tell you about the relationships between the

response variable and predictor variables and about the relationships between predictor vari-

ables?

• You should build classifiers using each of the following methods:

– At least one method for subset selection in logistic regression;

– At least one regularized form of logistic regression, i.e. with a ridge or LASSO penalty;

– At least one discriminant analyis method, i.e. the Bayes classifier for linear disciminant

analysis (LDA) or quadratic discriminant analysis (QDA). Write some R code to consider

all possible subsets of predictor variables and choose the subset which minimises the

cross-validation estimate of the test error.

For the variants of logistic regression, you should present the coefficients of the fitted model,

and any other useful graphical or numerical summaries. For LDA and QDA present estimates

of the group means. In each case, discuss what your results show. For example, which variables

drop out of the model when you use subset selection or the LASSO? What do the parameters

tell you about the relationships between the response and predictor variables?

• Compare the performance of your models using cross-validation based on the test error. Think

about how you might do this in a way that makes the comparison fair.

• Select a final “best” classifier, justifying your choice. Does it include all the predictor variables?

Why or why not? What is the nature of the misclassification errors it tends to make?

3

MAS8383 Project

1.2 Theoretical part

We have seen that logistic regression and LDA often perform comparably. For simplicity suppose

that all the predictors are real-valued quantities, i.e. x ∈ Rp. Show that use of logistic regression

with the classification rule

Y =

{

1, if Pr(Y = 1|X = x) > α,

0, otherwise.

is equivalent to a discriminant rule with a linear boundary between the two allocation regions.

1.3 Oral presentation

Present a summary of the main findings from the applied part of your project report. You can make

your slides using whatever presentation software you like, for example Latex Beamer, PowerPoint

or Keynote.

4