FIT2086 Assignment 3

Date: 2020-10-26

Introduction
There are a total of three questions worth 10 + 18 + 14 = 42 marks in this assignment.
This assignment is worth a total of 20% of your final mark, subject to hurdles and any other matters
(e.g., late penalties, special consideration, etc.) as specified in the FIT2086 Unit Guide or elsewhere
in the FIT2086 Moodle site (including Faculty of I.T. and Monash University policies).
Students are reminded of the Academic Integrity Awareness Training Tutorial Activity and, in particular, of Monash University’s policies on academic integrity. In submitting this assignment, you
acknowledge your awareness of Monash University’s policies on academic integrity and that work is
done and submitted in accordance with these policies.
Submission: No files are to be submitted via e-mail. Correct files are to be submitted to Moodle, as
given above. You must submit the following three files:
1. One PDF file containing non-code answers to all the questions that require written answers. This
file should also include all your plots.
2. The two required R script files containing R code answers as discussed in Questions 2 and 3.
Please read these submission instructions carefully and take care to submit the correct files in the
correct places.
Question 1 (10 marks)
This question will require you to analyse a regression dataset. The file housing.ass3.2020.csv
contains the data that we will use for this question. This dataset is a modified version of the Boston
housing data which was collected to study house prices in the metropolitan region of Boston. In this
data set, each observation represents a particular suburb from the Boston region. The outcome, medv,
is the median value of owner-occupied homes in $1,000s in the suburb. The variables are summarised
in Table 1. The data consists of p = 12 variables measured on n = 250 suburbs. We are interested
in discovering which predictors are good determinants of housing price, and how these variables affect
the median house price.
1. Fit a multiple linear model to the housing data using R. Using the results of fitting the linear
model, which predictors do you think are possibly associated with median house value, and why?
Which three variables appear to be the strongest predictors of housing price, and why? [3
marks]
2. How would your assessment of which predictors are associated change if you used the Bonferroni
procedure with α = 0.05? [1 mark]
3. Describe what effect the per-capita crime rate (crim) appears to have on the median house price.
Describe what effect a suburb having frontage on the Charles River has on the median house
price for that suburb. [2 marks]
4. Use the stepwise selection procedure, with the BIC criterion, to prune out potentially unimportant variables. Write down the final regression equation obtained after pruning. [1 mark]
5. If a council wanted to try and improve the median house value in their suburb, what does the
model that we found in Question 1.4 suggest they could try and do? [2 marks]
6. Table 2 gives the values of predictors for a new suburb. Use the model found in Question 1.4
to predict the median house price for this suburb. Provide a 95% confidence interval for this
prediction. [1 mark]
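The workflow described in Question 1 (fit a linear model, apply a Bonferroni threshold, prune with BIC, then predict with a confidence interval) can be sketched in base R as follows. This is only an illustrative sketch on simulated data: the data frame `toy`, its columns `x1`, `x2`, `x3`, `y` and the point `new.x` are hypothetical and are not part of the housing data.

```r
# Illustrative sketch on simulated data (all names here are hypothetical).
set.seed(1)
n <- 250
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 + 1.5 * toy$x1 - 0.8 * toy$x2 + rnorm(n)  # x3 is pure noise

# Fit a multiple linear regression on all predictors
fit <- lm(y ~ ., data = toy)

# p-values of the predictors (intercept row dropped)
pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]

# Bonferroni: compare each p-value against alpha / (number of tests)
alpha <- 0.05
bonf.keep <- names(pvals)[pvals < alpha / length(pvals)]

# Stepwise selection with the BIC penalty: k = log(n) in step()
fit.bic <- step(fit, k = log(nrow(toy)), direction = "both", trace = 0)

# Prediction for a new observation with a 95% confidence interval
new.x <- data.frame(x1 = 0.5, x2 = -1, x3 = 0)
pred <- predict(fit.bic, newdata = new.x, interval = "confidence", level = 0.95)
print(pred)
```

Setting `k = log(n)` is what turns the default AIC penalty of `step()` into the BIC penalty; `interval = "confidence"` gives an interval for the mean response rather than for a single new observation.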
Variable name | Description | Values
crim | Per-capita crime rate | > 0
zn | Proportion of residential land zoned for lots over 25,000 sq. ft. | 0 - 100
indus | Proportion of non-retail business acres per town | 0 - 100
chas | Does the suburb front the Charles River? | 0 = No, 1 = Yes
nox | Nitric oxides concentration (parts per 10 million) | > 0
rm | Average number of rooms per dwelling | ≥ 1
age | Proportion of owner-occupied units built prior to 1940 | 0 - 100
dis | Weighted distances to five Boston employment centres | > 0
rad | Index of accessibility to radial highways | > 0
tax | Full-value property-tax rate per $10,000 | 187 - 711
ptratio | Pupil-teacher ratio | > 0
lstat | Percentage of “lower status” of the population | 0 - 100
medv | Median value of owner-occupied homes in $1,000s | > 0
Table 1: Boston Housing Data Dictionary.
Variable | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat
Value | 0.04741 | 0 | 11.93 | 0 | 0.573 | 6.03 | 80.8 | 2.505 | 1 | 273 | 21 | 7.88
Table 2: Predictor values for the new suburb.
Question 2 (18 marks)
In this question we will analyse the data in heart.train.ass3.2020.csv. In this dataset, each
observation represents a patient at a hospital that reported showing signs of possible heart disease.
The outcome is presence of heart disease (HD), or not, so this is a classification problem. The predictors
are summarised in Table 3. We are interested in learning a model that can predict heart disease from
these measurements. To answer this question you must:
• Provide an R script containing all the code you used to answer the questions. Please use comments to ensure that the code used to answer each question is clearly identifiable. Call this
file fn.sn.Q2.R, where “fn.sn” is your first name followed by your family name.
• Provide appropriate written answers to the questions, along with any graphs, in a report document.
When answering this question, you must use the rpart package that we used in Studio 9. The
wrapper function for learning a tree using cross-validation that we used in Studio 9 is contained in the
file wrappers.R. Don’t forget to source this file to get access to the function.
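As a rough illustration of fitting and pruning a classification tree with rpart, the sketch below uses simulated data. The cross-validation wrapper in wrappers.R is not reproduced here; instead, as an assumption, the sketch relies on the single-run cross-validated error that rpart stores in its `cptable`, which is not the same as the repeated cross-validation asked for in this question. The data frame `toy` and its columns are hypothetical.

```r
# Illustrative sketch on simulated data (all names here are hypothetical).
library(rpart)

set.seed(1)
n <- 400
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- factor(ifelse(toy$x1 > 0.5 & toy$x2 > 0.3, "Y", "N"))
# Flip roughly 10% of the labels so the problem is not perfectly separable
flip <- runif(n) < 0.1
toy$y[flip] <- ifelse(toy$y[flip] == "Y", "N", "Y")

# Grow a large classification tree; xval = 10 requests 10-fold CV internally
tree <- rpart(y ~ x1 + x2, data = toy, method = "class",
              control = rpart.control(xval = 10, cp = 0.001))

# Pick the complexity parameter with the smallest cross-validated error
best.cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned <- prune(tree, cp = best.cp)

# Number of leaves (terminal nodes) in the pruned tree
n.leaves <- sum(pruned$frame$var == "<leaf>")

# The console printout lists the class proportions in every node
print(pruned)

# Class probabilities for individual observations
probs <- predict(pruned, newdata = toy[1:3, ], type = "prob")
```

The printed tree shows, for each node, the number of observations, the majority class and the class proportions, which is where per-leaf probabilities can be read off.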
1. Using the techniques you learned in Studio 9, fit a decision tree to the data using the rpart
package. Use cross-validation with 10 folds and 5,000 repetitions to select an appropriate size
tree. What variables have been used in the best tree? How many leaves (terminal nodes) does
the best tree have? [2 marks]
2. Plot the tree found by CV, and discuss clearly and thoroughly in plain English what it tells
you about the relationship between the predictors and heart disease. (hint: you can use the
text(cv$best.tree,pretty=12) function to add appropriate labels to the tree). [3 marks]
3. For classification problems, the rpart package only labels the leaves with the most likely class.
However, if you examine the tree structure in its textual representation on the console, you can
determine the probabilities of having heart disease (see Question 2.3 from Studio 9 as a guide)
in each leaf (terminal node). Take a screen-capture of the plot of the tree (don’t forget to use
the “zoom” button to get a larger image) or save it as an image using the “Export” button in R
Studio.
Then, use the information from the textual representation of the tree available at the console
and annotate the tree in your favourite image editing software; next to all the leaves in the tree,
add text giving the probability of contracting heart disease. Include this annotated image in
your report file. [2 marks]
4. According to your tree, which predictor combination results in the highest probability of having
heart-disease? [1 mark]
5. We will also fit a logistic regression model to the data. Use the glm() function to fit a logistic
regression model to the heart data, and use stepwise selection with the BIC score to prune
the model. What variables does the final model include, and how do they compare with the
variables used by the tree estimated by CV? Which predictor is the most important in the
logistic regression? [3 marks]
6. Write down the regression equation for the logistic regression model you found using step-wise
selection. [1 mark]
7. The file heart.test.ass3.2020.csv contains the data on a further n0 = 200 individuals. Using
the my.pred.stats() function contained in the file my.prediction.stats.R, compute the prediction statistics for both the tree and the step-wise logistic regression model on this test data.
Contrast and compare the two models in terms of the various prediction statistics. Would one
potentially be preferable to the other as a diagnostic test? Justify your answer. [2 marks]
8. Calculate the odds of having heart disease for the patient in the 69th row of the test dataset.
The odds should be calculated for both:
(a) the tree model found using cross-validation; and
(b) the step-wise logistic regression model.
How do the predicted odds for the two models compare? [2 marks]
9. For the logistic regression model using the predictors selected by BIC in Question 2.6, use the
bootstrap procedure (use at least 5,000 bootstrap replications) to find a confidence interval for
the probability of having heart disease for the patient in the 69th row in the test data. Use the bca
option when computing this confidence interval. Discuss this confidence interval in comparison
to the predicted probabilities of having heart disease for both the logistic regression model and
the tree model. [2 marks]
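The relationship between a predicted probability and the corresponding odds, and a BCa bootstrap interval for a predicted probability, can be sketched as follows. This is an illustrative sketch on simulated data, not the heart data: `toy`, `new.pt` and `boot.prob` are hypothetical names, and the sketch uses 2,000 replications only to keep it quick (the question asks for at least 5,000).

```r
# Illustrative sketch on simulated data (all names here are hypothetical).
library(boot)

set.seed(1)
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x))
new.pt <- data.frame(x = 0.3)  # the individual we want a prediction for

# Logistic regression and predicted probability for the new individual
fit <- glm(y ~ x, data = toy, family = binomial)
p.hat <- predict(fit, newdata = new.pt, type = "response")

# Odds = p / (1 - p); for logistic regression this equals exp(linear predictor)
odds <- p.hat / (1 - p.hat)

# Statistic for boot(): refit on the resampled rows, predict the same individual
boot.prob <- function(data, indices) {
  fit.b <- glm(y ~ x, data = data[indices, ], family = binomial)
  predict(fit.b, newdata = new.pt, type = "response")
}

bs <- boot(data = toy, statistic = boot.prob, R = 2000)
ci <- boot.ci(bs, conf = 0.95, type = "bca")

# Lower and upper BCa limits for the predicted probability
print(ci$bca[4:5])
```

For a tree, the same odds formula applies to the class probability of the leaf that the individual falls into.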
Variable name | Description | Values
AGE | Age of patient in years | 29 - 77
SEX | Sex of patient | M = Male; F = Female
CP | Chest pain type | Typical = Typical angina; Atypical = Atypical angina; NonAnginal = Non anginal pain; Asymptomatic = Asymptomatic pain
TRESTBPS | Resting blood pressure (in mmHg) | 94 - 200
CHOL | Serum cholesterol in mg/dl | 126 - 564
FBS | Fasting blood sugar > 120 mg/dl? | <120 = No; >120 = Yes
RESTECG | Resting electrocardiographic results | Normal = Normal; ST.T.Wave = ST wave abnormality; Hypertrophy = showing probable hypertrophy
THALACH | Maximum heart rate achieved | 71 - 202
EXANG | Exercise induced angina? | N = No; Y = Yes
OLDPEAK | Exercise induced ST depression relative to rest | 0 - 6.2
SLOPE | Slope of the peak exercise ST segment | Up = Up-sloping; Flat = Flat; Down = Down-sloping
CA | Number of major vessels colored by fluoroscopy | 0 - 3
THAL | Thallium scanning results | Normal = Normal; Fixed.Defect = Fixed fluid transfer defect; Reversible.Defect = Reversible fluid transfer defect
HD | Presence of heart disease | N = No; Y = Yes
Table 3: Heart Disease Data Dictionary. ST depression refers to a particular type of feature in an
electrocardiograph (ECG) signal during periods of exercise. Thallium scanning refers to the use of
radioactive Thallium to check the fluid transfer capability of the heart.
[Plot: Relative Intensity versus Mass/Charge (MZ), showing the Measurements and Spectrum series]
Figure 1: Noisy measurements from a (simulated) mass spectrometry reading. The “true” (unknown)
measurements are shown in orange, and the noisy measurements are shown in blue.
Question 3 (14 marks)
Data Smoothing
Data “smoothing” is a very common problem in data science and statistics. We are often interested
in examining the unknown relationship between a dependent variable (y) and an independent variable
(x), under the assumption that the dependent variable has been imperfectly measured and has been
contaminated by measurement noise. The model of reality that we use is
y = f(x) + ε
where f(x) is some unknown, “true”, potentially non-linear function of x, and ε ∼ N(0, σ²) is a random
disturbance or error. This is called the problem of function estimation, and the process of estimating
f(x) from the noisy measurements y is sometimes called “smoothing the data” (even if the resulting
curve is not “smooth” in a traditional sense, it is less rough than the original data).
In this question you will use the k-nearest neighbours machine learning technique to smooth data.
This technique is used frequently in practice (think, for example, of the 14-day rolling averages used to
estimate coronavirus infection numbers). This question will explore its effectiveness as a smoothing
tool.
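To make the idea concrete, here is a minimal base-R sketch of k-nearest neighbours smoothing under the model y = f(x) + ε. It uses a plain unweighted average of the k nearest training points, which is a simplification and not the weighted "optimal" kernel of the kknn package. The curve `f`, the noise level and the helper `knn.smooth` are assumptions made up for illustration.

```r
# Illustrative sketch on simulated data (all names here are hypothetical).
set.seed(1)
n <- 200
x <- sort(runif(n, 0, 10))
f <- function(x) 50 * exp(-(x - 5)^2)  # made-up "true" curve with one peak
y <- f(x) + rnorm(n, sd = 5)           # noisy measurements of f(x)

# Unweighted k-NN smoother: average the y-values of the k nearest x-values
knn.smooth <- function(x.train, y.train, x.new, k) {
  sapply(x.new, function(x0) {
    nearest <- order(abs(x.train - x0))[1:k]
    mean(y.train[nearest])
  })
}

f.hat <- knn.smooth(x, y, x, k = 10)

# Mean-squared error against the true curve, before and after smoothing
mse.raw <- mean((y - f(x))^2)
mse.smooth <- mean((f.hat - f(x))^2)
print(c(raw = mse.raw, smoothed = mse.smooth))
```

Averaging over more neighbours reduces the variance of the estimate but flattens sharp features such as peaks, which is the trade-off controlled by k.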
Mass Spectrometry Data Smoothing
The file ms.train.2020.csv contains n = 443 measurements from a mass spectrometer. Mass spectrometry is a chemical analysis tool that provides a measure of the physical composition of a material.
The outputs of a mass spectrometry reading are the intensities of various ions, indexed by their mass-to-charge ratio. The resulting spectrum usually consists of a number of relatively sharp peaks that
indicate a concentration of particular ions, along with an overall background level. A standard problem is that the measurement process is generally affected by noise – that is, the sensor readings are
imprecise and corrupted by measurement noise. Therefore, smoothing, or removing the noise is crucial
as it allows us to get a more accurate idea of the true spectrum, as well as determine the relative
quantity of the ions more accurately. However, we would ideally like for our smoothing procedure to
not damage the important information contained in the spectrum (i.e., the heights of the peaks).
The file ms.train.2020.csv contains measurements of our mass spectrometry reading; ms.train$MZ are
the mass-to-charge ratios of various ions, and ms.train$intensity are the measured (noisy) intensities
of these ions in our material. The file ms.test.2020.csv contains n = 886 different values of MZ along
with the “true” intensity values, stored in ms.test.2020$intensity. These true values have been
found by using several advanced statistical techniques to smooth the data, and are being used here to
see how close your estimated spectrum is to the truth. For reference, the samples ms.train$intensity
and the value of the true spectrum ms.test$intensity are plotted in Figure 1 against their respective
MZ values. To answer this question you must:
• Provide an R script containing all the code you used to answer the questions. Please use comments to ensure that the code used to answer each question is clearly identifiable. Call this
file fn.sn.Q3.R, where “fn.sn” is your first name followed by your family name.
• Provide appropriate written answers to the questions, along with any graphs, in a non-handwritten
report document.
To answer this question, you must use the kknn and boot packages that we used in Studios 9 and 10.
Questions
1. Use the k-nearest neighbours method (k-NN) to estimate the underlying spectrum from the
training data. Use the kknn package we examined in Studio 9 to provide predictions for the
MZ values in ms.test, using ms.train as the training data. You should use the kernel =
"optimal" option when calling the kknn() function. This means that the predictions are formed
by a weighted average of the k points nearest to the point we are trying to predict, the weights
being determined by how far away the neighbours are from the point we are trying to predict.
(a) For each value of k = 1, . . . , 25, use k-NN to estimate the values of the spectrum for the
MZ values in ms.test$MZ. Then, compute the mean-squared error between your estimates
of the spectrum, and the true values in ms.test$intensity. Produce a plot of these errors
against the various values of k. [1 mark]
(b) Produce four graphs, each one showing: (i) the training data points (ms.train$intensity),
(ii) the true spectrum (ms.test$intensity) and (iii) the estimated spectrum (predicted
intensity values for the MZ values in ms.test.csv) produced by the k-NN method for four
different values of k; do this for k = 2, k = 5, k = 10 and k = 25. Make sure the graphs have
clearly labelled axes and a clear legend. Use a different colour for your estimated curve.
[3 marks]
(c) Discuss, qualitatively, and quantitatively (in terms of mean-squared error on the true spectrum) the four different estimates of the spectrum. [2 marks]
2. Use the cross-validation functionality in the kknn package to select an estimate of the best value
of k (make sure you still use the optimal kernel). What value of k does the method select?
How does it compare to the (in practice, unknown) value of k that would minimise the actual
mean-squared error (as computed in Question 3.1a)? [1 mark]
3. Using the estimates of the curve produced in the previous question, see if you can provide
an estimate of the variance of the sensor/measurement noise that has corrupted our intensity
measurements. [1 mark]
4. Do any of the estimated spectra achieve our aim of providing a smooth, low-noise estimate of
background level as well as accurate estimation of the peaks? Explain why you think the k-NN
method is able to achieve, or not achieve, this aim. [2 marks]
5. An important task when processing mass spectrometry signals is to locate the peaks, as this
gives information on which elements are present. From the smoothed signal produced using the
value of k found in Question 3.2, which value of MZ corresponds to the maximum estimated
abundance? [1 mark]
6. Using the bootstrap procedure (use at least 5,000 bootstrap replications), write code to find a
confidence interval for the k-nearest neighbours estimate of relative abundance at a specific MZ
value. Use this code to obtain a 95% confidence interval for the estimate of relative abundance
at the MZ value you determined previously in Question 3.5 (i.e., the value corresponding to the
highest relative intensity). Compute confidence intervals using the k determined in Question 3.2,
as well as k = 3 neighbours and k = 20 neighbours. Report these confidence intervals. Explain
why you think these confidence intervals vary in size for different values of k. [3 marks]
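The general pattern of scanning k = 1, ..., 25, selecting k by cross-validation and locating the peak of the smoothed curve might look like the sketch below. It assumes the kknn package is installed and uses simulated stand-ins for the training and test data; the data frames `train` and `test` and the curve `f` are hypothetical, with column names chosen to mirror those described in this question.

```r
# Illustrative sketch on simulated data (all names here are hypothetical).
library(kknn)

set.seed(1)
f <- function(x) 50 * exp(-(x - 5)^2)  # made-up "true" spectrum
train <- data.frame(MZ = sort(runif(200, 0, 10)))
train$intensity <- f(train$MZ) + rnorm(200, sd = 5)
test <- data.frame(MZ = seq(0, 10, length.out = 400))
test$intensity <- f(test$MZ)

# Mean-squared error on the test curve for each k = 1, ..., 25
ks <- 1:25
mse <- sapply(ks, function(k) {
  fit <- kknn(intensity ~ MZ, train, test, k = k, kernel = "optimal")
  mean((fit$fitted.values - test$intensity)^2)
})
k.mse <- ks[which.min(mse)]
plot(ks, mse, type = "b", xlab = "k", ylab = "Mean-squared error")

# Leave-one-out cross-validation on the training data to choose k
cv <- train.kknn(intensity ~ MZ, data = train, kmax = 25, kernel = "optimal")
k.cv <- cv$best.parameters$k

# Smoothed curve at the selected k, and the MZ value of its maximum
fit.cv <- kknn(intensity ~ MZ, train, test, k = k.cv, kernel = "optimal")
mz.peak <- test$MZ[which.max(fit.cv$fitted.values)]
print(c(k.mse = k.mse, k.cv = k.cv, mz.peak = mz.peak))
```

In practice only `k.cv` is available, since `k.mse` depends on the true curve; comparing the two on simulated data shows how close cross-validation gets to the best achievable choice.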
