Date: 2024-03-13
DATA3888 (2024): Assignment 1
Instructions
1. Your assignment submission needs to be an HTML document that you have compiled using R Markdown
or Quarto. Name your file “SIDXXX_Assignment.html”, where XXX is your Student ID.
2. Under author, put your Student ID at the top of the Rmd file (NOT your name).
3. For your assignment, please use set.seed(3888) at the start of each chunk (where required).
4. Do not upload the code file (i.e. the Rmd or qmd file).
5. You must use code folding so that the marker can inspect your code where required.
6. Your assignment should make sense and provide all the relevant information in the text when the code
is hidden. Don’t rely on the marker to understand your code.
7. Any output that you include needs to be explained in the text of the document. If your code chunk
generates unnecessary output, please suppress it by specifying chunk options like message = FALSE.
8. Start each of the 3 questions in a separate section. The parts of each question should be in the same
section.
9. You may be penalised for excessive or poorly formatted output.
Question 1: Reef
Between 2014 and 2017, marine scientists recorded an unprecedented global coral bleaching event. Your
friend Farhan is a marine science expert who wants to study the environmental variables that
may have triggered this event. To do this, we will use a public dataset, curated by Sally and
colleagues. This dataset records coral bleaching events at 3351 locations in 81 countries from
1998 to 2017 with a suite of environmental and temperature metrics. The data is in the file
Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv and the full description
of the variables can be found in the supplementary table of the study.
Part (a)
Farhan has noticed that, on average, the North of Australia experienced higher levels of coral bleaching compared
to the South, during the global bleaching event from 2014-2017. In the paper, the authors find that the
following variables are associated with the probability of coral bleaching.
• TSA_Frequency_Standard_Deviation
• Temperature_Mean
• TSA_Frequency
• Temperature_Kelvin_Standard_Deviation
• TSA_DHW_Standard_Deviation
• SSTA_Frequency_Standard_Deviation
Create one informative graphic to visualise how these six variables are different between the North
and South of Australia during the 2014-2017 global coral bleaching event. Explain any data filtering or
transformation that you perform. Comment on the visualisation and suggest at least one variable that
appears to be different between the North and the South and thus may be associated with the higher levels
of bleaching observed in the North.
Note: the midpoint of Australia is located at -23 degrees latitude. Observations at latitudes higher than -23 degrees
are considered North Australia. Your graphic can have multiple panels.
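The filtering and North/South split described above can be sketched in base R as follows. This is a minimal illustration on toy data, not the expected solution: the column names Country, Latitude and Year are assumptions (check names() on the real file), and only two of the six variables are shown.

```r
# Toy stand-in for the Reef Check data. Country, Latitude and Year are
# ASSUMED column names; the values are made up for illustration.
reef <- data.frame(
  Country = c("Australia", "Australia", "Australia", "Australia", "Australia", "Fiji"),
  Latitude = c(-12.5, -18.2, -23.0, -27.9, -33.4, -17.8),
  Year = c(2014, 2016, 2015, 2017, 2013, 2016),
  Temperature_Mean = c(301.2, 300.4, 298.9, 296.1, 294.8, 300.0),
  TSA_Frequency = c(6, 4, 3, 1, 0, 5)
)

# Keep Australian observations from the 2014-2017 event window only.
event <- reef[reef$Country == "Australia" & reef$Year >= 2014 & reef$Year <= 2017, ]

# Latitudes higher than -23 are North; -23 itself and below are South here.
event$Region <- ifelse(event$Latitude > -23, "North", "South")

# Stack the metric columns into long format (one row per site and variable),
# the shape that a multi-panel plot (one panel per variable) usually expects.
metrics <- c("Temperature_Mean", "TSA_Frequency")
long <- data.frame(
  Region = rep(event$Region, times = length(metrics)),
  Variable = rep(metrics, each = nrow(event)),
  Value = unlist(event[metrics], use.names = FALSE)
)

# Median of each variable by region, as a quick numeric companion to the plot.
region_medians <- aggregate(Value ~ Region + Variable, data = long, FUN = median)
```

Because the variables are on very different scales, a long-format table like this lends itself to panels with free y-axes, so that no single variable dominates the graphic.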
Part (b)
Farhan is interested in exploring which reefs across the globe were the most affected by the 2014-2017 global
bleaching event. Create an interactive map visualisation that shows the average proportion of coral
bleaching between 2014 and 2017 and allows a marine scientist to identify the names of the most affected coral
reefs, the region (recorded as State.Province.Island) and the values of the measurements of the associated
environmental variables identified in part (a). Justify your choice of visualisation, and comment on the result.
List 4 regions that were severely bleached in this time period.
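Before any map is drawn, the surveys need to be averaged per reef over the event window. A rough sketch of that preparation step on toy data is given below; Reef.Name, Average_bleaching, Latitude, Longitude and Year are assumed column names (only State.Province.Island is named in the task), and the mapping library itself is left open.

```r
# Toy survey table; all column names except State.Province.Island are ASSUMED.
surveys <- data.frame(
  Reef.Name = c("Reef A", "Reef A", "Reef B", "Reef B", "Reef C"),
  State.Province.Island = c("Region 1", "Region 1", "Region 2", "Region 2", "Region 3"),
  Year = c(2015, 2016, 2016, 2010, 2017),
  Latitude = c(-10, -10, 5, 5, 20),
  Longitude = c(140, 140, 100, 100, -80),
  Average_bleaching = c(40, 60, 10, 90, 25)
)

# Restrict to the 2014-2017 event window.
event <- surveys[surveys$Year %in% 2014:2017, ]

# One row per reef: mean bleaching (and mean coordinates) over the window.
site_means <- aggregate(
  cbind(Average_bleaching, Latitude, Longitude) ~ Reef.Name + State.Province.Island,
  data = event, FUN = mean
)

# Most affected reefs first.
site_means <- site_means[order(-site_means$Average_bleaching), ]

# Text that an interactive map could show when a marker is hovered or clicked.
site_means$label <- sprintf(
  "%s (%s): mean bleaching %.1f",
  site_means$Reef.Name, site_means$State.Province.Island, site_means$Average_bleaching
)
```

The environmental variables from part (a) could be averaged in the same aggregate() call and appended to the label in the same way.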
Part (c)
Farhan wants to explore the impact of environmental variables on coral bleaching in the most affected regions.
For the regions identified in part (b), create one informative visualisation to show how the average
bleaching has changed over time (not restricted to 2014-2017), and its relationship with one of the associated
environmental variables identified in part (a). Comment on the visualisation.
Note: your graphic can have multiple panels.
Question 2: Kidney
Your friend Harry is a nephrologist (kidney specialist) who is interested in building an accurate classifier to
detect graft rejection in his kidney transplant patients. He is also interested in knowing which genes may
be affecting graft rejection. In this problem, we will build a classification model using the public data set
GSE138043. We will perform feature selection and build a classifier, estimating its accuracy on unseen data.
Part (a)
Harry wants to know the most differentially expressed genes between patients that experience graft rejection
and stable patients. Use the topTable function in the limma package to output the gene symbols of the 10
most differentially expressed genes.
Hint: in the GSE138043 dataset, the outcome is found in the characteristics_ch1 column of the phenoData
and the gene symbols are found in the gene_assignment column of the featureData, between the first and
second // symbols.
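The string handling in the hint can be sketched in base R as below. The helper name extract_symbol and the example annotation strings are made up for illustration; real gene_assignment entries may contain further fields, and treating entries without any // separator as missing is an assumption.

```r
# Return the text between the first and second "//" of each annotation string.
# Entries with no "//" (for example a placeholder such as "---") become NA.
extract_symbol <- function(gene_assignment) {
  parts <- strsplit(gene_assignment, "//", fixed = TRUE)
  vapply(parts, function(p) {
    if (length(p) < 2) NA_character_ else trimws(p[2])
  }, character(1))
}

# Hypothetical annotation strings in the "accession // symbol // description" layout.
example <- c(
  "NM_000001 // GENE1 // hypothetical gene 1 // 1p36 // 1001",
  "---"
)
symbols <- extract_symbol(example)
```

Applied to the gene_assignment column, a helper like this yields one symbol per probe, which can then be matched against the rows returned by topTable.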
Part (b)
Harry wants to build a random forest classifier to predict whether a patient is stable or experiencing graft
rejection and estimate its accuracy on unseen data. To do this, Harry tries to perform repeated cross-validation
on the entire data set, but it takes too long to run. To speed up the model training, Harry knows he can
implement feature selection in one of 3 parts of the framework in the Question 2 Part (b) appendix (OPTION A,
OPTION B, or OPTION C); however, he is not sure which one.
Explain the difference (if any) between the 3 options and which option(s) would be the most appropriate for
Harry’s task.
Part (c)
Harry wants to implement feature selection in the most appropriate option of Part (b), but he’s not sure how
many features he should select. Use the framework from part (b) to evaluate the performance of a random
forest classifier on unseen data with feature selection taking the top 10, 50 or 100 genes. Visualise your results
and comment on them. How many features would you recommend Harry to use?
Hint: if implemented correctly, this code should take no more than a few minutes to run.
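A top-k feature selection step could look roughly like the sketch below. To stay in base R it ranks features by the absolute Welch t-statistic rather than by limma's moderated statistics, so it is a swapped-in stand-in and not the method from part (a); the function name top_k_features and the toy data are hypothetical. Where the step belongs in the framework is the subject of part (b) and is not implied here.

```r
# Rank the columns of X by |t| between the two classes in y and
# return the column indices of the k highest-ranked features.
top_k_features <- function(X, y, k) {
  y <- as.factor(y)
  stopifnot(nlevels(y) == 2, k <= ncol(X))
  groups <- levels(y)
  t_stats <- apply(X, 2, function(col) {
    unname(t.test(col[y == groups[1]], col[y == groups[2]])$statistic)
  })
  order(abs(t_stats), decreasing = TRUE)[seq_len(k)]
}

# Toy data: 20 samples, 6 features, with a large shift planted in gene4.
set.seed(3888)
X_toy <- matrix(rnorm(20 * 6), nrow = 20,
                dimnames = list(NULL, paste0("gene", 1:6)))
y_toy <- rep(c("Stable", "Rejection"), each = 10)
X_toy[y_toy == "Rejection", "gene4"] <- X_toy[y_toy == "Rejection", "gene4"] + 10

keep <- top_k_features(X_toy, y_toy, k = 2)
X_selected <- X_toy[, keep, drop = FALSE]  # matrix restricted to the chosen features
```

Running the evaluation once per candidate k (10, 50, 100) and storing the accuracies in a list or data frame keyed by k makes the later visual comparison straightforward.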
Part (d)
Using the optimal number of features found in part (c), build a random forest classifier on the entire training
data set, that Harry could implement on future data. Harry wants to know which genes are the most
important in making the classification. Output the gene symbols of the top 10 genes in terms of importance in
the random forest classifier. Comment on the overlap between the top 10 important genes in the classifier and
the top 10 differentially expressed genes (if your final model only uses 10 genes, comment on the concordance
in ranking of the 10 genes).
Hint: in a random forest model fit, the feature importance can be obtained by fit$importance, where a
higher value indicates higher importance in the classifier.
Question 2 Part (b) appendix
library(Biobase)       # exprs(), pData()
library(cvTools)       # cvFolds()
library(randomForest)  # randomForest()

# gse is the GSE138043 ExpressionSet, assumed to be loaded already
set.seed(3888)
X = t(exprs(gse))  # samples in rows, genes in columns
y = ifelse(grepl("non-AR", pData(gse)$characteristics_ch1), "Stable", "Rejection")
cvK = 5      # number of folds
n_sim = 50   # number of CV repeats
cv_accuracy_gse1b = numeric(n_sim)
### OPTION A ###
for (i in 1:n_sim) {
  cvSets = cvFolds(nrow(X), cvK)
  cv_accuracy_folds = numeric(cvK)
  ### OPTION B ###
  for (j in 1:cvK) {
    test_id = cvSets$subsets[cvSets$which == j]
    X_train = X[-test_id,]
    X_test = X[test_id,]
    y_train = y[-test_id]
    y_test = y[test_id]
    ### OPTION C ###
    rf_fit = randomForest(x = X_train, y = as.factor(y_train))
    predictions = predict(rf_fit, X_test)
    cv_accuracy_folds[j] = mean(y_test == predictions)
  }
  cv_accuracy_gse1b[i] = mean(cv_accuracy_folds)
}
Question 3: Brain
Your friend Shila is a physicist who needs your help in building a classifier to detect left and right eye
movements from brain EEG signals in real time. She has a data set stored under zoe_spiker.zip that
contains brain signal series (each series is a file), which correspond to sequences of eye movements of varying
lengths.
The file name corresponds to the true eye movement. For example the file LRL_z.wav corresponds to
left-right-left eye movements; the file LLRLRLRL_z.wav corresponds to left-left-right-left-right-left-right-left
eye movements. There are a total of 31 files.
The folder also contains two RDS files which may be used to train an event detection classifier
(training_data.rds, training_labels.rds).
Part (a)
The first stage of our classifier is to identify events (eye movement). Shila has provided some training
data (training_data.rds) which corresponds to waves, and labels (training_labels.rds) where TRUE
represents the presence of an event and FALSE represents no event. Use the tsfeatures package to calculate
some autocorrelation features and build a random forest classifier to detect events.
Report and comment on the accuracy of this model.
Hint: use tsfeatures(training_data, c("acf_features")) to compute the autocorrelation features from
training_data. In a random forest model fit, the confusion matrix of out-of-bag predictions can be obtained
by fit$confusion. In a random forest classifier, the out-of-bag predictions can be treated as the predictions
on an independent data set.
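As a small illustration of the hint, the out-of-bag accuracy can be read off a confusion matrix in the layout that randomForest uses for classification: true classes in rows, predicted classes in columns, plus a final class.error column. The matrix below is made up and the helper name oob_accuracy is hypothetical; with a real fit, fit$confusion would take its place.

```r
# Overall accuracy from a randomForest-style confusion matrix.
# The class.error column is dropped before counting.
oob_accuracy <- function(confusion) {
  counts <- confusion[, colnames(confusion) != "class.error", drop = FALSE]
  sum(diag(counts)) / sum(counts)
}

# Made-up counts: 40 + 45 correct out of 100 windows.
toy_confusion <- matrix(
  c(40, 5, 10, 45, 0.2, 0.1),
  nrow = 2,
  dimnames = list(c("FALSE", "TRUE"), c("FALSE", "TRUE", "class.error"))
)

accuracy <- oob_accuracy(toy_confusion)
```

The per-class error rates in the class.error column are worth reporting alongside the overall accuracy, since a classifier can score well overall while missing most events if one class dominates.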
Part (b)
Build a classification rule for detecting {L,R} under a streaming condition, using the trained Random
Forest model from part (a) in a window to identify events, and using the min-max rule to classify events into
“Left” or “Right” (Lab 3 Exercise 2.3). Demonstrate your classifier on a length-3, a length-8 and a long wave file (note
that the result should be reasonable, but doesn’t have to be good). You may use the code template in the
Question 3 Part (b) appendix.
Part (c)
Shila thinks multiple window sizes must be evaluated to find the best Random Forest streaming classifier.
Compare the performance of the Random Forest streaming classifier for detecting {L,R} under a streaming
condition, using multiple window sizes. Use the short wave files to evaluate performance. Which window
size gives the best performance? Justify your answer with appropriate visualisations.
Hint: you may use the Levenshtein similarity metric to evaluate the accuracy of your predictions. This can be
computed via stringdist::stringsim, with method set to "lv".
The increment of your window should always be 1/3 of the window size.
increment = window_size/3
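To make the evaluation metric concrete, here is a base R sketch of a Levenshtein similarity built on adist(), normalised by the length of the longer string. To the best of my knowledge this matches the normalisation of stringdist::stringsim with method "lv", but that equivalence is an assumption to verify. The helper names are hypothetical, and the file-name parsing follows the LRL_z.wav pattern described earlier.

```r
# Levenshtein similarity in [0, 1] between two single strings:
# 1 minus the edit distance divided by the length of the longer string.
lv_similarity <- function(predicted, truth) {
  longest <- max(nchar(predicted), nchar(truth))
  if (longest == 0) return(1)  # two empty strings are treated as identical
  1 - as.numeric(adist(predicted, truth)) / longest
}

# True sequence encoded in a file name such as "LRL_z.wav".
truth_from_file <- function(path) {
  sub("_.*$", "", basename(path))
}

truth <- truth_from_file("LRL_z.wav")
perfect <- lv_similarity("LRL", truth)  # identical strings
one_off <- lv_similarity("LLL", truth)  # one substitution out of three
too_short <- lv_similarity("LR", "LRLR")  # two missing characters out of four
```

Averaging this similarity over the short wave files for each candidate window size gives one summary number per window size to plot.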
Part (d)
Shila’s friend Jean thinks a zero-crossing classification rule will perform just as well as the Random Forest
classifier.
Build a classification rule for detecting {L,R} under a streaming condition, using the number of zero-
crossings in a window to identify events (from Lab 3 Exercise 1.3), and using the min-max rule to classify
events into “Left” or “Right” (Lab 3 Exercise 2.3). Use the optimal window size identified in part (c) as the
window size.
Jean also thinks multiple thresholds must be evaluated to find the best zero-crossings classification rule.
Compare the performance of the zero-crossings classification rule using multiple thresholds on the short wave
files. Which threshold gives the best performance? Justify your answer with appropriate visualisations.
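Counting zero-crossings in a window can be sketched in base R as below. This is one common definition (a sign change between consecutive non-zero samples) and may differ in detail from Lab 3 Exercise 1.3; the helper name count_zero_crossings is hypothetical, and how the count is compared with a threshold to declare an event is left to the lab's rule.

```r
# Number of sign changes in a numeric window.
# Exact zeros are dropped first, so c(1, 0, -1) counts as one crossing.
count_zero_crossings <- function(window) {
  signs <- sign(window)
  signs <- signs[signs != 0]
  sum(diff(signs) != 0)
}

alternating <- count_zero_crossings(c(1, -1, 1, -1))  # sign flips at every step
flat <- count_zero_crossings(c(1, 2, 3))              # no sign change
through_zero <- count_zero_crossings(c(1, 0, -1))     # passes through an exact zero
```

Evaluating the streaming rule over a grid of thresholds then amounts to repeating the classification with each threshold and recording the resulting similarity to the true sequence.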
Part (e)
For each of the best models that you found in part (c) and part (d), evaluate its performance on sequences
of varying lengths. Does the length of the sequence have an impact on the classification accuracy? Which
classifier performs the best on this data set, and why might you choose one over the other? Justify your
answer with appropriate visualisations.
Question 3 Part (b) appendix
ts_features_classifier = function(wave_file,
                                  window_multiplier = 1) {
  window_size = wave_file@samp.rate*window_multiplier
  increment = window_size/3
  Y = wave_file@left
  xtime = seq_len(length(Y))/wave_file@samp.rate
  predicted_labels = c()
  window_lb = 1
  max_time = length(Y)
  while(max_time > window_lb + window_size) {
    window_ub = window_lb + window_size
    window = Y[window_lb:window_ub]
    event =
    if (event) {
      predicted =
      predicted_labels = c(predicted_labels, predicted)
      window_lb = window_lb + window_size
    } else {
      window_lb = window_lb + increment
    }
  }
  return(paste(predicted_labels, collapse = ""))
}