DATA3888 (2021): Assignment 1
Question 1: Brain-box
Build a classification rule for detecting {L, R} under streaming conditions, where the function takes a
signal sequence as input. Note that this is slightly different from detecting {L, R} for a given sequence.
• (i) Estimate the accuracy of your classifier. Is your value reasonable?
• (ii) Does the length of the sequence affect the performance of your classifier?
(a) Consider what metric you will use to define “performance”. You will need to explain your choice and
justify your answer.
(b) You can use data generated by either Louis ( or Zoe (
(c) The code below is a guide only, you do not need to follow the structure.
streaming_classifier = function(wave_file,
                                window_size = wave_file@samp.rate,
                                increment = window_size/10) {
  Y = wave_file@left
  xtime = seq_len(length(wave_file@left))/wave_file@samp.rate  ## time in seconds
  predicted_labels = c()
  lower_interval = 1
  max_time = length(Y)  ## total number of samples in the recording
  while (max_time > lower_interval + window_size) {
    upper_interval = lower_interval + window_size
    interval = Y[lower_interval:upper_interval]
    predicted = NA  ## placeholder: apply your own classification rule to `interval`
    predicted_labels = c(predicted_labels, predicted)
    lower_interval = lower_interval + increment
  } ## end while
  return(predicted_labels)
} ## end function
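As an illustration of the streaming idea above, here is a minimal sketch in base R that runs on a simulated signal rather than a real wave file. Everything in it is an assumption made for illustration: the function names `classify_window` and `streaming_sketch`, the amplitude threshold, and especially the rule that a peak followed by a trough means "L" (and the reverse means "R"). Your own data may behave differently, so treat it as a starting point only.

```r
## Hypothetical rule (an assumption, not the course's method): a window
## contains an event if it has both a clear peak and a clear trough;
## the event is "L" if the peak comes first, otherwise "R".
classify_window = function(interval, amp_threshold = 0.5) {
  has_peak = max(interval) > amp_threshold
  has_trough = min(interval) < -amp_threshold
  if (!(has_peak && has_trough)) return("none")
  if (which.max(interval) < which.min(interval)) "L" else "R"
}

streaming_sketch = function(Y, samp_rate, window_size = samp_rate,
                            increment = window_size / 10) {
  window_labels = character(0)
  lower_interval = 1
  while (lower_interval + window_size - 1 <= length(Y)) {
    interval = Y[lower_interval:(lower_interval + window_size - 1)]
    window_labels = c(window_labels, classify_window(interval))
    lower_interval = lower_interval + increment
  }
  ## Overlapping windows see the same movement several times, so collapse
  ## runs of identical labels and keep one label per detected event.
  runs = rle(window_labels)$values
  list(window_labels = window_labels, events = runs[runs != "none"])
}

## Simulated 6-second signal at 100 Hz with one "L" and one "R" movement
samp_rate = 100
Y = rep(0, 6 * samp_rate)
Y[101:120] = 1;  Y[131:150] = -1   # peak then trough: simulated "L"
Y[401:420] = -1; Y[431:450] = 1    # trough then peak: simulated "R"

result = streaming_sketch(Y, samp_rate)
truth = c("L", "R")
accuracy = mean(result$events == truth)  # only valid when the event counts match
```

Accuracy on matched events is only one possible metric. In a streaming setting, missed and spurious events also matter, which is part of what (a) asks you to consider.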
Question 2: Biomedical COVID19 data
Consider the prevalidation principle, where a molecular signature (set of features) from a given omics
data platform is used to obtain a single variable known as the prevalidated outcome. Next, we model this
prevalidated outcome in combination with the other clinical variables to build a classifier for the outcome
of interest. In this exercise, ignoring healthy individuals,
• (i) build a classifier to predict disease outcome (moderate vs severe), including a feature selection
component on the proteomics data. Illustrate your comparison results using a boxplot (similar to
the sample code in #3.6); and
• (ii) generate a prevalidated outcome from the proteomics data and use it together with the clinical
variables in a logistic regression to build a classifier.
Describe your final model for classifying severe and non-severe individuals and your estimate of its accuracy.
Note: The prevalidation procedure, which is similar in concept to the cross-validation procedure, is detailed
and graphically presented below. The five steps are:
• Step 1. Divide the samples into k equal parts.
• Step 2. Set aside one part as the test set component.
• Step 3. A protein signature (set of features) is obtained using the training set (k − 1 parts), and a
classifier is trained on the training set on the protein signature.
• Step 4. Use this classifier to predict the survival class of the kth part (from Step 2).
• Step 5. Repeat steps 2-4 for all k parts, resulting in a prevalidated vector of estimates for the protein
data. This prevalidated vector (denoted as APV) is a complete prediction vector with one prediction
for each sample.
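The five steps above can be sketched in base R as follows. This is a minimal illustration on simulated data, not the assignment dataset: the protein matrix `X`, the clinical variable `age`, the outcome `y`, the choice of a t-statistic filter keeping the top 5 proteins, and the use of logistic regression inside each fold are all assumptions made for the example. The prevalidated vector is stored here as a predicted probability, which is also a choice rather than a requirement.

```r
set.seed(3888)
n = 100; p = 50; k = 5

## Simulated data (assumption): 50 "proteins", one clinical variable
X = matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("prot", 1:p)))
age = rnorm(n, mean = 50, sd = 10)
y = rbinom(n, 1, plogis(X[, 1] + X[, 2] - X[, 3] + 0.05 * (age - 50)))  # 1 = severe

folds = sample(rep(1:k, length.out = n))   # Step 1: k roughly equal parts
apv = numeric(n)                           # prevalidated vector

for (f in 1:k) {
  test = folds == f                        # Step 2: held-out part
  train = !test

  ## Step 3: feature selection using the training set only
  tstat = apply(X[train, ], 2, function(x) {
    t.test(x[y[train] == 1], x[y[train] == 0])$statistic
  })
  top = colnames(X)[order(abs(tstat), decreasing = TRUE)[1:5]]
  train_df = data.frame(y = y[train], X[train, top, drop = FALSE])
  fit = glm(y ~ ., data = train_df, family = binomial)

  ## Step 4: predict the held-out part with the fold-specific classifier
  test_df = data.frame(X[test, top, drop = FALSE])
  apv[test] = predict(fit, newdata = test_df, type = "response")
}

## Step 5 is complete once every sample has a prediction; the prevalidated
## outcome then enters a logistic regression alongside the clinical variable.
final = glm(y ~ apv + age, family = binomial)
summary(final)$coefficients
```

Because feature selection is repeated inside every fold, no sample contributes to the signature that is later used to predict it. That is the property that lets `apv` be compared fairly with the clinical variables.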
Question 3: Lag time estimation
For the months of March to May 2020, estimate the lag time between the number of daily new cases (new_cases)
and the number of hospital patients (hosp_patients) for all countries with data available, and display
your results on the world map. Is this visualisation appropriate in this context? Please explain your response
and recommend a better choice if you don’t think this is appropriate (illustration is welcome).
[Bonus question] For the months of August to November 2020, estimate the lag time between the number
of daily new cases (new_cases) and the number of hospital patients (hosp_patients). Compare this
estimate with the one for March to May 2020. Describe your observations. What did you learn from
this data?
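One possible way to estimate the lag asked for in this question is the cross-correlation function, available as `ccf` in base R's stats package. The sketch below uses simulated series for a single hypothetical country, with a lag of 7 days built in so the estimate can be checked. The numbers, the noise levels and the 21-day search range are assumptions, and cross-correlation is only one of several reasonable approaches.

```r
set.seed(1)
n_days = 92      # roughly March to May
true_lag = 7     # lag built into the simulation (assumption)

## Simulated daily new cases: an epidemic-like bump plus day-to-day noise
t = seq_len(n_days + true_lag)
cases_ext = pmax(0, 1000 * exp(-((t - 50)^2) / (2 * 12^2)) +
                    rnorm(length(t), sd = 300))

## hosp_patients on day d follows new_cases on day d - true_lag
new_cases = cases_ext[(true_lag + 1):(n_days + true_lag)]
hosp_patients = 0.1 * cases_ext[1:n_days] + rnorm(n_days, sd = 5)

## ccf(x, y) at lag k estimates cor(x[t + k], y[t]), so a positive k here
## means hospital patients trail new cases by k days.
cc = ccf(hosp_patients, new_cases, lag.max = 21, plot = FALSE)
lags = as.vector(cc$lag)
est_lag = lags[which.max(as.vector(cc$acf))]
est_lag
```

With real data the same calculation would be repeated per country, giving one estimated lag per country. Series with missing values need handling first, for example through the `na.action` argument of `ccf`.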