DATA3888 (2021): Assignment 1
Question 1: Brain-box
Build a classification rule for detecting {L, R} under streaming conditions, where the function takes a
sequence of signals as input. Note that this is slightly different from detecting {L, R} for a given sequence.
• (i) Estimate the accuracy of your classifier. Is your value reasonable?
• (ii) Does the length of the sequence affect the performance of your classifier?
Hint:
(a) Consider what metric you will use to define “performance”. You will need to explain your choice and
justify your answer.
(b) You can use data generated by either Louis (Spiker_box_Louis.zip) or Zoe (zoe_spiker.zip).
(c) The code below is a guide only; you do not need to follow the structure.
streaming_classifier = function(wave_file,
                                window_size = wave_file@samp.rate,
                                increment = window_size/10)
{
  Y = wave_file@left
  xtime = seq_len(length(wave_file@left))/wave_file@samp.rate
  predicted_labels = c()
  lower_interval = 1
  ## signal length in samples (time in seconds times the sampling rate)
  max_time = max(xtime)*wave_file@samp.rate
  while(max_time > lower_interval + window_size)
  {
    upper_interval = lower_interval + window_size
    interval = Y[lower_interval:upper_interval]
    predicted = NA ## placeholder: apply your own classification rule to interval
    predicted_labels = c(predicted_labels, predicted)
    lower_interval = lower_interval + increment
  } ## end while
  return(predicted_labels)
}## end function
Question 2: Biomedical COVID19 data
Consider the prevalidation principle, where a molecular signature (set of features) from a given omics
data platform is used to obtain a single variable known as the prevalidated outcome. Next, we model this
prevalidated outcome in combination with the other clinical variables to build a classifier of the outcome
of interest. In this exercise, ignoring healthy individuals,
• (i) build a classifier to predict disease outcome (moderate vs severe), including a feature selection
component on the proteomics data. Illustrate your comparison results using boxplots (similar to
the sample code in #3.6); and
• (ii) generate a prevalidated outcome from the proteomics data and use it together with the clinical
variables in a logistic regression to build a classifier.
Describe your final model for classifying severe and non-severe individuals and your estimate of its accuracy.
Note: The prevalidation procedure, similar in concept to the cross-validation procedure, is detailed and
graphically presented below. The five steps are:
• Step 1. Divide the samples into k equal parts.
• Step 2. Set aside one part as the test set component.
• Step 3. A protein signature (set of features) is obtained using the training set (k − 1 parts), and a
classifier is trained on the training set on the protein signature.
• Step 4. Use this classifier to predict the outcome class of the held-out part (from Step 2).
• Step 5. Repeat steps 2-4 for all k parts, resulting in a prevalidated vector of estimates for the protein
data. This prevalidated vector (denoted as APV) is a complete prediction vector with one prediction
for each sample.
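The five steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not a model answer: the data are simulated, the names X, y, age and apv are hypothetical, and the t-statistic ranking and logistic regression are stand-ins for whichever feature selection method and classifier you choose.

```r
## Hypothetical simulated data: n samples, p proteins, binary outcome.
set.seed(1)
n = 60; p = 50; k = 5
X = matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("prot", 1:p)))
y = rep(c(0, 1), each = n / 2)            # 0 = moderate, 1 = severe (assumed coding)
X[y == 1, 1:3] = X[y == 1, 1:3] + 1       # make three proteins informative
age = rnorm(n, mean = 50, sd = 10)        # stand-in clinical variable

## Step 1: divide the samples into k (roughly) equal parts.
fold = sample(rep(1:k, length.out = n))
apv = numeric(n)                          # prevalidated vector

for (j in 1:k) {
  ## Step 2: set aside one part as the test set.
  test = fold == j
  y_train = y[!test]
  ## Step 3: select a signature on the training parts only
  ## (top 5 proteins by absolute t-statistic), then train a classifier on it.
  tstat = apply(X[!test, ], 2, function(x)
    t.test(x[y_train == 1], x[y_train == 0])$statistic)
  signature = order(abs(tstat), decreasing = TRUE)[1:5]
  train = data.frame(y = y_train, X[!test, signature])
  fit = glm(y ~ ., data = train, family = binomial)
  ## Step 4: predict the held-out part.
  held_out = data.frame(X[test, signature, drop = FALSE])
  apv[test] = predict(fit, newdata = held_out, type = "response")
}

## Step 5: apv now holds one prediction per sample; combine it with
## the clinical variable in a logistic regression.
final_model = glm(y ~ apv + age, family = binomial)
```

Because the signature is re-selected inside every fold, each entry of apv comes from a model that never saw that sample, which is what allows it to be compared fairly with the clinical variables.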
Question 3: Lag time estimation
For the months of March to May 2020, estimate the lag time between the number of daily new cases (new_cases)
and the number of hospital patients (hosp_patients) for all countries with data available, and display
your results on a world map. Is this visualisation appropriate in this context? Please explain your response
and recommend a better choice if you don’t think it is appropriate (illustrations are welcome).
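One possible way to estimate a lag is to shift one series against the other and pick the shift with the highest correlation. The sketch below assumes this lagged-correlation approach, which the question does not prescribe. The helper estimate_lag, its max_lag argument and the simulated series are all hypothetical; only the column names new_cases and hosp_patients come from the question.

```r
## Return the shift (in days) at which hosp best correlates with earlier cases.
## estimate_lag and max_lag are illustrative names, not part of any package.
estimate_lag = function(cases, hosp, max_lag = 21) {
  n = length(cases)
  lags = 0:max_lag
  cors = sapply(lags, function(k)
    cor(cases[1:(n - k)], hosp[(1 + k):n], use = "complete.obs"))
  lags[which.max(cors)]
}

## Simulated example: hospital patients trail new cases by 7 days.
set.seed(1)
true_lag = 7
n_days = 92                                # 1 March to 31 May is 92 days
t_all = 1:(n_days + true_lag)
cases_all = 1000 * exp(-((t_all - 45) / 15)^2) +
  rnorm(n_days + true_lag, sd = 50)
new_cases = cases_all[(true_lag + 1):(n_days + true_lag)]
hosp_patients = 0.2 * cases_all[1:n_days] + rnorm(n_days, sd = 5)

est_lag = estimate_lag(new_cases, hosp_patients)
```

With real data, the same function would be applied once per country, after filtering each country's rows to the date range of interest. Very smooth epidemic curves give a broad correlation peak, so the estimate can be sensitive to the choice of max_lag and to missing values.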
[Bonus question] For the months of August to November 2020, estimate the lag time between the number
of daily new cases (new_cases) and the number of hospital patients (hosp_patients). Compare this
estimate with the one for March to May 2020. Describe your observations: what did you learn from
this data?