This assessment is due 12:00pm (mid-day) Friday 26th March 2021.
You are required to submit two files. One file should contain your complete
solutions, including relevant output and plots obtained when running your R
code. The other file should be a .R file containing your annotated R code used
to obtain your solutions. You will not receive any marks for submitting only a
.R file.
Note that it is not necessary to include code used for data manipulation and
plotting in your .R file. Plotting and data manipulation may be performed in
any suitable software package.
In linguistics, a phoneme is the smallest unit of sound that distinguishes one word from
another. This question concerns classification of the phonemes “aa”, as in the sound of the
vowel “a” in the word “dark”, and “ao”, as in the sound of the vowel “a” in the word
“water”, using samples from the TIMIT speech recognition database (TIMIT Acoustic-
Phonetic Continuous Speech Corpus, NTIS, US Dept of Commerce).
The data for this question were obtained from speech frames of phonemes taken from samples
of continuous speech by different speakers. The data are available on Blackboard as ASCII
files and consist of a column labelled “speaker”, a response column labelled “g” and 256
columns labelled “x.1” - “x.256”. Each row of “x.1” - “x.256” is a log-periodogram computed
from a speech frame of either “aa” or “ao” measured at 256 frequencies. A log-periodogram
is a widely used method for converting speech to a form suitable for speech recognition.
The training data are in the file phoneme_train.txt and the test data are in phoneme_test.txt.
(a) Using the training data, draw line plots of a sample of 20 log-periodograms for the
phoneme “aa” and a sample of 20 log-periodograms for the phoneme “ao” against
frequencies 1 to 256, on the same graph. The log-periodogram values should be on the
y-axis and the frequencies on the x-axis.
Comment on the plots.
[5 marks]
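Hint (illustrative sketch only): the layout of such a plot can be tried out on simulated data before the real file is loaded. The data frame below is a synthetic stand-in, not the phoneme data; the column names follow the description above, while the sample sizes and colours are arbitrary assumptions.

```r
# Synthetic stand-in for the training data (assumption: simulated values,
# with columns "g" and "x.1" ... "x.256" as described in the question).
set.seed(1)
n <- 100
p <- 256
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(x) <- paste0("x.", 1:p)
train <- data.frame(g = rep(c("aa", "ao"), each = n / 2), x)

# Sample 20 rows for each phoneme.
idx_aa <- sample(which(train$g == "aa"), 20)
idx_ao <- sample(which(train$g == "ao"), 20)
x_cols <- paste0("x.", 1:p)

# matplot() draws one line per column, so transpose to 256 x 40.
curves <- t(as.matrix(train[c(idx_aa, idx_ao), x_cols]))

matplot(1:p, curves, type = "l", lty = 1,
        col = rep(c("blue", "red"), each = 20),
        xlab = "Frequency", ylab = "Log-periodogram")
legend("topright", legend = c("aa", "ao"), col = c("blue", "red"), lty = 1)
```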

(b) Perform logistic regression on the training data in order to predict a phoneme using its
log-periodogram and obtain the confusion matrix and training error rate. Do not include
an intercept term in your model.
Also compute the test error rate and a bootstrap 95% confidence interval for the test
error based on 1000 bootstrap estimates. You should include your code for computing
the bootstrap estimate in the hard copy of your solution.
Comment on your results.
[20 marks]
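Hint (illustrative sketch only): the mechanics of a no-intercept logistic fit, a confusion matrix and a percentile bootstrap interval are shown below on simulated data with 5 predictors rather than 256. The data, the helper names make_data and predict_class, the 0.5 threshold and the choice to resample test cases are all assumptions made for illustration; other bootstrap schemes are possible.

```r
set.seed(1)
p <- 5  # small p for illustration only; the real data have 256 predictors

# Hypothetical helper: simulate a data frame shaped like the phoneme data.
make_data <- function(n) {
  x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x.", 1:p)))
  prob <- plogis(as.vector(x %*% c(1, -1, 0.5, 0, 0)))
  g <- factor(ifelse(runif(n) < prob, "ao", "aa"), levels = c("aa", "ao"))
  data.frame(g = g, x)
}
train <- make_data(300)
test  <- make_data(200)

# "- 1" in the formula drops the intercept term.
fit <- glm(g ~ . - 1, data = train, family = binomial)

# Hypothetical helper: classify as "ao" when the fitted P(ao) exceeds 0.5.
predict_class <- function(model, newdata) {
  pr <- predict(model, newdata = newdata, type = "response")
  factor(ifelse(pr > 0.5, "ao", "aa"), levels = c("aa", "ao"))
}

train_pred <- predict_class(fit, train)
conf_train <- table(predicted = train_pred, actual = train$g)
train_err  <- mean(train_pred != train$g)

# Indicator of misclassification for each test case.
test_wrong <- predict_class(fit, test) != test$g
test_err   <- mean(test_wrong)

# Percentile bootstrap: resample test cases with replacement and
# recompute the error rate 1000 times.
boot_err <- replicate(1000, mean(sample(test_wrong, replace = TRUE)))
ci <- quantile(boot_err, c(0.025, 0.975))
```

Here the fitted model is held fixed and only the test cases are resampled, so the interval reflects sampling variability in the test set alone.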
(c) Repeat part (b) using a QDA model and comment on the results. Also compare the
error rates with those obtained for the logistic regression model.
[20 marks]
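Hint (illustrative sketch only): a QDA fit can be obtained with qda() from the MASS package. The example below uses simulated data with 4 predictors; the data-generating choices and the helper name make_data are assumptions. QDA estimates a separate covariance matrix for each class, so each class needs more observations than there are predictors.

```r
library(MASS)
set.seed(2)
p <- 4  # small p for illustration only

# Hypothetical helper: class "ao" has a shifted mean and a larger spread.
make_data <- function(n) {
  g <- factor(sample(c("aa", "ao"), n, replace = TRUE), levels = c("aa", "ao"))
  shift <- ifelse(g == "ao", 1, 0)
  sdev  <- ifelse(g == "ao", 2, 1)
  # Row i of the noise matrix is scaled by sdev[i] and shifted by shift[i].
  x <- matrix(rnorm(n * p), n, p) * sdev + shift
  colnames(x) <- paste0("x.", 1:p)
  data.frame(g = g, x)
}
train <- make_data(300)
test  <- make_data(200)

qda_fit <- qda(g ~ ., data = train)

train_pred <- predict(qda_fit, train)$class
test_pred  <- predict(qda_fit, test)$class

conf_train <- table(predicted = train_pred, actual = train$g)
conf_test  <- table(predicted = test_pred, actual = test$g)
train_err  <- mean(train_pred != train$g)
test_err   <- mean(test_pred != test$g)
```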
(d) Here we investigate improving the test error rate of the logistic regression model in part
(b) by constructing a simple filter.
[Figure: estimated logistic regression parameters from part (b) plotted against frequency, 1 to 256.]
The figure above is a plot of the estimated parameters $\hat\beta_1, \ldots, \hat\beta_{256}$ from the fitted logistic
model in part (b) with predictors x.1, ..., x.256 against the frequencies 1, ..., 256. The
rapid fluctuations seen in the plot indicate strong negative correlation between
neighbouring estimates and are due to the neighbouring frequencies in the speech frames being
highly positively correlated.
We now wish to construct a logistic regression model in which the parameter estimates
are forced to vary smoothly with frequency.
(i) Write R code to generate 13 natural cubic spline basis functions with knots
uniformly placed over the integers 1, 2, ..., 256, representing the frequencies, and
construct the $256 \times 13$ basis matrix B.
What are the elements in the last 6 rows of your matrix B?
Note: To avoid the elements in B becoming too large, you should rescale the
frequencies 1, 2, ..., 256 to take values between 0 and 1. The following code constructs
13 knots evenly placed between the rescaled frequencies.
knots<-quantile(1:256/256, probs=seq(0, 1, length.out=13))
[22 marks]
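Hint (illustrative sketch only): one way to obtain a 256 x 13 natural cubic spline basis from these knots is ns() in the splines package. This is an assumption about a convenient tool, not necessarily the construction intended here: ns() uses a B-spline-based parameterisation, so the individual entries of B will differ from those of a basis written out by hand (for example a truncated power basis), even though both span the same space of natural cubic splines.

```r
library(splines)

freq  <- 1:256 / 256
knots <- quantile(freq, probs = seq(0, 1, length.out = 13))

# ns() takes interior and boundary knots separately. With intercept = TRUE,
# 11 interior knots give 11 + 2 = 13 basis functions.
B <- ns(freq,
        knots = knots[2:12],
        Boundary.knots = knots[c(1, 13)],
        intercept = TRUE)

dim(B)      # number of rows and columns of the basis matrix
tail(B, 6)  # last 6 rows
```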
(ii) “Filter” the predictors x = (x.1, ..., x.256) in the training data by computing $x^* = xB$
and fit a linear logistic regression model (without an intercept term) using $x^*$
as your predictor variables.
What are the values of the parameter estimates $\hat\beta^*_1, \ldots, \hat\beta^*_{13}$ in your model?
[5 marks]
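Hint (illustrative sketch only): the filtering step is a single matrix product, after which the 13 filtered columns are used as predictors. The example below uses simulated predictors and a simulated smooth coefficient curve; the ns() basis, the variable names and the data are assumptions for illustration and will not reproduce results for the phoneme data.

```r
library(splines)
set.seed(3)

n     <- 400
freq  <- 1:256 / 256
knots <- quantile(freq, probs = seq(0, 1, length.out = 13))
B <- ns(freq, knots = knots[2:12],
        Boundary.knots = knots[c(1, 13)], intercept = TRUE)

# Simulated predictors and a smooth "true" coefficient curve.
x <- matrix(rnorm(n * 256), n, 256)
beta_true <- sin(2 * pi * freq) / 4
g <- rbinom(n, 1, plogis(as.vector(x %*% beta_true)))

# Filter: each row of x (length 256) becomes a row of x_star (length 13).
x_star <- x %*% B

# Logistic regression on the filtered predictors, without an intercept.
fit <- glm(g ~ x_star - 1, family = binomial)
beta_star <- coef(fit)

# Map the 13 estimates back to a smooth curve over the 256 frequencies.
beta_smooth <- as.vector(B %*% beta_star)
```

Because the smoothed coefficients are a linear combination of the 13 basis columns, they vary smoothly with frequency by construction.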
(iii) Construct the plot of $\hat\beta_1, \ldots, \hat\beta_{256}$ from the model in part (b) against
frequencies 1, ..., 256, as shown in the figure above, and superimpose on it a plot of the
smoothed parameter estimates $\hat\beta_{\text{smooth}} = B\hat\beta^*$, where $\hat\beta^*$ is the vector
$(\hat\beta^*_1, \ldots, \hat\beta^*_{13})^T$.
Comment on your smoothed curve.
[8 marks]
(iv) Obtain the training and test confusion matrices and error rates for your model
based on the filtered predictors.
[15 marks]
(e) Compare the error rates obtained in part (d) with those obtained for the logistic
regression model in part (b) and write a summary of your findings.
[5 marks]