Project #4
MA 402
Mathematics of Scientific Computing
Due: Wed, December 1
Writeup (4 pts)
In general, your writeup for each part should include the following:
• A short introduction explaining the nature and purpose of the assignment.
• A brief description of your code or algorithms, if necessary. Ideally, your
actual code is supplementary (rather than copy/pasted into the report): a
reader should be able to understand the key ideas behind your programs
without having to read the code itself.
• A section for your plots/tables/figures, appropriately labeled and with
further description if necessary.
• A paragraph for your conclusions. What valuable life lessons did you draw
from the assignment?
The aim is that a student who knows a little about the core material, but has
not seen or worked on this assignment specifically, should be able to read your
writeup and get an idea of what you did and why.
Files submitted should include the following:
• Your report (PDF/docx), which includes all relevant plots/tables/figures.
• Any code (.m/.py/.ipynb) used for your experiments.
Figure 1: Student 29 (average rating 1.2) was a tough customer. Student 11
(average rating 8.0) was not.
Part 1: Joke Recommendations (6 pts)
The file jokeData.mat contains ratings for 42 jokes from 63 students. Here you
will design a method for recommending jokes.
a) The model will use ratings from the first 53 students. The final 10 will be
reserved as a test set for assessing the performance of the recommendation
system.
b) The model will take as input ratings for the 10 jokes with numbers
{1, 2, 8, 15, 22, 26, 32, 33, 36, 42}
and return as output predicted ratings for the remaining 32 jokes.
c) For a test student who rates the 10 listed jokes, find the K = 3 students
out of the training set who have the most similar taste. Use those students’
ratings to predict how the test student would rate the remaining jokes.
d) Assess the quality of your recommendation system, particularly in
comparison to the baseline model that simply outputs the average rating
of each joke over the training set.
e) (Bonus, zero points) How do the results change for different values of K?
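One way to sketch steps (c) and (d) in Python/numpy (either .m or .py is acceptable for submission): measure "similar taste" as Euclidean distance on the 10 shared jokes, then average the K nearest training students' ratings on the remaining 32. The function and variable names here are hypothetical, the distance metric and the use of a plain mean are design choices you may vary, and the joke indices are 0-based versions of the listed joke numbers.

```python
import numpy as np

def recommend(ratings_train, test_ratings, given_idx, K=3):
    """Predict a test student's ratings for the jokes they have not rated.

    ratings_train : (n_train, n_jokes) matrix of training ratings
    test_ratings  : ratings the test student gave, aligned with given_idx
    given_idx     : 0-based indices of the jokes the test student rated
    """
    n_jokes = ratings_train.shape[1]
    hidden_idx = np.setdiff1d(np.arange(n_jokes), given_idx)
    # Euclidean distance on the shared jokes measures "similar taste"
    dists = np.linalg.norm(ratings_train[:, given_idx] - test_ratings, axis=1)
    neighbors = np.argsort(dists)[:K]  # the K most similar training students
    # Predict with the neighbors' mean rating on the remaining jokes
    return hidden_idx, ratings_train[np.ix_(neighbors, hidden_idx)].mean(axis=0)
```

For part (d), the baseline prediction is simply `ratings_train[:, hidden_idx].mean(axis=0)`, and both models can be compared by, e.g., mean absolute error against the 10 held-out test students' true ratings.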
Part 2: Linear Classifier (10 pts)
The file cancerData.mat contains observations on 4000 predictors for ovarian
cancer over 216 patients, split into a training group of size 166 and a testing
group of size 50. Here you will use a linear classifier to predict whether a given
patient has cancer.
a) Partly for visualization purposes, start by reducing the training data to
the first 2 principal components. Reduce the testing data using
X̂_test = (X_test − µ_train) V_train,
so that the training and testing data are reduced along the same
coordinate system.
b) Create a numeric variable for the categorical labels, where 1 represents
cancer and -1 represents no cancer.
c) Fit a linear model to the training data using linear regression.
d) Using τ = 0 as the decision boundary, apply your model to the training
data. What is the overall accuracy rate? Compute the confusion matrix
and interpret the results. Repeat for the testing data; how do the results
compare?
e) Plot ROC and precision-recall curves showing how your model performs
on the training data, treating cancer as the “positive” result. Is there a
metric for this application that you would consider more important than
the overall accuracy rate? At what point along the curve do you think the
model gives the most valuable performance?
f) The matrix obsTrain has full row rank, so it is possible to find a linear
combination of the observations that predicts the training data perfectly.
Why would this not necessarily be a good idea?
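Steps (a)-(d) can be sketched as follows in Python/numpy (a .py submission is accepted alongside .m). This is a minimal sketch, assuming the data are arranged as rows-of-observations matrices; the function names are hypothetical, and V_train here is taken from the SVD of the mean-centered training data, matching the reduction formula in part (a).

```python
import numpy as np

def pca_reduce(Xtrain, Xtest, k=2):
    """Project both data sets onto the first k principal components
    of the training data, so they share one coordinate system."""
    mu = Xtrain.mean(axis=0)
    # Right singular vectors of the centered training data are the
    # principal directions V_train
    _, _, Vt = np.linalg.svd(Xtrain - mu, full_matrices=False)
    V = Vt[:k].T
    return (Xtrain - mu) @ V, (Xtest - mu) @ V

def fit_linear(Z, y):
    """Least-squares fit of the +/-1 labels: y ~ [1, Z] w."""
    A = np.hstack([np.ones((len(Z), 1)), Z])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def classify(Z, w, tau=0.0):
    """Label +1 (cancer) when the fitted value exceeds the threshold tau."""
    A = np.hstack([np.ones((len(Z), 1)), Z])
    return np.where(A @ w > tau, 1, -1)
```

Accuracy is then `(classify(Ztrain, w) == ytrain).mean()`, and the 2×2 confusion matrix can be tallied by counting each (true label, predicted label) pair; sweeping tau over the fitted values traces out the ROC and precision-recall curves for part (e).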
Potentially useful commands
• knnsearch for finding K-nearest neighbors
• pdist2 quickly finds pairwise distances
• double, categorical, logical for changing data types
• grp2idx converts categorical to numeric data
• perfcurve for precision-recall and ROC curves
• confusionmat, confusionchart compute and plot confusion matrices
Figure 2: Plot of training data along first two principal components; a linear
classifier will not be able to reach 100 percent accuracy.