NOTE:
Calculators are permitted
There are 5 questions, with a total of 125 marks.
STATS 369
1. In the following R code:
ms <- src(MonetDBLite::src_monetdblite("WORDS/DB"))
glove <- tbl(ms, "glove")
db_words <- copy_to(ms, current_words)
word_mat <- db_words %>%
  inner_join(glove, by = "word") %>%
  select(-word) %>%
  as.matrix()
(a) What does inner_join do?
(5 marks)
(b) What does select(-word) do?
(5 marks)
(c) At what point does the SQL query involving the inner join get run?
(5 marks)
(d) Give an advantage and a disadvantage of working with data stored in a database rather than in memory.
(5 marks)
(20 marks total)
2. Random forests and AdaBoost both predict using averages of trees, but the trees are constructed differently.
(a) Briefly describe the two algorithms: in particular, the differences in how the observations and variables considered in training within each node are selected or weighted.
(15 marks)
(b) Use the differences to explain:
(i) Why increasing the number of trees will not cause overfitting with random forests, but may cause overfitting with AdaBoost.
(5 marks)
(ii) Why it is easier to take advantage of parallel computing for random forests than for AdaBoost.
(5 marks)
(iii) Why the individual trees in random forests are typically grown to full depth, but those in AdaBoost are typically shallow.
(5 marks)
(30 marks total)
3. Consider the multilayer neural network described by the following R keras code:
model <- keras_model_sequential() %>%
  layer_conv_2d(filters = 32, kernel_size = c(3,3),
                activation = 'relu', input_shape = input_shape) %>%
  layer_conv_2d(filters = 64, kernel_size = c(3,3),
                activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_dropout(rate = 0.25) %>%
  layer_flatten() %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = num_classes, activation = 'softmax')
(a) The layer_conv_2d() function declares a convolutional layer. What is a convolutional layer, and what do the arguments filters = 32, kernel_size = c(3,3) mean?
(15 marks)
(b) How many trainable parameters does this layer have?
(4 marks)
(c) What does layer_max_pooling_2d do, and what does pool_size mean?
(5 marks)
(d) What does layer_dropout(rate = 0.5) do?
(3 marks)
(e) What does layer_dense(units = 128, activation = 'relu') do?
(3 marks)
(30 marks total)
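Note on question 3(b): the count for a 2D convolutional layer can be sketched with the standard formula below. It assumes one bias term per filter (the keras default) and writes C_in for the number of input channels, a symbol introduced here because the value of input_shape is not given in the question.

```latex
% Trainable parameters of a 2D convolutional layer with
% k_h x k_w kernels, C_in input channels, F filters, one bias per filter
\[
  N_{\text{params}} = (k_h \, k_w \, C_{\text{in}} + 1)\, F
\]
% First layer of the model (filters = 32, kernel_size = c(3,3)):
\[
  N_1 = (3 \cdot 3 \cdot C_{\text{in}} + 1) \cdot 32 = 288\, C_{\text{in}} + 32
\]
% Second layer (filters = 64, 32 input channels from the first layer):
\[
  N_2 = (3 \cdot 3 \cdot 32 + 1) \cdot 64 = 289 \cdot 64 = 18496
\]
```

For example, a single-channel (greyscale) input would give 320 parameters for the first layer, and a three-channel input would give 896.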
4. Briefly describe at least one way regularization is accomplished in each of:
(a) subset selection for linear regression
(5 marks)
(b) random forests
(5 marks)
(c) neural networks
(5 marks)
(d) boosted trees
(5 marks)
(20 marks total)
5. Last year, a Stanford University psychologist, Michal Kosinski, and colleagues published a paper on neural network analysis of images scraped from a dating website. He found that in this dataset the network could predict sexual orientation from one image with 81% accuracy for men and 71% for women, and that this was better than the accuracy of untrained high-speed human classification using workers on the Amazon Mechanical Turk website.
a) How would the use of images from a dating website be expected to bias the estimates of accuracy?
(5 marks)
b) The accuracy figures given are for a sample that is 50% heterosexual and 50% homosexual. Suppose that the sensitivity and specificity of the classifier in men are both 0.8. What are the positive and negative predictive values for homosexuality in a population that is 5% homosexual and 95% heterosexual?
(5 marks)
c) The researchers say that their aim was to publicise the risks of automated identification of sexual orientation. Given this aim, briefly discuss the ethical justification of the research with reference to the ethical principles of beneficence, respect for persons, and justice.
(15 marks)
(25 marks total)
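Note on question 5(b): the predictive values follow from Bayes' theorem. The sketch below uses the figures stated in the question; the symbols Se (sensitivity), Sp (specificity) and p (prevalence) are notation introduced here, not taken from the paper.

```latex
% Se = sensitivity = 0.8, Sp = specificity = 0.8, p = prevalence = 0.05
\[
  \mathrm{PPV} = \frac{Se \cdot p}{Se \cdot p + (1 - Sp)(1 - p)}
               = \frac{0.8 \times 0.05}{0.8 \times 0.05 + 0.2 \times 0.95}
               = \frac{0.04}{0.23} \approx 0.174
\]
\[
  \mathrm{NPV} = \frac{Sp\,(1 - p)}{Sp\,(1 - p) + (1 - Se)\,p}
               = \frac{0.8 \times 0.95}{0.8 \times 0.95 + 0.2 \times 0.05}
               = \frac{0.76}{0.77} \approx 0.987
\]
```

With a balanced 50/50 sample, the same formulas give a positive and a negative predictive value of 0.8 each, which shows how strongly the predictive values depend on prevalence.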