STATISTICS Data Science Practice
NOTE: Calculators are permitted. There are 5 questions, with a total of 125 marks.

STATS 369

1. In the following R code

    ms <- src(MonetDBLite::src_monetdblite("WORDS/DB"))
    glove <- tbl(ms, "glove")
    db_words <- copy_to(ms, current_words)
    word_mat <- db_words %>%
      inner_join(glove, by = "word") %>%
      select(-word) %>%
      as.matrix()

(a) What does inner_join do? (5 marks)
(b) What does select(-word) do? (5 marks)
(c) At what point does the SQL query involving the inner join get run? (5 marks)
(d) Give an advantage and a disadvantage of working with data stored in a database rather than in memory. (5 marks)
(20 marks total)

2. Random forests and AdaBoost both predict using averages of trees, but the trees are constructed differently.
(a) Briefly describe the two algorithms: in particular, the differences in how the observations and variables considered in training within each node are selected or weighted. (15 marks)
(b) Use the differences to explain:
(i) Why increasing the number of trees will not cause overfitting with random forests, but may cause overfitting with AdaBoost. (5 marks)
(ii) Why it is easier to take advantage of parallel computing for random forests than for AdaBoost. (5 marks)
(iii) Why the individual trees in random forests are typically grown to full depth, but those in AdaBoost are typically shallow. (5 marks)
(30 marks total)

3.
Consider the multilayer neural network described by the following R keras code

    model <- keras_model_sequential() %>%
      layer_conv_2d(filters = 32, kernel_size = c(3,3), activation = 'relu',
                    input_shape = input_shape) %>%
      layer_conv_2d(filters = 64, kernel_size = c(3,3), activation = 'relu') %>%
      layer_max_pooling_2d(pool_size = c(2, 2)) %>%
      layer_dropout(rate = 0.25) %>%
      layer_flatten() %>%
      layer_dense(units = 128, activation = 'relu') %>%
      layer_dropout(rate = 0.5) %>%
      layer_dense(units = num_classes, activation = 'softmax')

(a) The layer_conv_2d() function declares a convolutional layer. What is a convolutional layer, and what do the arguments filters = 32, kernel_size = c(3,3) mean? (15 marks)
(b) How many trainable parameters does this layer have? (4 marks)
(c) What does layer_max_pooling_2d do, and what does pool_size mean? (5 marks)
(d) What does layer_dropout(rate = 0.5) do? (3 marks)
(e) What does layer_dense(units = 128, activation = 'relu') do? (3 marks)
(30 marks total)

4. Briefly describe at least one way regularization is accomplished in each of
(a) subset selection for linear regression (5 marks)
(b) random forests (5 marks)
(c) neural networks (5 marks)
(d) boosted trees (5 marks)
(20 marks total)

5. Last year, a Stanford University psychologist, Michal Kosinski, and colleagues published a paper on neural network analysis of images scraped from a dating website. He found that in this dataset the network could predict sexual orientation from one image with 81% accuracy for men and 71% for women, and that this was better than the accuracy of untrained high-speed human classification using workers on the Amazon Mechanical Turk website.
(a) How would the use of images from a dating website be expected to bias the estimates of accuracy? (5 marks)
(b) The accuracy figures given are for a sample that is 50% heterosexual and 50% homosexual. Suppose that the sensitivity and specificity of the classifier in men are both 0.8.
What are the positive and negative predictive values for homosexuality in a population that is 5% homosexual and 95% heterosexual? (5 marks)
(c) The researchers say that their aim was to publicise the risks of automated identification of sexual orientation. Given this aim, briefly discuss the ethical justification of the research with reference to the ethical principles of beneficence, respect for persons, and justice. (15 marks)
(25 marks total)
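
The arithmetic behind the predictive-value part of question 5 can be sketched in R. This is only an illustrative sketch, not a model answer: it assumes that positive and negative predictive values are obtained from sensitivity, specificity and prevalence by Bayes' theorem, and the helper name predictive_values is hypothetical (it does not come from the exam or from any package).

```r
# Hypothetical helper (not part of the exam or any package): predictive values
# from sensitivity, specificity and prevalence, using Bayes' theorem.
predictive_values <- function(sensitivity, specificity, prevalence) {
  true_pos  <- sensitivity * prevalence              # P(test +, condition present)
  false_pos <- (1 - specificity) * (1 - prevalence)  # P(test +, condition absent)
  true_neg  <- specificity * (1 - prevalence)        # P(test -, condition absent)
  false_neg <- (1 - sensitivity) * prevalence        # P(test -, condition present)
  c(ppv = true_pos / (true_pos + false_pos),
    npv = true_neg / (true_neg + false_neg))
}

# Values stated in the question: sensitivity = specificity = 0.8, prevalence = 5%.
pv <- predictive_values(sensitivity = 0.8, specificity = 0.8, prevalence = 0.05)
print(round(pv, 3))
```

Under these assumptions the positive predictive value is 0.04 / 0.23, roughly 0.17, and the negative predictive value is 0.76 / 0.77, roughly 0.99. Setting prevalence = 0.5 in the same helper gives 0.8 for both, matching the balanced sample described in the question.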