FIT3152 Mock eExam with brief Answers/Marking Guide

Page 1
R Coding: 10 Marks

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

The following R code is run:

Petal.cor <- as.data.frame(as.table(by(iris, iris[5], function(df) cor(df[3], df[4]))))
colnames(Petal.cor) <- c("Species", "Petal.cor")
Sepal.cor <- as.data.frame(as.table(by(iris, iris[5], function(df) cor(df[1], df[2]))))
colnames(Sepal.cor) <- c("Species", "Sepal.cor")
iris.cor <- merge(Sepal.cor, Petal.cor, by = "Species")
iris.cor[,2] = round(iris.cor[,2], digits = 3)
iris.cor[,3] = round(iris.cor[,3], digits = 3)
write.csv(iris.cor, file = "Iris.cor.csv", row.names = FALSE)

Describe the action and outputs of the R code.

- Calculates the correlation of petal length and width, by species. [1 Mark]
- Calculates the correlation of sepal length and width, by species. [1 Mark]
- Renames the columns and merges the two data frames. [1 Mark]
- Rounds the values to 3 decimal places. [1 Mark]
- Saves the result as a CSV file. [1 Mark]

Describe the action of each function or the purpose of each variable in the space provided.

as.data.frame: coerces the previous output into a data frame. [1 Mark]
merge: merges data frames using a common column as an index. [1 Mark]
by: applies a function to a data frame split by factors. [1 Mark]
df: the temporary data frame passed to the anonymous function. [1 Mark]
round: rounds the data to a given number of decimal places (digits). [1 Mark]

(10 marks)
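As a quick check of the description above, the same table can be produced with split() and sapply() instead of by(). This is an illustrative sketch, not the exam's code; the variable names simply mirror those used above.

# Per-species correlations via split()/sapply(), then the same round/save steps.
sepal.cor <- sapply(split(iris, iris$Species),
                    function(df) cor(df$Sepal.Length, df$Sepal.Width))
petal.cor <- sapply(split(iris, iris$Species),
                    function(df) cor(df$Petal.Length, df$Petal.Width))
iris.cor <- data.frame(Species   = names(sepal.cor),
                       Sepal.cor = round(sepal.cor, 3),
                       Petal.cor = round(petal.cor, 3),
                       row.names = NULL)
write.csv(iris.cor, file = "Iris.cor.csv", row.names = FALSE)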
Page 2
Regression: 10 Marks

A subset of the 'diamonds' data set from the R package 'ggplot2' was created. The data set reports price, size (carat) and quality (cut, color and clarity) information as well as specific measurements (x, y and z). The first 6 rows are printed below.

> head(dsmall)
      carat       cut color clarity depth table price    x    y    z
46434  0.59 Very Good     H    VVS2  61.1    57  1771 5.39 5.48 3.32
35613  0.30      Good     I     VS1  63.3    59   473 4.20 4.23 2.67
43173  0.42   Premium     F      IF  62.2    56  1389 4.85 4.80 3.00
11200  0.95     Ideal     H     SI1  61.9    56  4958 6.31 6.35 3.92
37189  0.32   Premium     D    VVS1  62.0    60   973 4.40 4.37 2.72
45569  0.52   Premium     E     VS2  60.7    58  1689 5.17 5.21 3.15

The least squares regression of log(price) on log(size) and color is given below. Note that 'log' in this context means the natural logarithm, log_e(x). Based on this output, answer the following questions.

> library(ggplot2)
> set.seed(9999)                                      # random seed
> dsmall <- diamonds[sample(nrow(diamonds), 1000), ]  # sample of 1000 rows
> attach(dsmall)
> contrasts(color) = contr.treatment(7)
> d.fit <- lm(log(price) ~ log(carat) + color)
> summary(d.fit)

Call:
lm(formula = log(price) ~ log(carat) + color)

Residuals:
     Min       1Q   Median       3Q      Max
-0.97535 -0.16001  0.01106  0.15500  0.99937

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.61356    0.02289 376.259  < 2e-16 ***
log(carat)   1.74075    0.01365 127.529  < 2e-16 ***
color2      -0.06717    0.02833  -2.371   0.0179 *
color3      -0.05469    0.02783  -1.965   0.0496 *
color4      -0.07139    0.02770  -2.578   0.0101 *
color5      -0.21255    0.02973  -7.148  1.7e-12 ***
color6      -0.32995    0.03175 -10.393  < 2e-16 ***
color7      -0.50842    0.04563 -11.143  < 2e-16 ***
---
Residual standard error: 0.2393 on 992 degrees of freedom
Multiple R-squared: 0.9446, Adjusted R-squared: 0.9443
F-statistic: 2418 on 7 and 992 DF, p-value: < 2.2e-16

> contrasts(color)
  2 3 4 5 6 7
D 0 0 0 0 0 0
E 1 0 0 0 0 0
F 0 1 0 0 0 0
G 0 0 1 0 0 0
H 0 0 0 1 0 0
I 0 0 0 0 1 0
J 0 0 0 0 0 1

Page 3

(a) Write down the regression equation predicting log(price) as a function of size and color.

log(price) = 8.61 + 1.74 * log(carat) + color(i), where color(i) is the contrast coefficient for colour i in (D, E, F, G, H, I, J), with color(D) = 0. [1 Mark]

(b) Explain the different data types present in the variables carat and color. What is the effect of this difference on the regression equation?

carat is a numerical variable, so it is treated as a number. [1 Mark] color is a factor: it is included in the regression equation as a contrast, whereby each level is estimated individually. [1 Mark]

(c) What is the predicted price for a diamond of 1 carat of color H?

log(price) = 8.61 + 1.74 * log(carat) + color(H)
           = 8.61 + 1.74 * log(1) - 0.21
           = 8.61 + 0 - 0.21
           = 8.40
price = e^8.40 ≈ $4447.07 [1 Mark]

(d) Which colour diamonds can be reliably assumed to have the highest value? Explain your reasoning. How sure can you be?

Color D diamonds have the highest value, since the coefficient for this (base) level is 0 and all the other colour coefficients are negative. [1 Mark] For surety, use the significance of the regression overall (***), i.e. better than 0.0001. [1 Mark]

(e) Which colour diamonds have the lowest value? How reliable is the evidence? Explain your reasoning.

Color J diamonds have the lowest value (coefficient = -0.51, the most negative). [1 Mark] The evidence is reliable: the coefficient is significant at better than 0.0001 (***). [1 Mark]

(f) Comment on the reliability of the model as a whole, giving reasons.

Reliability of the model is high overall: Multiple R-squared = 0.94, the p-value is very small, and the median residual is close to 0. [1 Mark each, up to 2 Marks]
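The prediction in part (c) can be checked numerically from the fitted coefficients. A minimal sketch, assuming the d.fit object from the output above; colour H is level 5 of the treatment contrast, so its coefficient is named "color5".

# Predicted price of a 1-carat, colour-H diamond from the fitted model.
b  <- coef(d.fit)                                   # named coefficient vector
lp <- b["(Intercept)"] + b["log(carat)"] * log(1) + b["color5"]
unname(exp(lp))      # about 4452 using full-precision coefficients
exp(8.61 - 0.21)     # about 4447, matching the rounded working above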
Page 4
Networks: 10 Marks

The social network of a group of friends (numbered 1-7) is drawn below.

[Network diagram not reproduced; its edges are those of the adjacency matrix in part (e): 1-2, 2-4, 3-4, 4-6, 5-6, 5-7, 6-7.]

(a) Calculate the betweenness centrality for nodes 4 and 6.

Node 4 betweenness = 11. [1 Mark] (It lies on the following geodesics: 1-5, 1-6, 1-7, 2-5, 2-6, 2-7, 3-5, 3-6, 3-7, 1-3, 2-3.)
Node 6 betweenness = 8. [1 Mark] (It lies on: 1-5, 1-7, 2-5, 2-7, 3-5, 3-7, 4-5, 4-7.)

(b) Calculate the closeness centrality for nodes 4 and 6.

Node 4 closeness = 1/9. [1 Mark] (The sum of shortest-path distances to the other nodes is 2 + 1 + 1 + 1 + 2 + 2 = 9.)
Node 6 closeness = 1/10. [1 Mark] (3 + 2 + 2 + 1 + 1 + 1 = 10.)

(c) Calculate the degree of nodes 4 and 6.

|4| = 3 [1 Mark]
|6| = 3 [1 Mark]

(d) Giving reasons based on your results in parts a-c, which node is most central in the network?

Node 4 is most central. [1 Mark] It has the greatest betweenness centrality and the greatest closeness centrality. [1 Mark]

(e) Write down the adjacency matrix for the network.

  1 2 3 4 5 6 7
1 0 1 0 0 0 0 0
2 1 0 0 1 0 0 0
3 0 0 0 1 0 0 0
4 0 1 1 0 0 1 0
5 0 0 0 0 0 1 1
6 0 0 0 1 1 0 1
7 0 0 0 0 1 1 0

Correct form [1 Mark], correct values [1 Mark].
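The hand calculations in parts (a)-(c) can be verified with the igraph package. A sketch, assuming the edge list given by the adjacency matrix in part (e); note that igraph's closeness() returns the same 1/(sum of geodesic distances) values, and betweenness() the same unnormalised counts, as used above.

# Centralities for the friendship network via igraph.
library(igraph)
g <- make_graph(c(1,2, 2,4, 3,4, 4,6, 5,6, 5,7, 6,7), directed = FALSE)
betweenness(g)[c(4, 6)]   # 11 and 8
closeness(g)[c(4, 6)]     # 1/9 ≈ 0.111 and 1/10 = 0.100
degree(g)[c(4, 6)]        # 3 and 3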
Page 5
Naïve Bayes: 4 Marks

(a) Use the data below and Naïve Bayes classification to predict whether the following test instance will be happy or not.

Test instance: (Age Range = young, Occupation = professor, Gender = F, Happy = ?)

ID  Age Range    Occupation  Gender  Happy
1   Young        Tutor       F       Yes
2   Middle-aged  Professor   F       No
3   Old          Tutor       M       Yes
4   Middle-aged  Professor   M       Yes
5   Old          Tutor       F       Yes
6   Young        Lecturer    M       No
7   Middle-aged  Lecturer    F       No
8   Old          Tutor       F       No

p(Happy = yes) = 0.5, p(Happy = no) = 0.5 [1 Mark]

Class  P(Cj)  P(young|Cj)  P(professor|Cj)  P(F|Cj)  P(Cj) × P(A1|Cj) × P(A2|Cj) × P(A3|Cj)
yes    0.5    0.250        0.250            0.500    0.016
no     0.5    0.250        0.250            0.750    0.023

Correct calculations [1 Mark]. Since 0.023 > 0.016, classify as Happy = No. [1 Mark or H]

(b) Use the complete Naïve Bayes formula to evaluate the confidence of predicting Happy = yes, based on the same attributes as the previous question: (Age Range = young, Occupation = professor, Gender = F).

Numerator:   P(yes) × P(young|yes) × P(professor|yes) × P(F|yes) = 0.5 × 0.250 × 0.250 × 0.500 = 0.016
Denominator: P(young) × P(professor) × P(F) = 0.250 × 0.250 × 0.625 = 0.039

So p(yes|attributes) = 0.016 / 0.039 ≈ 0.41. [1 Mark or H]

Page 6
Visualisation: 6 Marks

A World Health study is examining how life expectancy varies between men and women in different countries and at different times in history. The table below shows a sample of the data that has been recorded. There are approximately 15,000 records in all.

Country      Year of Birth  Gender  Age at Death
Australia    1818           M        9
Afghanistan  1944           F       40
USA          1846           F       12
India        1926           F        6
China        1860           F       32
India        1868           M       54
Australia    1900           F       37
China        1875           F       75
England      1807           M       15
France       1933           M       52
Egypt        1836           M       19
USA          1906           M       58

Using one of the graphic types from the Visualization Zoo (see formulae and references for a list of types), suggest a suitable graphic to help the researcher display as many variables as clearly as possible. Explain your decision. Which graph elements correspond to the variables you want to display?

Appropriate main graphic [1 Mark] with explanation [1 Mark]. Mapping of variables to attributes in the graphic and/or data reduction (summary) as appropriate, with explanation [1 Mark each, up to 4 Marks].

For example, one approach would be a heat map with time intervals on the x axis (perhaps every 10 or 50 years, depending on range) and continents or countries on the y axis (depending on how many countries there are). Each cell could then be coloured for average age of death. You could either have two heat maps (male/female) or interleave cells so that m/f for each time period were adjacent.

Page 7
Decision Trees: 10 Marks

Eight university staff completed a questionnaire on happiness. The results are given below. A decision tree was generated from the data.

[Decision tree diagram not reproduced.]

ID  Age Range    Occupation  Gender  Happy
1   Young        Tutor       F       Yes
2   Middle-aged  Professor   F       No
3   Old          Tutor       M       Yes
4   Middle-aged  Professor   M       Yes
5   Old          Tutor       F       Yes
6   Young        Lecturer    M       No
7   Middle-aged  Lecturer    F       No
8   Old          Tutor       F       No

(a) Using the decision tree generated from the data provided, and assuming a required confidence level greater than 60% to classify as 'Happy', what is the predicted classification for the following instances?

Instance 1: (Age Range = Young, Occupation = Professor, Gender = F, Happy = ?)
Instance 2: (Age Range = Old, Occupation = Professor, Gender = F, Happy = ?)

Instance 1: Happy = No, because the confidence for Happy = Yes is 50%, which is less than the required confidence level. [1 Mark]
Instance 2: Happy = Yes, because the confidence for Happy = Yes is 66.67%, which is greater than the required confidence level. [1 Mark]

(b) Is it possible to generate a 100% accurate decision tree using this data? Explain your answer.

No. Instances 5 and 8 have identical decision attributes but belong to different classes (Old, Tutor, F = Yes; Old, Tutor, F = No), so a 100% accurate decision tree could not be generated from this data. (Or equivalent.) [1 Mark]

Page 8

(c) Explain how the concept of entropy is used in some decision tree algorithms.

Information gain is used in the ID3 algorithm to determine which attribute to split on. Information gain measures the reduction in entropy when splitting on a specific attribute; the algorithm chooses the attribute that gives the greatest reduction in entropy, i.e. the greatest information gain. (Or something similar.) [1 Mark]

(d) Do you think entropy was used to generate the decision tree above? Explain your answer.

The Occupation attribute appears more homogeneous in terms of the class attribute Happy than the Age Range attribute. (Or similar.) [1 Mark] Therefore, no, entropy was not used. (Or similar.) [1 Mark]

(e) What is the entropy of "Happy"?

With a 50:50 Yes:No split, Entropy(Happy) = 1, by inspection. [1 Mark]

(f) What is the information gain after the first node of the decision tree (Age Range) has been introduced?

Entropy of each Age Range branch:
Entropy(Young: 1 Yes, 1 No)       = -(1/2) log2(1/2) - (1/2) log2(1/2) = 1
Entropy(Middle-aged: 1 Yes, 2 No) = -(1/3) log2(1/3) - (2/3) log2(2/3) = 0.92
Entropy(Old: 2 Yes, 1 No)         = -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.92 [1 Mark]

Gain(Happy, Age Range) = Entropy(Happy) - (2/8 × 1 + 3/8 × 0.92 + 3/8 × 0.92)
                       = 1 - 0.94 = 0.06 [1 Mark]

(g) Explain why some decision tree algorithms are referred to as greedy algorithms.

Decision tree algorithms always choose the best option to branch on at each step, without taking into account future choices. They are never able to backtrack in order to improve the final solution. [1 Mark]
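The calculations in parts (e) and (f) are easy to reproduce in base R. A sketch using an illustrative entropy() helper (not a built-in function), with the questionnaire data coded as vectors:

# Entropy of the class attribute, and information gain of the Age Range split.
entropy <- function(x) {
  p <- prop.table(table(x))     # class proportions
  -sum(p * log2(p))
}
happy <- c("Yes", "No", "Yes", "Yes", "Yes", "No", "No", "No")
age   <- c("Young", "Middle", "Old", "Middle", "Old", "Young", "Middle", "Old")

entropy(happy)                  # 1, as in part (e)
w <- prop.table(table(age))     # branch weights 2/8, 3/8, 3/8
entropy(happy) - sum(w * sapply(split(happy, age), entropy))   # about 0.06, as in part (f)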
Page 9
ROC and Lift: 10 Marks

The following table shows the outcome of a classification model for customer data. The table lists customers by code and provides the following information: the model confidence of a customer buying/not buying a new product (confidence-buy), and whether in fact the customer did or did not buy the product (buy = 1 if the customer purchased the product, buy = 0 if not). The last two columns give the classification at a 20% and an 80% confidence threshold.

customer  confidence-buy  buy-not-buy  20%+  80%+
c1        0.9             1            1     1
c2        0.8             1            1     1
c3        0.7             0            1     0
c4        0.7             1            1     0
c5        0.6             1            1     0
c6        0.5             1            1     0
c7        0.4             0            1     0
c8        0.4             1            1     0
c9        0.2             0            1     0
c10       0.1             0            0     0

(a) Calculate the True Positive Rate and the False Positive Rate when a confidence level of 20% is required for a positive classification.

TP = 6, FP = 3, TN = 1, FN = 0. [All correct: 1 Mark]
TPR = 6/(6+0) = 1; FPR = 3/(3+1) = 0.75. [All correct: 1 Mark]

(b) Calculate the True Positive Rate and the False Positive Rate when a confidence level of 80% is required for a positive classification.

TP = 2, FP = 0, TN = 4, FN = 4. [All correct: 1 Mark]
TPR = 2/(2+4) = 0.33; FPR = 0/(0+4) = 0. [All correct: 1 Mark]

(c) The ROC chart for the previous question is shown below. Comment on the quality of the model overall. Give a single measure of classifier performance.

[ROC chart not reproduced.]

The area under the ROC curve: exact value = 0.83 (accept 0.7-0.9). [1 Mark] The classifier is good. [1 Mark]

(d) What is the lift value if you target the top 40% of customers that the classifier is most confident of?

Overall, P(true) = 6/10; for the top 40% of customers (c1-c4), P(true) = 3/4. [1 Mark]
Lift = (3/4) / (6/10) = 1.25 [1 Mark or H]

Page 10

(e) Explain what the value of lift means in the previous question.

Lift is the increase in the response rate over random selection [1 Mark] obtained by targeting the customers the model is most confident about. [1 Mark]

Clustering: 10 Marks

(a) What does the 'k' refer to in k-means clustering? Who/what determines the value of k?

k is the number of clusters. [1 Mark] It is pre-defined by the user before running the algorithm. [1 Mark]

(b) Describe the steps involved in k-means clustering.

1. Define the number of clusters required, k. [1 Mark]
2. Declare k centroids. [1 Mark]
3. Assign each data point to the closest centroid.
4. Re-calculate the centroids. [1 Mark]
5. Repeat steps 3 and 4 until the cluster centroids do not change. [1 Mark]

(c) Are clustering algorithms supervised or unsupervised learning algorithms? Explain.

They are unsupervised algorithms, designed to find groupings of similar instances. [1 Mark] Unlike classification, there is no 'class' attribute that can be used to help determine the clusters. [1 Mark]

(d) Is the k-means clustering algorithm a partitional or hierarchical clustering algorithm? Explain your answer.

k-means is partitional. [1 Mark] There is no hierarchy from which clusters can be chosen, and the number of clusters cannot be changed once set. [1 Mark]
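The procedure in part (b) is exactly what base R's kmeans() implements. A minimal sketch on the iris measurement columns, with k = 3 chosen purely for illustration:

# k-means with a user-chosen k; steps 3 and 4 are iterated until convergence.
set.seed(1)                               # initial centroids are random
km <- kmeans(iris[, 1:4], centers = 3)
km$centers                                # final cluster centroids
table(km$cluster)                         # cluster sizes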
Page 11
Text Analytics: 10 Marks

(a) Explain what is meant by the 'bag of words' approach to text mining.

Each document in the collection is assumed to be just a set of words, and it is the entire collection of words that is used in the analysis. [1 Mark] The semantics, or meaning, of the text in the documents is not considered in the 'bag of words' approach. [1 Mark]

(b) What is the main problem associated with the bag of words approach? Provide an example.

The main problem is that semantics are not considered, so two documents that mean quite different things, but contain the same words, will be considered similar. [1 Mark] Example:
- while licking their ice creams, the children chased the dog
- the dog chased the children and licked their ice creams
[1 Mark]

(c) Describe an application where text mining could be used, giving an example of how it would be applied.

Grouping articles by similar content (or similar). [1 Mark] For example, job applications, tweets, emails, etc. [1 Mark]

(d) Apply the five main steps required to pre-process text documents for analysis to the corpus below. Write your processed documents in the space provided.

Doc1 = { The choir sang loudly. }
Doc2 = { The boys were singing in church. }
Doc3 = { The boy asked to sing a song. }

Doc1: (choir, sing-, loud-)
Doc2: (boy, sing-, church)    [Tokenising and stop-word removal: 1 Mark]
Doc3: (boy, ask, sing-, song) [Stemming and overall format: 1 Mark]

(e) Construct the term-document frequency matrix for the processed text documents above. [2 Marks]

Matrix in the correct format, words = columns and docs = rows [1 Mark]; indicators correct [1 Mark or H].

      ask  boy  choir  church  loud-  sing-  song
Doc1   0    0     1      0       1      1      0
Doc2   0    1     0      1       0      1      0
Doc3   1    1     0      0       0      1      1
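The five pre-processing steps in part (d) can be reproduced with the tm package (stemming also needs SnowballC). This is a sketch only: a suffix stemmer will not conflate irregular forms such as 'sang' and 'sing', so the resulting matrix can differ slightly from the hand-worked answer above.

# Pre-processing the three-document corpus and building a document-term matrix.
library(tm)
docs   <- c("The choir sang loudly.",
            "The boys were singing in church.",
            "The boy asked to sing a song.")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))      # case folding
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english")) # stop-word removal
corpus <- tm_map(corpus, stemDocument)                      # stemming
corpus <- tm_map(corpus, stripWhitespace)
dtm <- DocumentTermMatrix(corpus)                           # docs = rows, terms = cols
inspect(dtm)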
Page 12
Ensemble Methods: 4 Marks

(a) Describe the main similarities of the three ensemble classifiers studied (bagging, boosting and random forests).

- Create multiple data sets by resampling or cloning. [1 Mark]
- Build multiple classifiers. [1 Mark]
- Combine the classifiers, by averaging or voting. [1 Mark each, up to a total of 2]

(b) How do boosting and random forests differ from bagging?

Boosting re-weights instances to favour hard-to-classify cases. [1 Mark] Random forests also vary the attributes used in the samples. [1 Mark]

Dirty Data: 6 Marks

The table below is an extract from the list of books in the British Library. Identify the instances of dirty data present, stating the way in which the data is dirty.

[Table of British Library records not reproduced.]

Most of these are instances of incorrect data, although many records are also incomplete. [1 Mark each, up to a maximum of 6] For example: 1 = incorrect/duplicate (publisher and place in the same cell); 2 = incorrect/duplicate, etc.; 3 = incorrect/inaccurate (abbreviation used for "Oxford"); 4 = incorrect/inaccurate, etc.

Page 13
Formulas and references

[Formula sheet and reference list not reproduced.]