Question Mark Out of A1 8 A2 12 A3 10 A4 6 A5 9 B1 8 B2 7 B3 10 C1 6 C2 12 C3 6 C4 6 TOTAL 100 Page 2 of 43 Note you will need to use Formula Sheet in this unit to answer the questions. Instructions Answer each question in the space provided. You can write in pen or pencil. Marks are indicated next to each question. The total mark for the exam is 100. Part A (45 marks in total) Question A.1 (1+1+1+1+2+1+1=8 marks) Consider the following set of numbers: -25, 2, 3, 8, 10, 14, 18, 21, 32. For each of the questions below, state your answer, showing working if necessary. (a) What is the median? 10 The given data set has already been ranked and the number of values is 9 which is odd, we can apply x(n+1)/2 where n = 9 to get the median (b) What is the 1st quartile? 2.5 The formula to be used here will be Qk = xp + q 4 (xp+1 − xp) p = floor((k(n+ 1))/4) q = (k(n+ 1)) mod 4 where n = 9, k = 1. So p = floor((1× (9 + 1))/4) = 2 q = (1× (9 + 1)) mod 4 = 2 Q1 = x2 + 2 4 (x3 − x2) = 2 + 2 4 × (3− 2) = 2.5 (c) What is the 3rd quartile? 19.5 Using the formula in Q(b), let k = 3, n = 9 p = floor((3× (9 + 1))/4) = 7 q = (3× (9 + 1)) mod 4 = 2 Q3 = x7 + 2 4 (x8 − x7) = 18 + 2 4 × (21− 18) = 19.5 Marks / 3 Page 3 of 43 (d) What is the interquartile range. 17 IQR = Q3 −Q1 = 19.5− 2.5 = 17 (e) Hence sketch a box-plot. Lay it out horizontally below. Be sure to mark the values of the various parts. upper and lower hinges at Q3 and Q1, median at 10, upper and lower whiskers at 2 and 32, one outlier at -25 Marks / 3 Page 4 of 43 (f) You are told the mean of the numbers is 9.222 and the mean of their square is 309.666. What is the sample standard deviation?√ 9/8 ∗ (309.666− 9.2222) = 15.896 Sample standard deviation will be calculated through Sx = √√√√ 1 n− 1 n∑ i=1 (xi − x¯2), so here, n = 9, 9∑ i=1 x2i = 9× 309.666, 9∑ i=1 xi = 9× 9.222. Then 9∑ i=1 (x2i − 2xix¯+ x¯2) = 9∑ i=1 x2i − 2x¯ 9∑ i=1 xi + 9∑ i=1 x¯2 = 9× 309.666− 2× 9.222× 9× 9.222 + 9× 9.2222 = 9× (309.666− 2× 9.2222 + 9.2222) = 9× (309.666− 9.2222) S2x = 1 8 9∑ i=1 (x2i − 2xix¯+ x¯2) = 1 8 × (9× (309.666− 9.2222)) = 9 8 × (309.666− 9.2222). So Sx = √ 9/8 ∗ (309.666− 9.2222) = 15.896 (g) If you only knew the mean and sample standard deviation of the sample, what does Chebyshev’s inequality tell you? That the number of points outside (9.222− k ∗ 15.896, 9.222 + k ∗ 15.896) is less than 9/k2. Chebyshev’s Inequality: P (µ − kσ < X < µ + kσ) ≥ 1 − 1 k2 , where µ is the mean and σ is the value of standard deviation. This inequality is used to describe at least 1− 1/k2 of the distribution’s values are within k standard deviations of the mean, or equivalently, no more than 1/k2 of the distribution’s values can be more than k standard deviations away from the mean. For example, 36 students’ average mark in an exam is 80, the standard deviation is 10, then we may know, the number of students whose mark is less than 50 (k = 3, 3 standard deviation away from the mean) is 4 (= 36× 1 32 ). Marks / 2 Page 5 of 43 Question A.2 (4+2+2+4=12 marks) Throughout this question, show your working and leave your answer in a clear from. Of those reporting to a medical clinic, 2% have medical condition Z. It is assumed that this figure of 2% is also the base rate across the population. There is a test for condition Z such that, for those patients who have condition Z, 85% will test positive; and for those patients who do not have condition Z, 25% will test positive. (a) If a patient tests positive, what is the probability that the patient has condition Z? p(Z,P ) = 1/50 ∗ 17/20; p(not Z, P ) = 49/50 ∗ 1/4 p(P ) = p(Z,P ) + p(not Z, P ) = 262/1000 p(Z|P ) = p(Z,P )/p(P ) = 1/50 ∗ 17/20/(262/1000) = 17/262 = 0.065 Let Z represent a patient has condition Z, P represent tests positive, so not Z means do not have condition Z, then we will calculate the probability as p(Z|P ). According to the Bayes theorem, p(Z|P ) = p(Z,P ) p(P ) = p(P |Z)p(Z)∑ x∈X p(P |x)p(x) where x = Z or x = not Z. So ∑ x∈X p(P |x)p(x) = p(P |Z)p(Z) + p(P |not Z)p(not Z). From the question, we know p(Z) = 0.02, p(not Z) = 0.98, p(P |Z) = 0.85, p(P |not Z) = 0.25. Apply these values into the formula: p(Z|P ) = p(P |Z)p(Z) p(P |Z)p(Z) + p(P |not Z)p(not Z) = 0.85× 0.02 0.85× 0.02 + 0.25× 0.98 ≈ 0.065 After some consideration, it is decided that the test gives too many false positives, and it is decided to modify the test as follows. The new test is simply to administer the original test twice, where it is assumed that these two tests give results that are independent of one another. A patient will be considered to have tested positive on the new test precisely in those cases where both tests on the original test return a positive result. (b) If a patient has condition Z, what is the probability that the patient will test positive on the new test? p(P |Z) = 17/20 and are independent, so simply multiple, p(PP |Z) = 17/20 ∗ 17/20 = 0.7225 Marks / 6 Page 6 of 43 (c) If a patient does not have condition Z, what is the probability that the patient will test positive on the new test? As above but using p(P |not Z) = 1/4, p(PP |not Z) = 1/16 (d) If a patient returns a positive result on this new test, what is the probability that the patient has condition Z? Build it up as done for part (a): p(Z,PP ) = 1/50 ∗ 17/20 ∗ 17/20; p(not Z, PP ) = 49/50 ∗ 1/16 p(PP ) = 1514/20000 p(Z|PP ) = 1/50 ∗ 17/20 ∗ 17/20/(1514/20000) = 289/1514 = 0.191 p(Z,PP ) = p(Z)p(PP |Z) = 1/50 ∗ 17/20 ∗ 17/20 p(not Z, PP ) = p(notZ)p(PP |not Z) = 49/50 ∗ 1/16 p(PP ) = p(Z)p(PP |Z) + p(not Z)p(PP |not Z) = 1/50 ∗ 17/20 ∗ 17/20 + 49/50 ∗ 1/16 = 1514/20000 p(Z|PP ) = p(Z)p(PP |Z) p(PP ) = 1/50 ∗ 17/20 ∗ 17/20 1514/20000 = 289 1514 ≈ 0.191 Marks / 6 Page 7 of 43 Question A.3 (2+3+3+2=10 marks) Consider the probability density func- tion given at the right, defined by p(x) = 1 2x : 0 ≤ x ≤ 1 0.5 : 1 ≤ x ≤ 2 0.25 : 2 ≤ x ≤ 3 0 : otherwise Consider the cumulative density func- tion P (x) corresponding to p(x), and the quantile function Q(p). (a) What is P (0.5) and Q(0.375)? Do not need a correct quantile function, can read off graph. For P (0.5), P (1) = 0.25, and geometrically, P (0.5) is a quarter of that, so P (0.5) = 0.0625. If you want to do it the long way you can integrate: P (x = 0.5) = ∫ x=0.5 −∞ p(y)dy = ∫ 0 −∞ p(y)dy + ∫ 0.5 0 p(y)dy = 0 + ∫ 0.5 0 1 2 ydy = [ 1 2 · 1 2 y2 ]0.5 0 = 1 16 = 0.0625 For Q(0.375): We want Q(p = 0.375) and p is an area under p(x). We note by geometry the area between 0 ≤ x < 1 is 0.25, so then we just need the x value in the second rectangle that increases the area to 0.375. The area is 0.375− 0.25 = 0.125. In 1 ≤ x < 2, the height of the curve is 0.5, so 0.125 = 0.5(x− 1)→ 0.25 = x− 1→ x = 1.25. (b) Derive the function for P (x). Have to do the intergral for the first part, for 0 < x < 1, and the rest we can just write down Marks / 2 Page 8 of 43 geometrically, since it will be linear (since we just integrate constants). P (x) = 0 : x ≤ 0 1 4x 2 : 0 < x ≤ 1 0.25 + 0.5(x− 1) = 0.5x− 0.25 : 1 ≤ x ≤ 2 0.75 + 0.25(x− 2) = 0.25x+ 0.25 : 2 < x < 3 1 : x ≥ 3 Here are the steps if you want to do all the integrals: P (x) = 0 , x < 0 0 + ∫ x 0 y 2 dy , 0 ≤ x < 1 0.25 + ∫ x 1 0.5dy , 1 ≤ x < 2 0.75 + ∫ x 2 0.25dy , 2 ≤ x < 3 1 , x ≥ 3 = 0 , x < 0[ 1 2 y2 2 ]x 0 , 0 ≤ x < 1 0.25 + [0.5y]x1 , 1 ≤ x < 2 0.75 + [0.25y]x2 , 2 ≤ x < 3 1 + 0 , x ≥ 3 = 0 , x < 0 x2 4 , 0 ≤ x < 1 0.5x− 0.25 , 1 ≤ x < 2 0.25x+ 0.25 , 2 ≤ x < 3 1 , x ≥ 3 (c) Hence give the quantile function Q(p) corresponding to p(x). This comes from re-arranging part (b). p = 0, x < 0 can ignore since next interval covers p = 0. p = x2 4 → x = √ 4p = 2 √ p, as 0 ≤ x < 1, 0 ≤ p < 14 . p = 0.5x− 0.25→ x = 2(p+ 0.25), as 1 ≤ x < 2, 14 ≤ p < 34 . p = 0.25x+ 0.25→ x = 4(p− 0.25), as 2 ≤ x < 3, 34 ≤ p < 1. So Q(p) = 2 √ p : p ≤ 0.25 1 + 2(p− 0.25) : 0.25 ≤ p ≤ 0.75 2 + 4(p− 0.75) : p ≥ 0.75 (d) Hence, or otherwise, write pseudo-code for an algorithm that will generate a sample from this distribution. Use an inverse sampler with Q(p) above. I.e. Define Q(p) as in (c) Sample x from p(x) as follows 1. sample u uniformly from [0, 1] 2. take x = Q(u). Marks / 8 Page 9 of 43 Question A.4 (2+2+2=6 marks) If E [X] = 1 and E [ X2 ] = 4, E [Y ] = 0 and E [ Y 2 ] = 1, and X and Y are independent, then: (a) Calculate E [ 2X2 + (X + 1)2 ] . = 2E [ X2 ] +E [ X2 + 2X + 1 ] = 2E [ X2 ] +E [ X2 ] + 2E [X] +E [1] = 2 ∗4 + 4 + 2∗1 + 1 = 15 (b) Calculate E [ (X + 1)(Y + 1)2 ] . = E [(X + 1)]E [ (Y + 1)2 ] = E [(X + 1)]E [ Y 2 + 2Y + 1 ] = (E [X] + E [1])(E [ Y 2 ] + 2E [Y ] + E [1]) = (1 + 1)(1 + 2 ∗ 0 + 1) = 4 (c) Calculate V [(X + 1)(Y + 1)]. (According to V [X] = E [ X2 ]− E [X]2) = E [ (X + 1)2(Y + 1)2 ]− E [(X + 1)(Y + 1)]2 = (4 + 2 ∗ 1 + 1)(1 + 2 ∗ 0 + 1)− (1 + 1)2(1)2 = 10 Marks / 6 Page 10 of 43 Question A.5 (3+3+3=9 marks) Consider the probability density function given by a mixture of two Gaussians with identical standard deviation σ, as p(x|ρ, µ1, µ2, σ) = ρN(x|µ1, σ) + (1− ρ)N(x|µ2, σ) where N(·|·) is the probability debsity function of a Gaussian. Thus the expected value of function f(x) under this distribution is given by Eρ,µ1,µ2,σ [f(x)] = ρEN(µ1,σ) [f(x)] + (1− ρ)EN(µ2,σ) [f(x)] where the two expected values on the right hand side are done using Gaussian distributions. (a) What is the mean of x for the mixture of two Gaussians? According to the second formula provided: E1,2 [f(x)] = ρE1 [f(x)] + (1− ρ)E2 [f(x)]. For the mean we use f(x) = x and note E1 [x] = µ1 and E2 [x] = µ2. So E1,2 [x] = ρµ1 + (1− ρ)µ2 (b) What is the mean of x2 for the mixture of two Gaussians? Now f(x) = x2 from the formula sheet V [g(x)] = E [ g(x)2 ]− E [g(x)]2, for g(x) = x→ V [x] = E [x2]− E [x]2 → E [x2] = V [x] + E [x]2 = σ2 + µ2. So, E1,2 [ x2 ] = ρE1 [ x2 ] + (1− ρ)E2 [ x2 ] = ρ(σ2 + µ21) + (1− ρ)(σ2 + µ22) = σ2 + ρµ21 + (1− ρ)µ22 Marks / 6 Page 11 of 43 (c) What is the variance for the mixture of two Gaussians? Recall V [x] = E [ x2 ]− E [x]2, so V1,2 [x] = E1,2 [ x2 ]− E1,2 [x]2 = σ2 + ρµ21 + (1− ρ)µ22 − (ρµ1 + (1− ρ)µ2)2 = σ2 + (ρ− ρ2)(µ1 − µ2)2 Marks / 3 Page 12 of 43 Part B (25 marks in total) Question B.1 (3+2+3=8 marks) You have data x distributed as Poisson with rate λ = 16, so x ∼ Pois(16). (a) Show how to use the central limit theorem to get an approximate value for p(10 ≤ x ≤ 20). Compute the approximate value, noting that the Z tables are only accurate to 2 decimal places. So approximately x ∼ N(16, 16). As shown in the figure, black dots are plotted using the Poisson distribution while the bule line plots the approxi- mate Normal distribution. Since CLT is applied, when we computing the approximate value, we will standard- ise the normal distribution to N(0, 1) by using the formula z = x−µσ . As the original distribution is a discrete one, value 10 will be in the middle of 9.5 and 10.5, 20 will be in the middle of 19.5 and 20.5. Therefore p(10 ≤ x ≤ 20) ≈ p(9.5 ≤ x ≤ 20.5) = p(−6.5/4 ≤ Z ≤ 4.5/4) for Z a standard normal. Thus the answer from the Z tables, which only give a few decimal places, 0.86−0.05 = 0.81. You won’t be penalised if use p(−6.0/4 ≤ Z ≤ 4/4) in the exam. (b) You have a sample of 10 values from this distribution, and compute its mean x. What is an approximate distribution for x? So approximately x ∼ N(16, 16/10). Since it is a Poisson distribution, E [x] = λ = 16, V [x] = λ = 16. Using CLT in the formula sheet, mean(µ) of the distribution is λ = 16, and variance (σ2) is λ = 16 as well. Then the sample mean’s approximate distribution: x ∼ N(µ, 1nσ2) where n = 10. (c) What are 95% confidence intervals for the mean x, according to this approximation? So approximately x ∼ N(16, 16/10). Using Z = 1.96 for confidence intervals, we get [16 − Marks / 5 Page 13 of 43 1.96 √ 16/10, 16 + 1.96 √ 16/10] which is [16− 2.48, 16 + 2.48]. Using the approximation distribution for x in (b): x ∼ N(16, 16/10), as 95% is required, we know α = 0.05→ α/2 = 0.025. Looking up the z-table in the formula sheet for z-values greater than zero, we need to find the z-value z0.025, for which the probability p = 1−α/2 = 1−0.025 = 0.975→ z0.025 = 1.96. So we get [16− 1.96√16/10, 16 + 1.96√16/10]. Marks / 3 Page 14 of 43 Question B.2 (2+5=7 marks) While IQ is considered to have a mean of 100 and standard deviation of 15. You expect students in your masters class will have a higher mean. (a) Given a sample of size 10, compute a one-sided 95% confidence interval in the form (−∞, I] for where the measured mean should lie. The mean has distribution N(100, 152/10). For the one-sided case, Z = 1.64. So the confi- dence interval is (−∞, 100 + 1.64 ∗ 15/√10] which is (−∞, 107.78] From the question, we know it is a normal distribution where µ = 100, σ = 15→ σ2 = 152. With the given sample whose size is 10 and CLT, then we have the mean as N(100, 152/10). For the one-sided case, we should find z0.5 by using the z-table for which the probability p = 1 − α = 1 − 0.05 = 0.95. So here z = 1.64, the confidence interval will be (−∞, 100 + 1.64 ∗ 15/√10]. (b) You get data from 10 students with the form [104, 120, 100, 112, 133, 138, 111, 118, 114, 118]. Note that the mean of the sample is 116.8 and the mean of the squares of the sample is 13765.8. Test the null hypothesis that the students’ IQ has mean 100. Without assuming you know the standard deviation, give the test statistic and the p-value for this data. Note the tables of statistics given at the back of the exam will not allow you to lookup the p-value precisely. Null hypothesis H0 : µ0 = 100 Alternative hypothesis HA : µ0 6= 100 sample standard deviation is S = √ (13765.8− 116.82) ∗ 10/9 = 11.7; since don’t know variance use Student-t; by null hypothesis, tα,9 = (116.8−100)∗ √ 10/S = 4.54 which is the test statistic; p-value is about 0.0005 (not much accuracy though, its a bit more); thus we reject the null hypothesis at the 0.1% level (could choose the 5% level i.e. α = 0.05, it is up to you to choose the decision threshold. In practice we don’t go higher than α = 0.05 and α should ideally be chosen before calculating p. If we set the threshold lower we can be more confident in our decision if the p-value is below this threshold) and conclude that the student’s IQ does not have mean of 100. The sample standard deviation can be calculated quickly by using the formula S = √√√√ n n− 1 ( 1 n n∑ i=1 x2i − ( 1 n n∑ i=1 xi)2 ) . Marks / 7 Page 15 of 43 Question B.3 (2+2+4+2=10 marks) You obtain paired data (X,Y ) with values ~x = [4.59, 4.60, 6.32, 4.85, 3.27, 5.92, 1.92, 6.90, 4.82, 5.39] and ~y = [2.89, 2.46, 3.28, 2.34, 2.11, 3.56, 1.77, 3.29, 2.46, 2.60]. The various sample means (us- ing the above data) are: x = 4.859 y = 2.677 x2 = 25.516 y2 = 7.460 xy = 13.670 (a) What is the correlation co-efficient between X and Y ? What does this tell you about X and Y ? r2xy = SS2XY SSXXSSY Y = (n(xy − x.y))2 n(x2 − x2)n(y2 − y2) = 0.784, so X and Y are highly correlated. (b) Fit a simple linear model to this data in the form Yˆ = β0 + β1X What are your estimates for β0 and β1? βˆ1 = SSXY SSXX = xy − x.y x2 − x2 = 0.348 βˆ0 = y − βˆ1x = 0.988 NB. data was generated with Y = 0.3 ∗X + 1.3 +√(0.3)Z. Marks / 4 Page 16 of 43 (c) What are the standard errors for β0 and β1? From the formula sheet, we have 1√ RSS n(n−2) X2 X2−X2 (βˆ0 − β0) ∼ Student− t(n− 2) 1√ RSS n(n−2) 1 X2−X2 (βˆ1 − β1) ∼ Student− t(n− 2) within which, the square-root terms are the corresponding standard errors (i.e., error in estimating β0, β1). Standard error of βˆ1 is √ RSS n(n− 2) 1 X2 −X2 ; Standard error of βˆ0 is √ RSS n(n− 2) X2 X2 −X2 . And RSS(β0, β1) = n∑ i=1 (yi − β0 − β1xi)2 = n∑ i=1 y2i − β0yi − β1xiyi − β0yi + β20 + β0β1xi − β1xiyi + β0β1xi + β21x2i = n∑ i=1 y2i − 2β0yi − 2β1xiyi + β20 + β21x2i + 2β0β1xi = n ( y2 − 2β0y − 2β1xy + β20 + β21x2 + 2β0β1x ) So RSS(βˆ0, βˆ1) = 0.631 and s.e.(βˆ0) = 0.325 and s.e.(βˆ1) = 0.064. (d) Test the hypothesis the β1 = 0. What is your test statistic and its p-value? What is the outcome of the test? The formula 1√ RSS n(n−2) 1 X2−X2 (βˆ1−β1) ∼ Student−t(n−2) will be used to test the hypothesis: H0 : β1 = 0 H1 : β1 6= 0 This is a linear regression problem, so t-test will be used, n = 10, so n− 2 = 8 tα,n−2 = 1 s.e.(βˆ1) (βˆ1 − β1) = 1 0.064 (0.348− 0) = 5.406 p-value is slightly more than 0.0005, so we reject null hypothesis. Marks / 6 Page 17 of 43 Part C (30 marks in total) Question C.1 (2+2+2=6 marks) You have a data set supplied as real-valued pairs (X,Y ) and you wish to regress X onto Y . You have 2 models: A: a 4 degree polynomial yˆ = 4∑ i=0 aix i B: a 20 degree polynomial yˆ = 20∑ i=0 aix i These questions can be answered based on the contents in Alexandria and One in ten rule. (a) Describe how the bias of models A and B differ. A has higher bias with 5 versus 21 parameters; generally, A should not give as good a fit to the training data (b) Describe how the variance of models A and B differ. B has higher variance because it has more parameters generally, B give as good a closer giit to the training data but it can overfit, and when it does the variance will be higher (c) If you had 100 data points in your sample, which of ther two models would you recom- mend? Justify your answer. by the rule of thumb, with 100 data points, you should fit about 100/10=10 parameters, so safer to go with model A Marks / 6 Page 18 of 43 Question C.2 (5+3+2+2=12 marks) (a) You wish to build a na¨ıve Bayes classifier regressing Booleans A, B and C onto the Boolean X. Someone has already counted the data for you to create frequency tables below: A=0 A=1 B=0 B=1 C=0 C=1 X=0 10 40 30 20 15 35 X=1 30 20 5 45 40 10 Construct probability tables as needed to specify the estimated na¨ıve Bayes classifier for the task. Then give the formula for the classifier and describe how it would be used. First p(X=0) = p(X=1) = 50/100 = 0.5. Then the tables for p(A|X), p(B|X) and p(C|X) respectively are created by normalising the above tables along the rows. A=0 A=1 B=0 B=1 C=0 C=1 X=0 10/50=0.2 40/50=0.8 30/50=0.6 20/50=0.4 15/50=0.3 35/50=0.7 X=1 30/50=0.6 20/50=0.4 5/50=0.1 45/50=0.9 40/50=0.8 10/50=0.2 Now using the unnormalised NBC formula we have p(X|A,B,C) ∝ p(X)p(A|X)p(B|X)p(C|X). To classify a vector (A = a,B = b, C = c) as being in class X = 1 or X = 0, we look up p(X = 1|A = a,B = b, C = c) and p(X = 0|A = a,B = b, C = c) and set the class label to the value of X that gives the highest probability. Marks / 5 Page 19 of 43 (b) Consider the probabilities p(A=0|X=0) and p(B=0|X=1). Compute their standard errors, making any assumptions as needed? What can you say about the resulting estimates? The RVs A and B are binary so treat as Bernoulli with parameter Θ. Now for Bernoulli RVs, the parameter estimate is Θˆ = 1n ∑n i=1 yi where yi correspond to n observations (for A or B) that can be 0 or 1. Now standard error is the defined to be the standard deviation of the sampling distribution of a statistic. From the formula above we see that Θˆ is equal to the sample mean for the Bernoulli RV. So by applying the the sample mean version of the CLT we see that the variance of the sampling distribution of the mean in this case is σ2/n = Θ(1−Θ)/n where Θ(1 − Θ) is the variance for a Bernoulli RV. Now standard deviation (aka standard error for the reason noted above) is the square root of the variance. So we have: s.e.(Θˆ) = σ√ n = √ Θˆ(1− Θˆ) n = √ Θˆ(1− Θˆ) 50 , n = 50 for X = 0 or X = 1. And so: s.e.(p(A = 0|X = 0)) = √ 0.2× 0.8 50 = 0.056 ≈ 6% error; s.e.(p(B = 0|X = 1)) = √ 0.1× 0.9 50 = 0.042 ≈ 4% error. So both estimates are quite poor. (c) Which would be better, the na¨ıve Bayesian classifier or the logistic regression classifier for this data set? Justify your answer. Its not really clear, since they have the same bias. But at a pinch since there is so few data, perhaps the NBC works better because it can be more robust with little data. (d) The first step of the k-means algorithm is to initialise the centroids. Describe a way this could be done, and why it is OK to use it. Many solutions. Pick a random point as first centroid. For the next k− 1 centroids: select 10 random points and select the point furtherest away from the current batch of centroids. It is OK to use this approach because there are many different possibilities to initialise and without any prior knowledge a random initialisation is a good starting point. Moreover, keeping the centroids away from each other initially will also probably lead to a better solution because if we start with them too close together we may end up with more than one centroid per actual cluster as we saw in the lecture. Marks / 7 Page 20 of 43 ↓Page 21 of 43 Question C.3 (6=6 marks) Consider the probability density function given below, defined by p(x) = 2 pi √ 1− (2x− 1)2 : 0 ≤ x ≤ 1 2 pi √ 1− (2x− 3)2 : 1 ≤ x ≤ 2 0 : otherwise This is two semi-circles side-by-side of radius 1/2, then scaled by 4/pi to get a PDF. Page 22 of 43 (a) Devise pseudo-code for a rejection sampler for this distribution. Note the maximum value is marked at 2pi . First we Choose pprop(x) = 1 2 on x ∈ [0, 2] and let q(x) = p(x). Now we need to find C by aligning the maxima of pprop(x) and Cq(x) to minimise the number of samples that will be rejected: 1 2 = Cq(x) at x = 1 2 , this is 1 2 = C · 2 pi → C = pi 4 . Now here is the pseudo-code: Choose pprop(x) = 1 2 on x ∈ [0, 2], C = pi4 and q(x) = p(x), sample x from p(x) as follows: 1. Sample x from pprop(x) 2. Sample U as uniform in [0, 1] 3. Reject x if U > Cq(x) pprop(x) = pi 4 q(x) 1 2 = pi 2 q(x) and return to step 1 4. Otherwise, accept x. Marks / 6 Page 23 of 43 Question C.4 (5+1=6 marks) You wish to build a decision tree to predict a three-valued variable X. The first two features to test are Booleans A and B. Someone has already counted the data for you to create frequency tables below: A=0 A=1 B=0 B=1 X=0 10 40 30 20 X=1 30 20 5 45 X=2 30 20 45 5 (a) Compute and report the quality measure for the attributes A and B using the informa- tion gain metric. Get conditional probabilities by normalising in columns: p(X|A) p(X|B) A=0 A=1 B=0 B=1 X=0 10/70 = 1/7 40/80 = 1/2 30/80 = 3/8 20/70 = 2/7 X=1 30/70 = 3/7 20/80 = 1/4 5/80 = 1/16 45/70 = 9/14 X=2 30/70 = 3/7 20/80 = 1/4 45/80 = 9/16 5/70 = 1/14 Then we need to find H(X|A) and H(X|B) as they determine quality of A and B in terms of information gain defined as H(X)−H(X|Y ) where Y can be A or B. From formula sheet: H(X|Y ) = ∑ y p(Y = y)H(X|Y = y) and H(X|Y = y) = E [ log2 1 p(X|Y = y) ] = ∑ i p(X = xi|Y = y) log2 1 p(X = xi|Y = y) So we need p(A = 0) = 70 150 = 7 15 p(A = 1) = 80 150 = 8 15 p(B = 0) = 80 150 = 8 15 p(B = 1) = 70 150 = 7 15 Page 24 of 43 and conditional probabilities given in table above. Plug conditional probs into RHS of H(X|A = 0), H(X|A = 1), H(X|B = 0), H(X|B = 1). Once you have these, plug the values into RHS of H(X|A) and H(X|B). This gives H(X|A) = 1.48 and H(X|B) = 1.23. Important Note!!!!! If your calculator cannot compute log2(x) directly, then use: log2(x) = log10(x) log10(2) or log2(x) = loge(x) loge(2) (b) Hence say which attribute is recommended to use at the root of the tree? In (a), H(X|A) > H(X|B), so B should be split at the root of the tree, because entropy of B is lower which means the value of Information Gain is larger. Actually don’t need to solve (a) to conclude this since it can be seen in the table above that B is more biased to specific values than A, and so B is a purer prediction of X for the training data. Marks / 6 Page 25 of 43 Blank page for additional answers if needed. Page 26 of 43 Blank page for additional answers if needed. Page 27 of 43
学霸联盟