ST3010/ST4130

UNIVERSITY OF WARWICK THIRD YEAR EXAMINATIONS: SUMMER 2020

BAYESIAN STATISTICS AND DECISION THEORY

Time allowed: 2 hours. Calculators may be used in this examination.

ST3010 candidates: Answer THREE questions from Questions 1, 2, 3 and 4. Do not attempt Question 5. Full marks may be obtained by correctly answering three complete questions. Candidates may attempt all four questions. Marks will be awarded for the best three answers only.

ST4130 candidates: Answer TWO questions from Questions 1, 2, 3, 4 AND ALSO answer Question 5. Full marks may be obtained by correctly answering three complete questions. Candidates may attempt all questions. Marks will be awarded for the best two answers to Questions 1, 2, 3, 4 AND the answer to Question 5.

Question 1.

(a) What does it mean for a prior distribution to be conjugate for a given likelihood function? State an advantage of using a conjugate prior. [2]

(b) A random variable X has a binomial Bi(N, θ) probability mass function given by

    p(x|θ) = C(N, x) θ^x (1 − θ)^(N−x).

A decision maker has a beta Be(α, β) prior with density

    p(θ) = [Γ(α + β) / (Γ(α)Γ(β))] θ^(α−1) (1 − θ)^(β−1)

with mean μ = α/(α + β). Briefly discuss how you might elicit the parameters (α, β). [2]

(c) Use Bayes rule to find his posterior density explicitly as a function of (α, β, x, N) and hence show that this prior is conjugate for binomial sampling. [3]

(d) Model M1 assumes a uniform prior density with parameters (α1, β1) = (1, 1) whereas model M2 considers a beta prior density with (α2, β2) = (N, N). Calculate the posterior mean explicitly under these two different models as a function of the observed sample proportion x/N and explain the differences between the inferences from these two models. [3]

(e) A decision maker believes that with probability πi the model Mi holds, which entails using a Be(αi, βi) prior density pi(θ), i = 1, 2. He is sure that one or other of M1 or M2 must hold, so that π1 + π2 = 1.
He therefore uses a prior density p(θ) from the family P(α1, β1, α2, β2, π1, π2) whose densities have the form

    p(θ) = π1 p1(θ) + π2 p2(θ)

where pi(θ) are the beta densities defined above and π1 + π2 = 1. Prove that P(α1, β1, α2, β2, π1, π2) is also conjugate under binomial sampling and find the posterior density explicitly as a function of (α1, β1, α2, β2, π1, π2, x, N). [7]

(f) Define the Bayes Factor between two models in general terms. What is it used for? Calculate the Bayes Factor between the two models M1 and M2 in terms of x and N. [3]

[TOTAL: 20 marks]

Question 2.

(a) What does it mean for a probability forecaster of rain tomorrow to be well calibrated over a period of n days? Give an example where a forecast is of little value even if it is well calibrated. [2]

(b) Demonstrate which of the following three forecasters are empirically well calibrated. [3]

    Quoted probability q of rain tomorrow   0.0   0.25   0.5   0.75   1.0
    Rainy days when F1 quoted q               0     50     0    475     0
    Total days when F1 quoted q               0    400     0    600     0
    Rainy days when F2 quoted q               0     50   250    225     0
    Total days when F2 quoted q               0    200   500    300     0
    Rainy days when F3 quoted q               0      0   275      0   250
    Total days when F3 quoted q             200      0   550      0   250

(c) Define a proper scoring rule and the Brier score on a quoted probability q of rain. What is the advantage of using a proper scoring rule to elicit a probability? Prove that the Brier score is a proper scoring rule. [6]

(d) Write down the formula for the empirical Brier score in this example. Calculate these scores for the three forecasters above over this period of 1,000 days to discover who has the best empirical Brier score. [5]

(e) State how an observer can improve the empirical Brier score of a forecaster by retrospectively reinterpreting her quoted forecast. What is this process called? Calculate this improvement explicitly for F1.
Given F1's forecast of rain for tomorrow, can you think of a circumstance where it might still be better to use F1's actual forecast probability rather than its adjusted value? [4]

[TOTAL: 20 marks]

Question 3.

(a) Consider a Bayesian Network on six random variables {X1, X2, ..., X6} which assumes for i = 2, 3, ..., 6 the irrelevance statements Xi ⊥ Ri | Qi, where (Ri, Qi) is a partition of {X1, ..., X(i−1)}. Define the vertices and edges of a DAG of this Bayesian Network. When is a DAG said to be valid? [2]

(b) Consider the DAG G1 given below:

    X2 → X3 → X6
       ↗   ↓     ↑
    X1 → X4     X5

(edges X1→X3, X1→X4, X2→X3, X3→X4, X3→X6, X5→X6). Write down the conditional irrelevance statements corresponding to the DAG G1, using for i = 2, 3, ..., 6 the format Xi ⊥ Ri | Qi where (Ri, Qi) is a partition of {X1, ..., X(i−1)} which you need to specify. [2]

(c) Consider the DAG G2 given below:

    X2 → X3 → X6
     ↑ ↗   ↓ ↘   ↑
    X1 → X4     X5

(the edges of G1 together with X1→X2 and X3→X5). Write down the irrelevance statements corresponding to the DAG G2 in the same format as in the previous question. [2]

(d) Define the weak union property of conditional irrelevance. Use this property to show that if G1 is valid, then G2 is also valid. What graphical property of a DAG can be used to prove in a different way that if G1 is valid then G2 is also valid? [4]

(e) State the d-separation theorem, defining all terms. [3]

(f) Use this theorem to check which, if either, of the two irrelevance statements X1 ⊥ X5 | X4 and X2 ⊥ X5 | X6 can be deduced from the valid DAG G1. [2]

(g) What does it mean for a DAG to be decomposable? Is G1 decomposable? Is G2 decomposable? [1]

(h) Identify the cliques and separators of G2 and draw its junction tree. Briefly describe how you would use this junction tree to propagate the information that X5 = x5 to revise all the clique probability tables of G2. [4]

[TOTAL: 20 marks]

Question 4.

(a) What is the Certainty Money Equivalent of a monetary lottery?
Describe how the midpoint method can be used to elicit a Bayesian decision maker's utility function U over a one dimensional reward r when rewards have a lower and upper bound. How should this utility function be used to identify a Bayes decision? [5]

(b) A decision maker's utility function U1(d, r) takes the form

    U1(d, r) = 1 − exp(−λ r(d))

where λ > 0 and r(d) denotes the reward that he obtains from taking a decision d. He is indifferent between obtaining a reward r0 with certainty and a lottery giving him a reward r0 + h, h > 0, with probability α(λ) and a reward r0 − h with probability 1 − α(λ). Express α(λ) as an explicit function of λ and h which does not depend on r0. How would you describe the risk behaviour of this decision maker? [5]

(c) A decision maker has a utility function U1(d, r) of the form described above. He can take one of two decisions: a decision d1 which gives a financial reward which is normally distributed with mean 3 and variance 2, and a decision d2 which gives a normally distributed reward with mean 4 and variance 4. Find the values of λ for which this client will find d1 preferable to d2. You may want to use the fact that if X has a normal distribution with mean μ and variance σ² then its moment generating function is

    E(exp(tX)) = exp(μt + σ²t²/2). [5]

(d) Define what it means for a decision maker's utility function U to have three value independent attributes (x1, x2, x3). What algebraic form must U then take? [2]

(e) Suppose attribute x1 has conditional utility U1(d, x1) given above, where it is known that x1 ≥ 0, and that the two attributes x2 and x3 are both binary. Briefly describe how you would elicit U. [3]

[TOTAL: 20 marks]

Question 5.

A psychologist wants to investigate whether playing a violent video game (X1 = 1) or not (X1 = 0) on day t has a causative effect on the antisocial behaviour of a teenager on day t + 1.
Let X2 (measured on a discrete scale) denote the teenager's testosterone level on the morning of day t, and X3 his level of testosterone on the morning of day t + 1 (measured on the same scale as X2). Let X4 be an indicator of whether the teenager is involved in an antisocial incident (X4 = 1) or not (X4 = 0) on day t + 1. The psychologist proposes that the following DAG is valid:

    X2 → X3
       ↗   ↓
    X1 → X4

(edges X1→X3, X1→X4, X2→X3, X3→X4).

(a) State the two irrelevance hypotheses that the DAG embodies and explain them in English. [2]

(b) Now suppose that the psychologist believes this DAG also to be a causal Bayesian Network. Explain the meaning of the notation p(x2, x3, x4 || x1). Explicitly state a formula for p(x2, x3, x4 || x1), as well as for p(x1, x3, x4 || x2), p(x1, x2, x4 || x3) and p(x1, x2, x3 || x4). [3]

(c) Explain in words what beliefs would allow us to use the formula for p(x1, x3, x4 || x2). [3]

(d) Write down the formula for the total cause p(x4 || x1) of X1 on X4. Can this total cause be identified from the marginal distribution of (X1, X2, X4)? [3]

(e) Write down the formula for the total cause p(x4 || x3) of X3 on X4. Show that this total cause is not identified from the marginal distribution of (X2, X3, X4). [3]

(f) You know that there is always a direct relationship between X3 and X4 in the idle system, so the edge from X3 to X4 has to remain. Give two other edges from the graph that, if removed, would allow us to identify the total cause of X3 on X4 from the observed margin on (X2, X3, X4). State the corresponding irrelevance condition that would be assumed in each case. Calculate the formulae for this total cause in each case. [6]

[TOTAL: 20 marks]

End.

1.

(a) If a prior is conjugate, the posterior distribution lies in the same family as the prior distribution whatever value an observation takes. [bookwork, 1 mark] Speed of calculation, convenience or transparency of the learning formulae are all advantages of using a conjugate prior.
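The conjugacy treated in this question can be illustrated numerically. Below is a minimal sketch (the data N = 10, x = 3 and the test point are illustrative assumptions, not from the paper) which checks that the posterior is Beta(α + x, β + N − x) under both M1 and M2 and prints the two posterior means:

```python
from math import comb, gamma

def beta_pdf(t, a, b):
    # Be(a, b) density at t
    return gamma(a + b) / (gamma(a) * gamma(b)) * t**(a - 1) * (1 - t)**(b - 1)

def binom_pmf(x, N, t):
    return comb(N, x) * t**x * (1 - t)**(N - x)

def marginal_lik(x, N, a, b):
    # p_i(x) = C(N,x) Γ(a+b)Γ(a+x)Γ(b+N−x) / [Γ(a)Γ(b)Γ(a+b+N)], as in part (e)
    return comb(N, x) * (gamma(a + b) * gamma(a + x) * gamma(b + N - x)
                         / (gamma(a) * gamma(b) * gamma(a + b + N)))

N, x = 10, 3          # illustrative data, not from the paper
for a, b in [(1, 1), (N, N)]:          # models M1 and M2
    t = 0.37                           # arbitrary test point
    post = binom_pmf(x, N, t) * beta_pdf(t, a, b) / marginal_lik(x, N, a, b)
    # conjugacy: the normalised posterior equals the Be(a + x, b + N − x) density
    assert abs(post - beta_pdf(t, a + x, b + N - x)) < 1e-9
    print((a + x) / (a + b + N))       # posterior mean: M1 ≈ 0.333, M2 ≈ 0.433
```

With these values the M2 posterior mean sits much closer to the prior mean 1/2 than the M1 posterior mean does, exactly the shrinkage behaviour discussed in part (d).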
[bookwork, 1 mark]

(b) Several answers are possible here. I expect that most will say they will elicit the prior mean μ using a proper scoring rule and then α + β as the strength of this information in terms of the number of sample point equivalents. [bookwork, 2 marks]

(c) Using Bayes rule, we write the posterior p(θ|x) as a function of θ:

    p(θ|x) ∝ p(x|θ)p(θ) ∝ θ^x (1 − θ)^(N−x) θ^(α−1) (1 − θ)^(β−1) = θ^(α*−1) (1 − θ)^(β*−1)

where α* = α + x and β* = β + N − x. Since this posterior density is proportional to a Be(α*, β*) density it must be equal to it. [bookwork, 3 marks]

(d) The posterior mean μ* equals

    μ* = (α + x)/(α + β + N).

For model M1: μ* = (1 + x)/(2 + N). For model M2: μ* = (N + x)/(3N). When N > 1 the second prior, which has a smaller prior variance than the first, shrinks the estimate towards the prior mean more than the first. [direct application of bookwork, 3 marks]

(e) By Bayes rule

    p(θ|x) ∝ p(x|θ)p(θ) = p(x|θ)[π1 p1(θ) + π2 p2(θ)]
           = π1 p(x|θ)p1(θ) + π2 p(x|θ)p2(θ)
           = π1 p1(x) p1(θ|x) + π2 p2(x) p2(θ|x)

where pi(x) are the marginal likelihoods of Mi, i = 1, 2, and pi(θ|x) are the beta posterior densities associated with the priors pi(θ), so that αi* = αi + x and βi* = βi + N − x, i = 1, 2. Since this posterior density must integrate to unity we must therefore have

    p(θ|x) = π1* p1(θ|x) + π2* p2(θ|x)

where pi(θ|x) is a Be(αi*, βi*) density,

    πi* = πi pi(x) / [π1 p1(x) + π2 p2(x)]

and

    pi(x) = C(N, x) Γ(αi + βi) Γ(αi + x) Γ(βi + N − x) / [Γ(αi) Γ(βi) Γ(αi + βi + N)].

So this family is conjugate. [new application, 1 mark for starting the proof with the Bayes rule, 3 marks for the rest of the proof, and 3 marks for correctly stating the posterior density.]

(f) The Bayes Factor is used to select between two models and is defined as p(x|M1)/p(x|M2). [bookwork, 1 mark] Here this is equal to

    p1(x)/p2(x)
    = [Γ(α1+β1) Γ(α1+x) Γ(β1+N−x) Γ(α2) Γ(β2) Γ(α2+β2+N)] / [Γ(α2+β2) Γ(α2+x) Γ(β2+N−x) Γ(α1) Γ(β1) Γ(α1+β1+N)]
    = [Γ(2) Γ(1+x) Γ(1+N−x) Γ(N) Γ(N) Γ(3N)] / [Γ(2N) Γ(N+x) Γ(2N−x) Γ(1) Γ(1) Γ(2+N)]
    = [Γ(1+x) Γ(1+N−x) Γ(N) Γ(N) Γ(3N)] / [Γ(2N) Γ(N+x) Γ(2N−x) Γ(2+N)]

[new, 2 marks]

2.
(a) A forecaster is said to be well calibrated over a set of n time periods if, over the set of periods on which he quotes the probability of rain as q, the proportion q̂ of rainy days equals q: this being true for all probabilities he quotes. [bookwork, 1 mark] A forecast is of little value if the same probability q is quoted for every day, even if q really is the proportion of days on which rain occurs. [comprehension, 1 mark]

(b) F1 is not well calibrated since, for example, q̂1(0.25) = 50/400 = 0.125 ≠ 0.25. [direct application, 1 mark] Forecaster F2 is well calibrated since

    q̂2(0.25) = 50/200 = 0.25,  q̂2(0.5) = 250/500 = 0.5,  q̂2(0.75) = 225/300 = 0.75.

[direct application, 1 mark] Forecaster F3 is well calibrated since

    q̂3(0) = 0/200 = 0,  q̂3(0.5) = 275/550 = 0.5,  q̂3(1) = 250/250 = 1.

[direct application, 1 mark]

(c) A proper scoring rule L(a, q) is such that the expected score L(q|p) is uniquely minimised when q = p. [bookwork, 1 mark] The Brier score is defined as L(a, q) = (a − q)². [bookwork, 1 mark] A proper scoring rule encourages the rational forecaster to quote her true probability. [bookwork, 1 mark] Proof that the Brier score is a proper scoring rule:

    L(q|p) = p(1 − q)² + (1 − p)q² = p − 2pq + q²

This can be rearranged into L(q|p) = (q − p)² + p(1 − p), which is clearly minimised in q when q = p. Alternatively one can consider when the derivative of L(q|p), equal to −2p + 2q, is zero, which happens for q = p. [bookwork, 3 marks]

(d) The empirical Brier score in this example is S = Σ_q [x_q (1 − q)² + y_q q²], where x_q is the number of times it rained and y_q the number of times it didn't when he quoted q. [bookwork, 1 mark]

    S1 = 50·(9/16) + 350·(1/16) + 475·(1/16) + 125·(9/16) = (175×9 + 825)/16 = 150
    S2 = 50·(9/16) + 150·(1/16) + 500·(1/4) + 225·(1/16) + 75·(9/16) = (125×9 + 2000 + 375)/16 = 218.75
    S3 = 550·(1/4) = 137.5

The best score is the lowest, which is obtained by F3. [new application, 4 marks]

(e) The empirical score always improves if the observer substitutes the forecaster's forecast by the observed sample proportion.
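The calibration checks in (b) and the empirical Brier scores in (d) can be reproduced directly from the table; a short sketch:

```python
# Counts from the table in part (b): rainy days / total days at each quoted q
qs = [0.0, 0.25, 0.5, 0.75, 1.0]
forecasters = {
    'F1': ([0, 50, 0, 475, 0],   [0, 400, 0, 600, 0]),
    'F2': ([0, 50, 250, 225, 0], [0, 200, 500, 300, 0]),
    'F3': ([0, 0, 275, 0, 250],  [200, 0, 550, 0, 250]),
}

for name, (rain, total) in forecasters.items():
    # well calibrated: observed rainy proportion equals q wherever q was quoted
    calibrated = all(r == q * t for q, r, t in zip(qs, rain, total))
    # empirical Brier score S = sum over q of x_q (1 - q)^2 + y_q q^2
    score = sum(r * (1 - q)**2 + (t - r) * q**2
                for q, r, t in zip(qs, rain, total))
    print(name, calibrated, score)   # F1 False 150.0 / F2 True 218.75 / F3 True 137.5
```

Note that the miscalibrated F1 still beats the well-calibrated F2 on the Brier score, while F3 has the lowest (best) score of all, which is why calibration alone does not determine forecast quality.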
This is called recalibration. [bookwork, 1 mark] The improvement is then Σ_q n(q)(q − q̂(q))², where n(q) is the number of days on which q was quoted. In this example this is

    400·(0.125)² + 600·(0.04167)² = 6.25 + 1.042 = 7.292

[new application, 2 marks] Recalibration would be inappropriate, for example, if the weather forecaster was still learning and so adjusting their performance over the course of the 1,000 days, or if for any reason the forecast for tomorrow is not produced in the same way as previously. [comprehension, 1 mark]

3.

(a) A DAG (directed acyclic graph) of this Bayesian Network represents graphically the irrelevance statements assumed in the model. The DAG has vertices labelled {X1, X2, ..., X6} and has an edge from Xj to Xi with j < i if and only if Xj ∈ Qi. A DAG is valid if and only if the corresponding irrelevance statements are true. [bookwork, 2 marks]

(b) X2 ⊥ X1
    X3 ⊥ ∅ | (X1, X2)
    X4 ⊥ X2 | (X1, X3)
    X5 ⊥ (X1, X2, X3, X4)
    X6 ⊥ (X1, X2, X4) | (X3, X5)

Note that the second statement is degenerate and could be omitted, since X3 depends on both X1 and X2. [direct application, 2 marks]

(c) X2 ⊥ ∅ | X1
    X3 ⊥ ∅ | (X1, X2)
    X4 ⊥ X2 | (X1, X3)
    X5 ⊥ (X1, X2, X4) | X3
    X6 ⊥ (X1, X2, X4) | (X3, X5)

Note that the first and second statements are degenerate and could be omitted. [direct application, 2 marks]

(d) The weak union property states that X ⊥ YW | Z ⇒ X ⊥ W | ZY. [bookwork, 1 mark] The irrelevance statements above for G1 and G2 are identical except for those concerning X2 and X5. But using weak union we have

    X2 ⊥ X1 ⇒ X2 ⊥ ∅ | X1   and   X5 ⊥ (X1, X2, X3, X4) ⇒ X5 ⊥ (X1, X2, X4) | X3.

It follows that if G1 is valid then G2 is also valid. [direct application, 2 marks] It is always possible to add edges to a DAG and retain validity, because it is the absence (rather than the presence) of edges that makes a statement. Since G2 is the same as G1 with two added edges, we can deduce that G1 valid implies G2 valid.
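The graphical check used in parts (e) and (f) of this question, moralise the ancestral graph and then test separation, can be mechanised in a few lines. A sketch, hard-coding G1 by its parent sets; the two queries from part (f) are included as checks:

```python
from itertools import combinations

# G1 as node -> list of parents (read off the statements above)
G1 = {2: [], 3: [1, 2], 4: [1, 3], 5: [], 6: [3, 5]}

def ancestors(parents, nodes):
    """Smallest ancestral set containing `nodes`."""
    anc = set(nodes)
    changed = True
    while changed:
        changed = False
        for v in list(anc):
            for p in parents.get(v, []):
                if p not in anc:
                    anc.add(p)
                    changed = True
    return anc

def d_separated(parents, X, Y, Z):
    """Moralise the ancestral graph of X, Y, Z, then test whether Z separates X from Y."""
    keep = ancestors(parents, X | Y | Z)
    # skeleton of the ancestral graph
    edges = {frozenset((p, v)) for v in keep
             for p in parents.get(v, []) if p in keep}
    # moralise: marry all pairs of parents of a common child
    for v in keep:
        ps = [p for p in parents.get(v, []) if p in keep]
        edges |= {frozenset(e) for e in combinations(ps, 2)}
    # breadth-first search from X, blocked at Z; separated iff Y is unreachable
    frontier, seen = set(X), set(X)
    while frontier:
        v = frontier.pop()
        if v in Y:
            return False
        if v in Z:
            continue
        nxt = {w for e in edges if v in e for w in e} - seen
        frontier |= nxt
        seen |= nxt
    return True

print(d_separated(G1, {1}, {5}, {4}))  # True:  X1 ⊥ X5 | X4 can be deduced
print(d_separated(G1, {2}, {5}, {6}))  # False: X2 ⊥ X5 | X6 cannot (path X2-X3-X5)
```

In the second query the moralisation step adds the edge X3-X5 (parents of X6), which is exactly the path that defeats the separation.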
[bookwork, 1 mark]

(e) The d-separation theorem states that for any three disjoint sets of vertices X, Y, Z of a valid DAG G, the deduction X ⊥ Y | Z is valid if in the skeleton of the moralised ancestral graph of X, Y, Z we have that Z separates X from Y (i.e. any path from a vertex in X to a vertex in Y passes through a vertex in Z). The ancestral graph of X, Y, Z is the subgraph generated by keeping only the ancestors of (X, Y, Z). The moralised version of a directed graph is the mixed graph in which an undirected edge is added between any two unconnected parents of a common child. The skeleton of a graph is constructed by replacing every directed edge by an undirected one. [bookwork, 3 marks]

(f) For X1 ⊥ X5 | X4 the skeleton of the moralised ancestral graph has vertices {X1, ..., X5} and edges X1-X2, X1-X3, X2-X3, X1-X4, X3-X4, with X5 isolated. There is no path from X1 to X5 at all, so X4 separates them and the statement can be deduced. For X2 ⊥ X5 | X6 the skeleton of the moralised ancestral graph is on all six vertices with edges X1-X2, X1-X3, X2-X3, X1-X4, X3-X4, X3-X5, X3-X6, X5-X6, and this exhibits the path (X2, X3, X5) not passing through X6. So this statement cannot be deduced from G1. [direct application, 2 marks]

(g) A DAG is decomposable if and only if all pairs of parents of the same node are connected. G1 is not decomposable but G2 is decomposable. [bookwork, 1 mark]

(h) The cliques are C[1] = {X1, X2, X3}, C[2] = {X1, X3, X4} and C[3] = {X3, X5, X6}. The separator between C[1] and C[2] is {X1, X3}, the separator between C[2] and C[3] is {X3}, and one junction tree is C[1] - C[2] - C[3]. [application, 2 marks] Update C[3] using the usual conditioning rule and deduce the updated separator margin p+(X3). Now update the clique margin p(C[2]) adjacent to C[3] in the junction tree to p+(C[2]) where

    p+(C[2]) = p+(X1, X3, X4) = p(C[2]) · p+(X3)/p(X3).

Then update p(C[1]) using

    p+(C[1]) = p+(X1, X2, X3) = p(C[1]) · p+(X1, X3)/p(X1, X3)

where p(X1, X3) is given a priori and p+(X1, X3) is calculated from p+(C[2]) = p+(X1, X3, X4). [application, 2 marks]

4.

(a) The CME of a lottery is the maximum amount a DM is prepared to forfeit in order to enter that lottery.
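The propagation scheme in 3(h) can be checked numerically. The sketch below (random tables and binary variables are illustrative assumptions, not from the paper) builds a joint distribution with the factorisation of G2, forms the three clique margins, propagates the evidence X5 = 1 along the junction tree exactly as described, and confirms the result against direct conditioning:

```python
import numpy as np

rng = np.random.default_rng(0)

def cpt(*shape):
    """Random conditional probability table, normalised over the last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# Illustrative CPTs for G2 over binary X1..X6 (axes: x1=a, x2=b, x3=c, x4=d, x5=e, x6=f)
p1 = cpt(2)          # p(x1)
p2 = cpt(2, 2)       # p(x2 | x1)
p3 = cpt(2, 2, 2)    # p(x3 | x1, x2)
p4 = cpt(2, 2, 2)    # p(x4 | x1, x3)
p5 = cpt(2, 2)       # p(x5 | x3)
p6 = cpt(2, 2, 2)    # p(x6 | x3, x5)

joint = np.einsum('a,ab,abc,acd,ce,cef->abcdef', p1, p2, p3, p4, p5, p6)

# Clique margins of the junction tree C[1] - C[2] - C[3]
C1 = joint.sum(axis=(3, 4, 5))   # p(x1, x2, x3)
C2 = joint.sum(axis=(1, 4, 5))   # p(x1, x3, x4)
C3 = joint.sum(axis=(0, 1, 3))   # p(x3, x5, x6)

# Step 1: condition C[3] on the evidence X5 = 1
C3p = C3[:, 1, :] / C3[:, 1, :].sum()            # p+(x3, x6)
# Step 2: pass the separator ratio p+(x3)/p(x3) on to C[2]
ratio3 = C3p.sum(axis=1) / C3.sum(axis=(1, 2))
C2p = C2 * ratio3[None, :, None]                 # p+(x1, x3, x4)
# Step 3: pass the separator ratio p+(x1,x3)/p(x1,x3) on to C[1]
ratio13 = C2p.sum(axis=2) / C2.sum(axis=2)
C1p = C1 * ratio13[:, None, :]                   # p+(x1, x2, x3)

# Cross-check against direct conditioning on the full joint
cond = joint[:, :, :, :, 1, :]
cond = cond / cond.sum()
assert np.allclose(C2p, cond.sum(axis=(1, 4)))
assert np.allclose(C1p, cond.sum(axis=(3, 4)))
```

Each step only touches one clique table and one separator ratio, which is the point of the junction tree: evidence entered at C[3] reaches C[1] through local messages rather than through the full six-dimensional joint.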
[bookwork, 1 mark] Explanation of the midpoint method. Elicit from the DM the minimum r[0] and maximum r[1] that can be won from any decision. Ask for the CME r[0.5] of the lottery giving r[0] with probability 1/2 and r[1] with probability 1/2. Then ask for the CME r[0.25] of the lottery giving r[0] with probability 1/2 and r[0.5] with probability 1/2, and the CME r[0.75] of the lottery giving r[0.5] with probability 1/2 and r[1] with probability 1/2. Continue in this way, deriving the CME r[(x + y)/2] of the lottery giving r[x] with probability 1/2 and r[y] with probability 1/2, where x < y are adjacent in the set of currently evaluated points. Usually no more than 7 such interior rewards need to be evaluated. Now simply set U(r[x]) = x for all the elicited CME rewards and linearly interpolate the others. [bookwork, 3 marks] A Bayes decision is a decision which maximises the expectation of the utility function over the lottery probabilities of that decision. [bookwork, 1 mark]

(b) The expected utility of the CME must be equal to the expected utility of this gamble. So

    1 − exp(−λr0) = α(λ)[1 − exp(−λ(r0 + h))] + (1 − α(λ))[1 − exp(−λ(r0 − h))]
    ⟺ exp(−λr0) = α(λ) exp(−λ(r0 + h)) + (1 − α(λ)) exp(−λ(r0 − h))
    ⟺ 1 = α(λ) exp(−λh) + (1 − α(λ)) exp(λh)
    ⟺ 1 − exp(λh) = α(λ)[exp(−λh) − exp(λh)]
    ⟺ α(λ) = [exp(λh) − 1] / [exp(λh) − exp(−λh)]

So α(λ) is not a function of r0. [application, 4 marks] This decision maker is risk averse since his utility function is concave. [textbook, 1 mark]

(c) d1 ≻ d2 if the expected utility of the first is strictly greater than that of the second. Thus, using the given identity,

    E(1 − exp{−λR(d1)}) > E(1 − exp{−λR(d2)})
    ⟺ exp(−3λ + 2λ²/2) < exp(−4λ + 4λ²/2)
    ⟺ −3λ + λ² < −4λ + 2λ²
    ⟺ λ² − λ > 0
    ⟺ λ > 1

since λ > 0. [new, 4 marks] The larger the parameter λ, the larger the risk aversion. If λ > 1 the DM prefers d1 because, although its expected reward is lower, the larger variance of d2 makes very small rewards more likely.
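The closed form for the expected utility of a normal reward, E[1 − exp(−λX)] = 1 − exp(−λμ + λ²σ²/2), and the resulting λ > 1 threshold can be verified numerically; a sketch (the λ values, sample size and seed are arbitrary):

```python
from math import exp
import random

def eu_normal(lam, mu, var):
    # E[1 - exp(-lam X)] for X ~ N(mu, var), from the MGF at t = -lam
    return 1 - exp(-lam * mu + 0.5 * lam**2 * var)

mu1, v1 = 3.0, 2.0   # decision d1
mu2, v2 = 4.0, 4.0   # decision d2

# d1 is preferred exactly when lam > 1
assert eu_normal(1.5, mu1, v1) > eu_normal(1.5, mu2, v2)
assert eu_normal(0.5, mu1, v1) < eu_normal(0.5, mu2, v2)

# Monte Carlo cross-check of the MGF identity for one lam
lam = 1.5
random.seed(0)
draws = [random.gauss(mu1, v1**0.5) for _ in range(200_000)]
mc = sum(1 - exp(-lam * x) for x in draws) / len(draws)
print(mc, eu_normal(lam, mu1, v1))   # agree up to Monte Carlo error
```

At λ = 1 the two exponents −3λ + λ² and −4λ + 2λ² coincide, so the DM is exactly indifferent there, matching the algebra above.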
[comprehension, 1 mark]

(d) Three attributes are value independent if two decisions are equally preferred whenever they have identical marginal distributions over all their attributes. If this is the case then the utility function must take the form

    U(x) = Σ_{i=1}^{3} ki Ui(xi)

where Σ_{i=1}^{3} ki = 1, ki > 0, and the conditional utilities Ui(xi) are functions only of xi, i = 1, 2, 3, taking values in [0, 1]. [bookwork, 2 marks]

(e) Elicit U1(x1) by discovering α(λ) using some indifference lottery for a small h as above, fixing xi, i = 2, 3, at an arbitrary value. The other marginal utilities are degenerate and so do not need eliciting. Elicit the criterion weights by finding the value of ki, i = 2, 3, at which the DM is indifferent between obtaining the best possible reward on attribute i together with the worst possible rewards on the two other attributes, and a lottery giving the best of all three attributes with probability ki and the worst of all three attributes with probability 1 − ki. (Alternatively one could explain how to use the exchange rate method.) [application, 3 marks]

5.

(a) X2 ⊥ X1 states that whether the teenager played the game does not depend on his prior testosterone level. [direct application of bookwork, 1 mark] X4 ⊥ X2 | (X1, X3): the testosterone level before playing the game gives no additional relevant information about the teenager's inclination to antisocial behaviour provided that we know both whether he played the game and his current testosterone level. [direct application of bookwork, 1 mark]

(b) p(x2, x3, x4 || x1) represents the joint distribution of X2, X3, X4 after manipulation of X1 to take the value x1. [bookwork, 1 mark]

    p(x2, x3, x4 || x1) = p(x2) p(x3|x1, x2) p(x4|x1, x3)
    p(x1, x3, x4 || x2) = p(x1) p(x3, x4|x1, x2)
    p(x1, x2, x4 || x3) = p(x1) p(x2) p(x4|x1, x3)
    p(x1, x2, x3 || x4) = p(x1) p(x2) p(x3|x1, x2)

[new application of bookwork, 2 marks]

(c) We believe that manipulating the teenager's testosterone level to a given value would not affect whether he chooses to play the game.
So X1 keeps the distribution p(x1) it has in the idle system. Also, the subsequent testosterone level and the predilection to antisocial behaviour would be the same as if that testosterone level had arisen naturally rather than by manipulation; that is, (X3, X4) follow p(x3, x4 | x1, x2) just as they would under ordinary conditioning on X2. [comprehension, 3 marks]

(d) p(x4 || x1) = Σ_{x2} p(x2) Σ_{x3} p(x3|x1, x2) p(x4|x1, x2, x3)
             = Σ_{x2} p(x2) p(x4|x1, x2)
             = Σ_{x2} p(x1, x2, x4)/p(x1)
             = p(x1, x4)/p(x1)
             = p(x4|x1)

Because all terms in this formula are accessible from the (X1, X2, X4) margin, this total cause is identified. [application, 3 marks]

(e) p(x4 || x3) = Σ_{x2} p(x2) Σ_{x1} p(x1) p(x4|x1, x3)
             = Σ_{x1} p(x1) p(x4|x1, x3)
             = Σ_{x1} p(x1, x3, x4)/p(x3|x1)

Without observing X1, the weights p(x3|x1)⁻¹ in this sum are completely arbitrary positive numbers satisfying Σ_{x1} p(x3|x1)p(x1) = p(x3) for unknown probabilities p(x1). This total cause is therefore not identified from the margin of (X2, X3, X4). [comprehension, 3 marks]

(f) A first option is to remove the edge from X1 to X3. This corresponds to the irrelevance statement X3 ⊥ X1 | X2 which, combined with the original assumption X1 ⊥ X2, implies that X1 ⊥ X3. In this case we have

    p(x4 || x3) = Σ_{x1} p(x1) p(x4|x1, x3) = Σ_{x1} p(x1|x3) p(x4|x1, x3) = p(x4|x3)

which we can calculate from the margin on (X2, X3, X4). [application, 3 marks] A second option would be to remove the edge from X1 to X4. This corresponds to the irrelevance statement X4 ⊥ X1 | X3. We would then have

    p(x4 || x3) = Σ_{x1} p(x1) p(x4|x1, x3) = Σ_{x1} p(x1) p(x4|x3) = p(x4|x3)

and again we have the same formula. [application, 3 marks]
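The identification argument in 5(d) can be checked on a synthetic model: build conditional tables for the Q5 DAG, compute the total cause p(x4 || x1) by the truncated factorisation, and confirm it coincides with ordinary conditioning p(x4 | x1). A sketch (random tables and seed are illustrative assumptions, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def cpt(*shape):
    """Random conditional probability table, normalised over the last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# CPTs for the Q5 DAG: X1 and X2 are root nodes; X3 | X1, X2; X4 | X1, X3
p1 = cpt(2)           # p(x1), axis a
p2 = cpt(2)           # p(x2), axis b
p3 = cpt(2, 2, 2)     # p(x3 | x1, x2), axes a, b, c
p4 = cpt(2, 2, 2)     # p(x4 | x1, x3), axes a, c, d

joint = np.einsum('a,b,abc,acd->abcd', p1, p2, p3, p4)

# Total cause p(x4 || x1): the truncated factorisation drops the p(x1) factor
tc = np.einsum('b,abc,acd->ad', p2, p3, p4)      # rows x1, columns x4

# Since X1 ⊥ X2 here, the total cause equals ordinary conditioning p(x4 | x1)
p14 = joint.sum(axis=(1, 2))                      # p(x1, x4)
cond = p14 / p14.sum(axis=1, keepdims=True)       # p(x4 | x1)
assert np.allclose(tc, cond)
```

Repeating the exercise for p(x4 || x3) fails in the same way the solution describes: the weights p(x3|x1) needed to unpick the sum involve the unobserved X1, so no analogous identity with the (X2, X3, X4) margin is available.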


























































































































































































































































































































































































































































































































































































































































