机器学习代写-机器学习课程|学霸联盟

机器学习代写-机器学习课程

时间：2020-11-26

Week 11 exercises This is the eighth and last page of assessed questions, as described in the background notes. These questions form 70% of your mark for Week 11. The introductory questions in the notes and the Week 11 discussion group task form the remaining 30% of your mark for Week 11. Unlike the questions in the notes, you’ll not immediately see any example answers on this page. However, you can edit and resubmit your answers as many times as you like until the deadline (Friday 04 December 4pm UK time, UTC). This is a hard deadline: This course does not permit extensions and any work submitted after the deadline will receive a mark of zero. See the late work policy. Queries: Please don’t discuss/query the assessed questions on hypothesis until after the deadline. If you think there is a mistake in a question this week, please email Iain. Please only answer what’s asked. Markers will reward succinct to-the-point answers. You can put any other observations in the “Add any extra notes” button (but this is for your record, or to point out things that seemed strange, not to get extra credit). Some questions ask for discussion, and so are open-ended, and probably have no perfect answer. For these, stay within the stated word limits, and limit the amount of time you spend on them (they are a small part of your final mark). Feedback: We’ll return feedback on your submission via email by Friday 11 December. Good Scholarly Practice: Please remember the University requirements for all assessed work for credit. Furthermore, you are required to take reasonable measures to protect your assessed work from unauthorised access. For example, if you put any such work on a public repository then you must set access permissions appropriately (permitting access only to yourself). You may not publish your solutions after the deadline either. 1 The Laplace approximation In note w10b, the second question was about inferring the parameter λ of a Poisson distribution based on an observed count r. You found the Laplace approximation to the posterior over λ given r. Now reparameterize the model in terms of ` = log λ. After performing the change of variables1 , the improper prior on log λ becomes uniform, that is p(`) is constant. a) Find the Laplace approximation to the posterior over `=log λ. Include the derivation of your result in your answer. [10 marks] [The website version of this note has a question here.] b) Which version of the Laplace approximation is better? To help answer this question, we recommend that you plot the true and approximate posteriors of λ and ` for different values of the integer count r. However, we don’t want your code or plots. Describe the relevant findings in no more than 150 words. [15 marks] [The website version of this note has a question here.] 2 Classification with unbalanced data Classification tasks often involve rare outcomes, for example predicting click-through events, fraud detection, and disease screening. We’ll restrict ourselves to binary classification, y ∈ {0, 1}, where the positive class is rare: P(y=1) is small. 1. Some review of how probability densities work: Conservation of probability mass means that: p(`)d` = p(λ)dλ for small corresponding elements d` and dλ. Dividing and taking limits: p(`) = p(λ)|dλ/d`|, evaluated at λ = exp(`). The size of the derivative |dλ/d`| is referred to as a Jacobian term. MLPR:w11q Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/ 1 We are likely to see lots of events before observing enough rare y = 1 events to train a good model. To save resources, it’s common to only keep a small fraction f of the y = 0 ‘negative examples’. A classifier trained naively on this sub-sampled data would predict positive labels more frequently than they actually occur. For Bayes classifiers we could set the class probabilities based on the original class counts (or domain knowledge). This question explores what to do for logistic regression. We write that an input-output pair occurs in the world with probability P(x, y), and that the joint probability of an input-output pair (x, y) and keeping it (k) is P(x, y, k). Because conditional probabilities are proportional to joint probabilities, we can write: P(y | x, k) ∝ P(x, y | k) ∝ P(x, y, k) = (P(x, y) y = 1 f P(x, y) y = 0. (1) A general result that you may quote in answering this question is that for any binary probability distribution: P(z) ∝ ( c z = 1 d z = 0, (2) we can write P(z=1) = σ(log c log d), where σ(a) = 1/(1 + e a). a) We train a logistic regression classifier, with a bias weight, to match subsampled data, so that P(y=1 | x, k) ≈ σ(w>x + b). i) Use the above results to argue that we should add log f to the bias parameter to get a model for the real-world distribution P(y | x) ∝ P(y, x). [15 marks] [The website version of this note has a question here.] ii) Explain whether we have changed the bias in the direction that you would expect. Write no more than one sentence. [5 marks] [The website version of this note has a question here.] b) Consider the following approach. We may wish to minimize the loss: L = EP(x,y)[log P(y | x)] . Multiplying the integrand by 1= P(x, y | k) P(x, y | k) changes nothing, so we can write: L = Z ∑y P(x, y | k) P(x, y) P(x, y | k) log P(y | x) dx = Ep(x,y | k) P(x, y) P(x, y | k) log P(y | x) ≈ − 1N N∑n=1 P(x(n), y(n)) P(x(n), y(n) | k) log P(y(n) | x(n)), where x(n), y(n) come from the subsampled data. This manipulation is a special case of a trick known as importance sampling (see note w11a). We have converted an expectation under the original data distribution, into an expectation under the subsampling distribution. We then replaced the formal expectation with an average over subsampled data. We can use the same idea to justify multiplying the gradients for y=0 examples by 1/ f , when training a logistic regression classifier on subsampled data. MLPR:w11q Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/ 2 P(x, y | k) was defined up to a constant. Thus the “importance weights” applied to each log probability are: P(x, y) P(x, y | k) ∝  P(x,y) P(x,y) y = 1 P(x,y) f P(x,y) y = 0 ∝ (1 y = 1 1/ f y = 0. (3) We can substitute these values of 1 or 1/ f into the loss. The unknown constant scales the loss surface, but does not change where the minimum is. The loss for a positive example is then log P(y=1 | x) as usual, whereas the loss for a negative example becomes (1/ f)log P(y=0 | x). The gradients for negative examples are then also scaled by (1/ f). This method says that if we throw away all but 1/1000 of the negative examples, we should up-weight the effect of each remaining negative example by 1000× to make up for it. Compare this approach to the approach from part a) for training models based on subsampled data for binary classification, giving pros and cons of each. Write no more than 150 words. [25 marks] [The website version of this note has a question here.]