COMP90051 Statistical Machine Learning
Project 2 Description1
Due date: 11:00am Thursday, 25 May 2023. Weight: 25%; forms a combined hurdle with Project 1.
Copyright statement: All the materials of this project—including this specification and code skeleton—are
copyright of the University of Melbourne. These documents are licensed for the sole purpose of your assessment
in COMP90051. You are not permitted to share or post these documents online.
Academic misconduct: You are reminded that all submitted Project 2 work is to be your own individual
work. Automated similarity checking software will be used to compare all submissions. It is University policy
that academic integrity be enforced. For more details, please see the policy at http://academichonesty.
unimelb.edu.au/policy.html. Academic misconduct hearings can determine that students receive zero for an
assessment, a whole subject, or are terminated from their studies. You may not use software libraries or code
snippets found online, from friends/private tutors, or anywhere else. You can only submit your own work.
Multi-armed bandits (MABs) are a simple yet powerful framework for sequential decision-making under
uncertainty. One of the many applications of MABs was pioneered by Yahoo! News, in deciding what news
items to recommend to users based on article content, user profile, and the historical engagement of the user
with articles. Given that decision making in this setting is sequential (what do we show next?) and feedback is only
available for articles shown, Yahoo! researchers observed a perfect fit for MABs like those (ε-Greedy
and UCB) discussed in Lecture 17 and Workshop Week 10.2 With {0, 1}-valued rewards representing clicks,
the per-round cumulative reward represents the click-through rate, which is exactly what online services want to
maximise to drive user engagement and advertising revenue. In this project, you will work individually to
implement several MAB algorithms. These are a little more advanced than what was covered in class, coming
from real research papers that you will have to read and understand yourself. You will be marked on your
submitted code. By the end of the project you should have developed:
ILO1. A deeper understanding of the MAB setting and common MAB approaches;
ILO2. An appreciation of how MABs are applied;
ILO3. Demonstrable ability to implement ML approaches in code; and
ILO4. Ability to read recent ML papers, and to understand their focus, contributions, and algorithms well enough
to implement and apply them (while ignoring details not needed for your task).
Overview
The project consists of three tasks.
1. MABs with delayed rewards (Lancewicki et al., 2021) [15 marks]
2. Thompson sampling MAB [4 marks]
3. Doubly-adaptive Thompson sampling MAB (Dimakopoulou, Ren & Zhou, 2021) [6 marks]
1Special thanks to Neil Marchant for generously co-designing this project.
2Note that the skeleton for this project may deviate from the code in Workshop Week 10.
All tasks are to be completed in the provided Python Jupyter notebook proj2.ipynb.3 Detailed instructions
for each task are included in this document. These tasks may require you to consult one or more academic
papers. We provide helpful pointers (marked with a margin symbol) throughout this project spec to guide your reading
and to correct any ambiguities.
MAB algorithms. The project’s tasks require you to implement MAB algorithms by completing provided
skeleton code in proj2.ipynb. All of the MAB algorithms must be implemented as sub-classes of a base MAB
class (defined for you). This ensures all MAB algorithms inherit the same interface, with the following methods:
• __init__(self, n_arms, ..., rng): Initialises the MAB with the given number of arms n_arms and
pseudo-random number generator rng. Arms are indexed by integers in the set {0, . . . , n_arms− 1}. All
MAB algorithms take additional algorithm-specific parameters in place of ‘...’. Hint: When instantiating
a MAB, you should set rng to be the random generator initialised for you in the notebook. This will ensure
your results are reproducible.
• play(self): Plays an arm. The method must return the integer index of the played arm in the set
{0, . . . , n_arms− 1}. Note: this method should not update the internal state of the MAB.
• update(self, arm, rewards): Updates the internal state of the MAB after playing an arm with integer
index arm receiving a real-valued reward.
Your implementations must conform to this interface. You may implement some functionality in the base
MAB class if you desire—e.g., to avoid duplicating common functionality in each sub-class. Your classes may also
use additional private methods to better structure your code. And you may decide how to use class inheritance.
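For concreteness, here is a minimal sketch of what a conforming sub-class could look like. It is illustrative only: the greedy strategy, the stand-in base class, and the attribute names are assumptions, not part of the skeleton; your actual sub-classes must extend the MAB base class provided in proj2.ipynb.

import numpy as np

# Stand-in base class so this sketch runs on its own. In the notebook, the real
# MAB base class is already defined for you -- do not redefine it.
class MAB:
    def __init__(self, n_arms, rng):
        self.n_arms = n_arms
        self.rng = rng

class GreedyMAB(MAB):
    """Illustrative sub-class: plays the arm with the highest empirical mean reward."""
    def __init__(self, n_arms, rng):
        super().__init__(n_arms, rng)
        self.counts = np.zeros(n_arms)   # number of rewards observed per arm
        self.sums = np.zeros(n_arms)     # sum of rewards observed per arm

    def play(self):
        # Unplayed arms get +inf so they are tried first; no internal state is modified here.
        means = np.where(self.counts > 0, self.sums / np.maximum(self.counts, 1), np.inf)
        best = np.flatnonzero(means == means.max())
        return int(self.rng.choice(best))   # tie-break uniformly at random

    def update(self, arm, rewards):
        # Per the interface: `rewards` is the real-valued reward received for `arm`.
        self.counts[arm] += 1
        self.sums[arm] += rewards

rng = np.random.default_rng(90051)   # in the notebook, use the provided rng instead
mab = GreedyMAB(n_arms=10, rng=rng)
arm = mab.play()
mab.update(arm, 1.0)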
Evaluating MAB classes. Each task directs you to evaluate your implementations in some way. You will
need to use the offline_eval function and dataset file provided. See Appendix A for details.
Python environment. You must use the Python environment used in workshops to ensure markers can
reproduce your results if required. We assume you are using Python ≥ 3.8, NumPy ≥ 1.19.0, SciPy ≥ 1.5.0
and Matplotlib ≥ 3.2.0.
Other constraints. You may not use functionality from external libraries/packages, beyond what is imported
in the provided Jupyter notebook, highlighted in this spec with a margin symbol. You must preserve the structure
of the skeleton code—please only insert your own code where specified. You should not add new cells to the
notebook. You may discuss the bandit learning slide deck or Python at a high-level with others, including via
Ed, but do not collaborate with anyone on direct solutions. You may consult resources to understand bandits
conceptually, but do not make any use of online code whatsoever. (We will run code comparisons against online
partial implementations to enforce these rules. See ‘academic misconduct’ statement above.)
Submission Checklist
You must complete all your work in the provided proj2.ipynb Jupyter notebook. When you are
ready to submit, follow these steps. You may submit multiple times. We will mark your last attempt.
1. Restart your Jupyter kernel and run all cells consecutively.
2. Ensure outputs are saved in the ipynb file, as we may choose not to run your notebook when grading.
3We appreciate that while some have entered COMP90051 feeling less confident with Python, many workshops so far have
exercised and built up basic Python and Jupyter knowledge. Both are industry standard tools for the data sciences.
3. Rename your completed notebook from proj2.ipynb to username.ipynb where username is your uni-
versity central username4.
4. Upload your submission to the Project 2 Canvas page.
Hint: it is a good idea to submit early as a backup. Try to complete Tasks 1.1 and 1.2 in the first week and
submit your (intermediate) notebook; it will help you understand other tasks and be a fantastic start!
Marking
Projects will be marked out of 25. Overall approximately 50%, 30%, 20% of available marks will come from
correctness, code structure & style, and experimentation. Markers will perform code reviews of your submissions
with indicative focus on the following. We will endeavour to provide (indicative not exhaustive) feedback.
The purpose of our feedback is to help you learn, and not to justify marks.
1. Correctness: Faithful implementation of the algorithm as specified in the given reference or clarified in
the specification with possible updates in the Canvas changelog. It is important that your code performs
other basic functions such as: raising errors if the input is incorrect, working for any dataset that meets
the requirements (i.e., not hard-coded), etc.
2. Code structure and style: Efficient code (e.g., making use of vectorised functions, avoiding recalculation
of expensive results); self-documenting variable names and/or comments; avoiding inappropriate data
structures, duplicated code and illegal package imports.
3. Experimentation: Each task you choose to complete directs you to perform some experimentation with
your implementation, such as evaluation or comparison.
Late submission policy. Late submissions will be accepted up to 4 days late, at a penalty of −2.5 marks per day or part thereof.
Task Descriptions
You do not have to attempt all tasks. For example, you may choose to complete Task 1, Tasks 1–2, or Tasks 1–3.
Task 1: MABs with delayed rewards [15 marks total]
The standard setup for MABs assumes rewards are received at the end of each round, before the next arm
is played. However in some applications, rewards may not be received immediately, and are instead received
after a finite (or infinite) delay. For example, in content recommendation, the MAB might immediately receive
feedback when a user clicks on an item, however feedback for not clicking on an item might not be received until
much later when the user’s session ends. In this Task 1 of the project, you will implement MAB algorithms for
the delayed rewards setting from the following paper:
Tal Lancewicki, Shahar Segal, Tomer Koren and Yishay Mansour. ‘Stochastic Multi-Armed Ban-
dits with Unrestricted Delay Distributions’, in Proceedings of the 38th International Conference
on Machine Learning (ICML’21), pp. 5969–5978. 2021. http://proceedings.mlr.press/v139/
lancewicki21a/lancewicki21a.pdf
There are four sub-tasks to complete. Before starting the sub-tasks, we recommend skimming Section 1 of
Lancewicki et al. (2021) for context, then carefully reading Section 2 (especially Protocol 1) where the delayed
rewards setting is defined.
4Canvas/UniMelb usernames look like brubinstein, not to be confused with email such as benjamin.rubinstein.
Sub-task 1.1: Adapting offline evaluation
In this sub-task, you will implement a function for evaluating MABs in the delayed rewards setting. Your
function will conduct the evaluation offline (a.k.a. off-policy) using a previously-collected dataset of uniformly-
random arm pulls and resulting rewards. This is in contrast with online evaluation, where the MAB is let
loose in a real environment, choosing arms to play and receiving rewards in real time. Offline evaluation was a
major breakthrough for industry deployments of MABs, as it allows many MABs to be evaluated on a single
historical dataset without exposing poorly-performing MABs to users (recall that a MAB will often begin with
little knowledge about the arm reward structure, so must necessarily suffer poor rewards in early rounds). To
assist you in this task, we have provided in the skeleton an offline_eval function that implements Algorithm 3
from the following paper:
Lihong Li, Wei Chu, John Langford, Robert E. Schapire, ‘A Contextual-Bandit Approach to Per-
sonalized News Article Recommendation’, in Proceedings of the Nineteenth International Conference
on World Wide Web (WWW 2010), Raleigh, NC, USA, 2010.
https://arxiv.org/pdf/1003.0146.pdf
This paper and the provided offline_eval function do not cover the delayed reward setting from Lancewicki
et al. (2021). Your task is to adapt offline_eval for delayed rewards, where the delay distribution is specified
by the delay argument. It may be helpful to read Section 4 of Li et al. (2010) to understand offline evaluation.5
Your adapted offline_eval function must conform to the interface specified in the function’s docstring. Note
that the delay argument, if specified, is assumed to be an instance of the DelayDistribution class defined in
the provided delay module. Three specific delay distributions are implemented in this module and imported
for use in subsequent sub-tasks. Hint: the docstrings for the DelayDistribution class explain how to use it to
generate a random delay conditioned on a pulled arm/observed reward.
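To make the adaptation concrete, the following sketch shows one possible structure for a delayed-reward replay loop. It is not the provided offline_eval, the function name is hypothetical, and it assumes a delay.sample(arm, reward) method; consult the DelayDistribution docstrings for the actual interface, and follow Protocol 1 of Lancewicki et al. (2021) for the exact timing convention.

import numpy as np

def offline_eval_delayed(mab, arms, rewards, delay=None, n_rounds=None):
    out = []        # reward recorded for each matched round
    pending = []    # buffered (available_from_round, arm, reward) triples
    t = 0           # index of the next matched round
    for logged_arm, logged_reward in zip(arms, rewards):
        if n_rounds is not None and t >= n_rounds:
            break
        # Deliver rewards observed at the end of an earlier round (Protocol 1 convention:
        # the reward of round s with delay d is observed at the end of round s + d).
        for _, a, r in [p for p in pending if p[0] < t]:
            mab.update(a, r)
        pending = [p for p in pending if p[0] >= t]
        if mab.play() != logged_arm:
            continue                    # logged event does not match the MAB's choice; discard it
        out.append(logged_reward)
        d = 0 if delay is None else int(delay.sample(logged_arm, logged_reward))
        pending.append((t + d, logged_arm, logged_reward))
        t += 1
    return np.array(out)                # per-round rewards, as required by footnote 5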
Sub-task 1.2: Successive Elimination MAB
After a brief detour on offline evaluation, this sub-task and the remaining sub-tasks focus on MAB algorithms
from Lancewicki et al. (2021). The first MAB algorithm we consider is Successive Elimination for reward-
independent random delays. After reading the paper’s description in Section 3.2 and pseudocode in Algorithm 4,
you are to implement Successive Elimination in the SE skeleton class. Note: a minor detail is omitted in the
final line of Algorithm 4, which describes how arms are eliminated from the active set S. Before eliminating
an arm, you should ensure S will not become empty. In other words, you should stop eliminating arms once
|S| = 1.
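For example, the elimination step could be guarded along these lines (S and to_eliminate are illustrative names, not from the skeleton):

# Illustrative guard: never let the active set S (a set of arm indices) become empty.
for arm in sorted(to_eliminate):   # to_eliminate: arms failing the elimination test
    if len(S) > 1:
        S.discard(arm)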
Experiments. After completing the SE class, it is time to perform some basic experimentation.
(a) Run offline evaluation using the code provided in the notebook:
se_delay = ParetoDelay([1, 1, 0.2, 1, 1, 0.2, 1, 1, 1, 1], rng=rng)
se_mab = SE(n_arms, n_rounds, rng=rng)
se_rewards = offline_eval(se_mab, arms, rewards, delay=se_delay, n_rounds=n_rounds)
print("SE average reward", np.mean(se_rewards))
This will run offline evaluation on the given dataset (see Appendix A) for 20000 rounds with Pareto-
distributed random delays.
(b) Run offline evaluation as above, but now plot the cumulative average reward at each round, averaged over
10 repeats. We have imported matplotlib.pyplot to help here.
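A possible sketch for part (b), reusing the names from part (a) and assuming rng, arms, rewards, n_arms and n_rounds are already defined in the notebook (matplotlib.pyplot is imported for you as plt):

import numpy as np
import matplotlib.pyplot as plt

n_repeats = 10
curves = []
for _ in range(n_repeats):
    se_delay = ParetoDelay([1, 1, 0.2, 1, 1, 0.2, 1, 1, 1, 1], rng=rng)
    se_mab = SE(n_arms, n_rounds, rng=rng)
    se_rewards = offline_eval(se_mab, arms, rewards, delay=se_delay, n_rounds=n_rounds)
    # Cumulative average reward at each round of this repeat.
    curves.append(np.cumsum(se_rewards) / np.arange(1, len(se_rewards) + 1))

# Repeats may match slightly different numbers of rounds; truncate to the shortest.
min_len = min(len(c) for c in curves)
mean_curve = np.mean([c[:min_len] for c in curves], axis=0)

plt.plot(mean_curve, label="SE")
plt.xlabel("Round")
plt.ylabel("Cumulative average reward")
plt.legend()
plt.show()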
5Note that Li et al. (2010) consider contextual MABs, where the mean reward for each arm is assumed to be a linear function of
a context vector. You can ignore context vectors for this project, as we focus on non-contextual MABs. Also note that Li et al.’s
Algorithm 3 returns the sum of the rewards, whereas your offline eval function should return an array containing the reward
received for each round (where a played arm matches a logged event in the dataset).
Sub-task 1.3: Phased Successive Elimination MAB
In this sub-task, you are to implement a variant of Successive Elimination, called Phased Successive Elimina-
tion. This MAB algorithm is also designed for reward-independent random delays, however it achieves better
theoretical performance (regret) when the delay of the best arm is larger than other arms. You are to implement
Phased Successive Elimination in the PSE skeleton class, based on the description in Section 3.3 and pseudocode
in Algorithm 5. Note: As for Sub-task 1.2, there is a minor omission for arm elimination in Algorithm 5: you
should stop eliminating arms if the active set S or Sℓ would become empty.
Experiments. After completing the PSE class, run offline evaluation of PSE using the given dataset with
Packet-Loss random delays, similar to the part (a) experiment in Sub-task 1.2. Then plot performance as in
the part (b) experiment in Sub-task 1.2, this time evaluating PSE with Packet-Loss random delays and including
SE as a baseline. That is, your evaluation should include a plot of the cumulative average
reward at each round for PSE and SE (on a single set of axes), averaged over 10 repeats.
Sub-task 1.4: Optimistic-Pessimistic Successive Elimination MAB
In this final sub-task, you are to implement another variant of Successive Elimination, called Optimistic-
Pessimistic Successive Elimination. Unlike the algorithms in Sub-tasks 1.2 and 1.3, this algorithm is designed
for reward-dependent random delays. After reading the paper’s description in Section 4 and pseudocode in Al-
gorithm 6, you are to implement Optimistic-Pessimistic Successive Elimination in the OPSE skeleton class. Note:
The final line in Algorithm 6 has the same omission as the previous sub-tasks: please follow the clarification in
Sub-task 1.2.
Experiments. After completing the OPSE class, run offline evaluation (as in the two experimental parts of the
previous sub-tasks) using the given dataset with the provided reward-dependent random delay. Then
include a plot of the cumulative average reward at each round, averaged over 10 repeats.
Task 2: Thompson sampling MAB [4 marks total]
Your second task is to implement a Thompson sampling MAB by completing the TS Python class. Thompson
sampling is named after William R. Thompson who discovered the idea in 1933, before the advent of machine
learning. It went unnoticed by the machine learning community until relatively recently, and is now regarded
as a leading MAB technique. When applied to MABs, Thompson sampling requires a Bayesian probabilistic
model of the rewards.
An introductory example of Thompson sampling can be found by reading (less than a page) Section 1.2
up to and including Algorithm 1 of the following paper from COLT 2012. In words, if you're obtaining binary
rewards in {0, 1}, Thompson sampling first requires a Bayesian model of these rewards per arm. E.g.,
Bernoulli likelihoods with unknown parameter (coin bias) µi per arm i; and a common Beta prior describing
the distribution/belief of these µi. Whenever a reward is observed (after pulling an arm), we can update the
Thompson sampling MAB, specifically updating the posterior with new data. That is, starting with a common
prior across arms, different arms get different posteriors which become priors for those arms in the next round.
When deciding what to play, we sample a µi for each arm according to its prior/posterior. Using this parameter
per arm, we can calculate the expected reward per arm (simply µi itself in this case), and play the arm with
the highest expected reward. When two arms offer equal expected rewards, we tie-break uniformly-at-random
among value-maximising arms.
Shipra Agrawal and Navin Goyal, ‘Analysis of Thompson sampling for the multi-armed bandit
problem’, in Proceedings of the Conference on Learning Theory (COLT 2012), 2012.
http://proceedings.mlr.press/v23/agrawal12/agrawal12.pdf
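As a concrete illustration of that Beta–Bernoulli example (not the Gaussian variant you are asked to implement below), a sketch might look as follows; all names here are assumptions for illustration only:

import numpy as np

rng = np.random.default_rng(0)
n_arms = 10
alpha = np.ones(n_arms)   # Beta(1, 1) prior per arm: alpha tracks successes + 1
beta = np.ones(n_arms)    # beta tracks failures + 1

def play():
    theta = rng.beta(alpha, beta)             # sample a coin bias mu_i per arm
    best = np.flatnonzero(theta == theta.max())
    return int(rng.choice(best))              # tie-break uniformly at random

def update(arm, reward):                      # reward in {0, 1}
    alpha[arm] += reward                      # posterior: Beta(alpha + r, beta + 1 - r)
    beta[arm] += 1 - reward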
Your task is to implement a Thompson sampling bandit by completing the TS Python class, just like
Algorithm 1 in the above COLT 2012 paper, but suitable for more general, real-valued rewards: you are to use
a Gaussian likelihood with mean distributed by a Gaussian (conjugate) prior. In more detail, assume rewards for
arm a are drawn from a Gaussian distribution with mean θ_a and (given as an input hyperparameter) variance σ².
Place a Gaussian prior on θ_a with mean µ and variance τ². Both µ and τ are also treated as hyperparameters.
When updating, after observing a reward for arm a, you can compute the posterior distribution for the unknown
mean reward θ_a and update arm a’s prior with this posterior distribution. Hint: In this case the posterior is
again Gaussian with
mean
σ2µ+ τ2sa
τ2na + σ2
; and
variance
σ2τ2
τ2na + σ2
,
where na is the number of rewards observed for arm a so far, and sa is the sum of rewards observed for arm a.
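To make the update concrete, here is a minimal, illustrative sketch of Gaussian Thompson sampling using exactly this posterior. It is not the provided TS skeleton; the class and attribute names are assumptions.

import numpy as np

class GaussianTSSketch:
    """Illustrative only: known reward variance sigma^2, N(mu, tau^2) prior on each arm's mean."""
    def __init__(self, n_arms, mu, sigma, tau, rng):
        self.mu, self.sigma, self.tau, self.rng = mu, sigma, tau, rng
        self.counts = np.zeros(n_arms)   # n_a: rewards observed per arm
        self.sums = np.zeros(n_arms)     # s_a: sum of rewards per arm

    def play(self):
        denom = self.tau**2 * self.counts + self.sigma**2
        post_mean = (self.sigma**2 * self.mu + self.tau**2 * self.sums) / denom
        post_var = (self.sigma**2 * self.tau**2) / denom      # reduces to the prior when n_a = 0
        theta = self.rng.normal(post_mean, np.sqrt(post_var))  # one draw per arm
        best = np.flatnonzero(theta == theta.max())
        return int(self.rng.choice(best))                      # uniform tie-breaking

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.sums[arm] += reward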
Experiments. Once your TS class is implemented, it is time to perform some basic experimentation. Run
offline evaluation (as in the first task) without delays using the given dataset and hyperparameter values specified
in the notebook. Then include a plot of the cumulative average reward at each round, averaged over 10 repeats.
Task 3: Doubly-adaptive Thompson sampling MAB [6 marks total]
Many MAB algorithms, including UCB (covered in class) and Thompson sampling (Task 2), estimate the mean
reward of each arm by averaging reward samples. However, this may result in inaccurate estimates, since the
samples are gathered adaptively, with some arms being selected more frequently than others. While UCB and
Thompson sampling are theoretically optimal (in terms of regret) in the worst case, it may be possible to
improve the performance by accounting for adaptivity to obtain more accurate estimates of the mean reward.
In this task, you will implement a Thompson sampling MAB that accounts for adaptivity using doubly-robust
estimators, as proposed in the following paper:
Maria Dimakopoulou, Zhimei Ren and Zhengyuan Zhou. ‘Online Multi-Armed Bandits with Adap-
tive Inference’, in Advances in Neural Information Processing Systems 34 (NeurIPS’21), pp. 1939–
1951. 2021. https://openreview.net/pdf?id=kVHxBqPcn_
You may skim the first few pages of this paper up to Section 3.2, for motivation and background on reducing
the bias (error) of estimates computed using adaptive samples. The algorithm you are to implement is called
Doubly-adaptive Thompson sampling and is described in Section 3.3 and Algorithm 1. Your implementation
should be completed in the provided DATS skeleton class. Clarification: there is a typo in the arm elimination
procedure of Algorithm 1: min_{a′ ∈ A_t} should be replaced with min_{a′ ∈ A_t \ {a}}. In other words, when computing p_{t+1,a}
to decide whether to eliminate arm a, you should minimise over arms in the current active set A_t excluding arm
a. Hint 1: to compute the propensity score π_{t+1,a} in the second-last line of Algorithm 1, you need to estimate
the probability that arm a returns the largest reward, assuming the reward r̃_{t+1,a′} for arm a′ is distributed as
N(µ̂_{t,a′}, κ²σ̂²_{t,a′}). The authors recommend estimating π_{t+1,a} by Monte Carlo simulation. Practically this can be
done by sampling n rewards from the posterior of each arm, and computing the relative frequency with
which arm a achieves the largest reward among the samples. Hint 2: you can use norm.cdf from scipy.stats
(imported for you) to compute the cumulative distribution function of the standard normal distribution.
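A rough sketch of this Monte Carlo estimate is given below; mu_hat, sigma_hat, kappa and active stand in for µ̂_t, σ̂_t, κ and the active set A_t, and the function name is hypothetical.

import numpy as np

def propensity_estimates(mu_hat, sigma_hat, kappa, active, rng, n_samples=10000):
    """Estimate, for each active arm, the probability it yields the largest sampled reward.
    mu_hat, sigma_hat: arrays over all arms; active: array of active arm indices."""
    active = np.asarray(active)
    # Draw n_samples rewards per active arm from N(mu_hat, (kappa * sigma_hat)^2).
    draws = rng.normal(mu_hat[active], kappa * sigma_hat[active],
                       size=(n_samples, len(active)))
    winners = np.argmax(draws, axis=1)                       # which active arm wins each sample
    freq = np.bincount(winners, minlength=len(active)) / n_samples
    pi = np.zeros(len(mu_hat))
    pi[active] = freq                                        # relative frequency of being best
    return pi

rng = np.random.default_rng(0)
pi = propensity_estimates(np.array([0.1, 0.3, 0.2]), np.array([0.05, 0.1, 0.08]),
                          kappa=2.0, active=np.array([0, 1, 2]), rng=rng)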
Experiments. After implementing the DATS class, run offline evaluation without delays using the given
dataset, as in previous tasks. The hyperparameter values are specified for you in the notebook based on
the paper’s recommendations. Next, run offline evaluation to produce a plot, including TS from Task 2 as a
baseline. Your plot should display the cumulative average reward at each round for DATS and TS (on a single
set of axes), averaged over 10 repeats.
A Appendix: Dataset Description
Alongside the skeleton notebook, we provide a 1.2 MB dataset.txt file suitable for validating MAB implemen-
tations. You should ensure this file is located in the same directory as your local proj2.ipynb. It is formatted
as follows:
• 300,000 lines (i.e., rows) corresponding to distinct site visits by users—“events” in the language of Sub-
task 1.1;
• Each row comprises 2 space-delimited columns of integers:
– Column 1: The arm played by a uniformly-random policy out of 10 arms (news articles);
– Column 2: The reward received from the arm played: 1 if the user clicked, 0 otherwise.
The offline_eval function defined in Sub-task 1.1 should be able to run on data from this file where column 1
forms arms and column 2 forms rewards.
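For example, the file can be loaded along the following lines (a sketch only; it assumes dataset.txt sits next to the notebook):

import numpy as np

data = np.loadtxt("dataset.txt", dtype=int)   # shape (300000, 2)
arms = data[:, 0]      # arm played by the uniformly-random logging policy
rewards = data[:, 1]   # observed click (1) / no click (0)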