CS 486/686 Assignment 4 (66 marks)
Blake VanBerlo
Due Date: 11:59 PM ET on Tuesday, April 5, 2022, with a no-penalty extension to
11:59 PM ET on Thursday, April 14, 2022
Changes
• v1.1: Fixed typo in Q1(a)
Academic Integrity Statement
If your written submission on Learn does not include this academic integrity statement with
your signature (typed name), we will deduct 5 marks from your final assignment mark.
I declare the following statements to be true:
• The work I submit here is entirely my own.
• I have not shared and will not share any of my code with anyone at any point.
• I have not posted and will not post my code on any public or private forum or website.
• I have not discussed and will not discuss the contents of this assessment with anyone
at any point.
• I have not posted and will not post the contents of this assessment and its solutions
on any public or private forum or website.
• I will not search for assessment solutions online.
• I am aware that misconduct related to assessments can result in significant penalties,
possibly including failure in the course and suspension. This is covered in Policy 71:
https://uwaterloo.ca/secretariat/policies-procedures-guidelines/policy-71.
Failure to accept the integrity policy will result in your assignment not being graded.
By typing or writing my full legal name below, I confirm that I have read and understood
the academic integrity statement above.
Instructions
• Submit any written solutions in a file named writeup.pdf to the A4 Dropbox on Learn.
If your written submission on Learn does not contain one file named writeup.pdf, we
will deduct 5 marks from your final assignment mark.
• Submit any code to Marmoset at https://marmoset.student.cs.uwaterloo.ca/.
Grades for the programming component are determined by the unit tests in the “Assignment 4 (Final)” project on Marmoset. The “Assignment 4 (Week 1)” and “Assignment 4 (Week 2)” projects contain ungraded public test suites meant to help you debug, and they are only temporarily available.
• No late assignment will be accepted. This assignment is to be done individually.
• I strongly encourage you to complete your write-up in LaTeX, using this source file.
If you do, in your submission, please replace the author with your name and student
number. Please also remove the due date, the Instructions section, and the Learning
goals section.
• Lead TAs:
– Paulo Pacheco
– Dake Zhang
The TAs’ office hours will be posted on MS Teams.
Learning goals
Solving Markov Decision Processes with Value Iteration
• Trace the execution of the value iteration algorithm for solving a Markov Decision
Process.
Reinforcement Learning
• Implement active adaptive dynamic programming, active Q-learning, and active SARSA.
• Implement various exploration methods for reinforcement learning.
1 A Gambling Game (30 marks)
You are out at the casino for a night of gambling with your friends. You head over to a cards
table where the dealer is advertising a new game. You are interested, so the dealer explains
the rules.
When the game starts, you are dealt a card from a deck. The deck contains only 1, 2, 3, and
4 cards, in equal amounts. Cards are drawn with replacement. You are given the option to
“Draw” or “Stop”. If you say “Draw”, the dealer will give you another card. If the sum of
the card values in your hand is greater than 8, the game ends and you win no money. If the
sum of the card values in your hand is exactly 8, you win $10. If you say “Stop”, then the game
ends and you receive $1 if the total of the points of the cards in your hand is less than or equal
to 3, or $4 if the total is at least 4 but less than or equal
to 7. For example, if you say “Stop” and your hand contains a 3 and a 4, then you win $4.
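These rules fully specify the game's dynamics. For intuition, here is a minimal Monte-Carlo sketch (not part of the assignment code; the policy function and its name are hypothetical) that simulates episodes of the game and estimates the expected winnings of a simple policy:

import random

def play_episode(policy):
    """Simulate one episode of the gambling game.

    policy: maps a hand total in 1..7 to "Draw" or "Stop".
    Returns the money won: $0, $1, $4, or $10.
    """
    total = random.randint(1, 4)              # initial card, drawn uniformly from {1, 2, 3, 4}
    while True:
        if total == 8:
            return 10                         # hand totals exactly 8: win $10
        if total > 8:
            return 0                          # total exceeds 8: win nothing
        if policy(total) == "Stop":
            return 1 if total <= 3 else 4     # $1 for totals 1-3, $4 for totals 4-7
        total += random.randint(1, 4)         # "Draw": receive another uniform card

# Example: estimate the expected winnings of "draw until the total reaches 6".
random.seed(0)
estimate = sum(play_episode(lambda t: "Draw" if t < 6 else "Stop")
               for _ in range(100_000)) / 100_000
print(round(estimate, 4))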
Please complete the following tasks.
(a) You will formulate this game as a Markov Decision Process.
We provide the set of states, which is as follows: S = {s1, s2, s3, s4, s5, s6, s7, s8, sA, sB, sC}.
For x ∈ {1, 2, . . . , 7}, sx refers to the nonterminal state in which you have cards in your
hand whose total is x. s8 is a terminal state that results from the player’s hand totalling
exactly 8. sA is a terminal state that results when the player says “Stop” and their
hand’s card total is in {1, 2, 3}, winning $1. sB is a terminal state that results when
the player says “Stop” and their hand’s card total is in {4, 5, 6, 7}, winning $4. sC is
a terminal state that results when the sum of the cards in the player’s hand is greater
than 8, winning $0.
Please define the remaining components of this Markov Decision Process:
• A: The set of actions
• P(s′|s, a): The transition distribution
• P(s0): The initial state distribution
• R(s): The reward function (money earned in the state)
Please express the transition distribution in tabular form. You can fill in the table below
with your transition probabilities:
P(s′|s, Draw):

s \ s′    s1    s2    s3    s4    s5    s6    s7    s8    sA    sB    sC
s1
s2
s3
s4
s5
s6
s7

P(s′|s, Stop):

s \ s′    s1    s2    s3    s4    s5    s6    s7    s8    sA    sB    sC
s1
s2
s3
s4
s5
s6
s7
Marking Scheme:
(10 marks)
• (2 marks) Correct definition for the set of actions
• (3 marks) Correct definition of the transition distribution
• (1 mark) Correct definition of the initial state distribution
• (2 marks) Correct definition of the reward function
• (2 marks) Formulation is clear and concise
(b) Execute value iteration for this MDP to determine the optimal expected utilities for each
state, V∗(s). Use γ = 1. Terminate when max_s |Vi(s) − Vi−1(s)| = 0.
Please round your values to 4 decimal places. You do not need to show your calculations.
Continue the table below with the value function for each iteration. The first row of
the table already contains V0(s), which are the initial value estimates for each state.
For nonterminal states, the initial value estimate is 0. For terminal states, the initial
value estimates are the rewards received when entering them. Fill in one row for each
iteration. That is, row i should contain Vi(s). You may not need to use all the rows, but
you will not need more rows. Once you are finished, be sure to clearly identify V ∗(s).
Note that you are given the values for the terminal states, which do not change: V∗(s8) =
10, V∗(sA) = 1, V∗(sB) = 4, and V∗(sC) = 0.
s         s1    s2    s3    s4    s5    s6    s7
V0(s)      0     0     0     0     0     0     0
V1(s)
V2(s)
V3(s)
V4(s)
V5(s)
V6(s)
V7(s)
V8(s)
V9(s)
V10(s)
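For reference, a generic value-iteration sketch matching this setup is shown below. It is not provided starter code: the arrays P, R, and terminal are placeholder names, and the Bellman update convention V(s) = R(s) + γ max_a Σ_{s′} P(s′|s, a) V(s′), with terminal values held fixed at their rewards, is an assumption consistent with the initial values described above.

import numpy as np

def value_iteration(P, R, terminal, gamma=1.0):
    """Generic value iteration.

    P: array of shape (n_states, n_actions, n_states) with P[s, a, s2] = P(s2 | s, a)
    R: 1D array of rewards R(s)
    terminal: 1D boolean array marking terminal states
    Returns the converged value estimates V*(s).
    """
    V = np.where(terminal, R, 0.0)              # V0: terminal rewards, 0 for nonterminal states
    while True:
        # Expected next-state value for every (s, a) pair, then the best action for each s
        V_new = R + gamma * np.max(P @ V, axis=1)
        V_new[terminal] = R[terminal]           # terminal values never change
        if np.max(np.abs(V_new - V)) == 0:      # stop when max_s |V_i(s) - V_{i-1}(s)| = 0
            return V_new
        V = V_new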
Marking Scheme:
(10 marks)
• (8 marks) Fraction of correctly completed iterations
• (2 marks) Correct V ∗(s)
(c) Based on the V∗(s) that you calculated in the previous question, determine the optimal
policy for this game, π∗(s). To do so, complete the following table. Round all calculations
to 4 decimal places. If there is a tie between the optimal Q-values for state s, set
π∗(s) = Draw.
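Each entry can be computed from the values found in part (b). As a reminder, under the same reward convention used above (nonterminal states earn $0):

Q∗(s, a) = R(s) + γ Σ_{s′} P(s′|s, a) V∗(s′),    π∗(s) = argmax_a Q∗(s, a)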
s Q∗(s,Draw) Q∗(s, Stop) π∗(s)
s1
s2
s3
s4
s5
s6
s7
Marking Scheme:
(7 marks)
• (7 marks) Correct Q(s, a) and π∗(s)
(d) If the payoff for having a hand total of 8 is large enough, then the optimal policy might
change for s7. How large would R(s8) have to be in order to change π∗(s7) from the
value that you calculated in the previous question?
Marking Scheme:
(3 marks)
• (2 marks) Correct value for R(s8)
• (1 mark) Concise explanation
2 Reinforcement Learning (36 marks)
You will explore the wumpus world using the Q-learning and SARSA algorithms.
Section 2.1 describes our version of the wumpus world. Section 2.2 describes the Q-learning
and SARSA algorithms. Section 2.3 describes action selection strategies that manage the
exploration/exploitation trade-off.
We have provided three Python files. Please read the comments in the provided files carefully.
1. wumpus_env.py
contains a WumpusWorld class that simulates the dynamics of the wumpus world. It
implements the OpenAI gym interface (OpenAI gym environments are commonly used by
researchers as benchmarks to assess the performance of reinforcement learning methods).
A short interaction sketch follows this list. Do not change this file.
2. utils.py
contains functions for random sampling from various distributions. To pass the unit
tests, use these functions for ALL random sampling in your implementation.
Do not change this file.
3. rl.py
Implement all the empty functions except for select_action(). You will use select_action()
to sample actions. Do not change any function signatures.
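Since WumpusWorld follows the gym-style reset/step interface used in the pseudocode of Section 2.2, a quick way to get familiar with it is a random rollout. The sketch below is only illustrative: the reset()/step() calls follow Algorithms 1 and 2, while the is_terminal() check is an assumption, so verify the exact interface in wumpus_env.py before relying on it.

# Hypothetical smoke test -- not part of the assignment.
import numpy as np
from wumpus_env import WumpusWorld
from utils import sample_integer_from_categorical_distribution

env = WumpusWorld()                      # default constructor arguments
env.seed(0)                              # fix the random seed for reproducibility
s = env.reset()                          # initial state
uniform = np.ones(env.n_actions) / env.n_actions
total_reward = 0
while not env.is_terminal(s):            # assumed helper; see wumpus_env.py
    a = sample_integer_from_categorical_distribution(uniform)  # uniformly random action
    s, r = env.step(a)                   # next state and immediate reward
    total_reward += r
print("Undiscounted return of one random episode:", total_reward)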
Please complete the following tasks.
1. Implement all the empty functions in rl.py. Submit rl.py only on Marmoset.
Marking Scheme: (30 marks)
Unit tests
• select_action_epsilon_greedy
(1 public test + 2 secret tests) * 1 mark = 3 marks
• select_action_softmax
(1 public test + 2 secret tests) * 2 marks = 6 marks
• select_action_optimistically
(1 public test + 2 secret tests) * 1 mark = 3 marks
• active_q_learning
(1 public test + 2 secret tests) * 3 marks = 9 marks
• active_sarsa
(1 public test + 2 secret tests) * 3 marks = 9 marks
2. You will investigate the performance of Q-learning and SARSA with different ex-
ploration strategies on the Wumpus World environment.
Produce four plots for the following combinations.
• Q-learning + softmax exploration with T = 0.1
• SARSA + softmax exploration with T = 0.1
• Q-learning + optimistic utility exploration with Ne = 2, R+ = 999
• SARSA + optimistic utility exploration with Ne = 2, R+ = 999
Use the learning rate α = 0.5 and discount factor γ = 0.99. Instantiate the Wumpus
World environment with the default constructor arguments.
Run the program 3 times. Each time, run the program for 50,000 episodes. For each
run, use a different random seed by calling WumpusWorld.seed(s) prior to learning. In
each plot, the x-axis denotes the episodes, and the y-axis denotes the total undiscounted
reward obtained in each episode averaged over the 3 runs.
To make the plot smoother, we will plot each value to be the moving average of the
previous 50 episodes. A moving average is defined as the average of the last N elements
in a series, where N is the window size. You can use the Python command below to
calculate a moving average of a 1D numpy array.
rewards_mov_avg = np.convolve(rewards, np.ones(N)/N, mode='valid')
We have included an example plot in Figure 1 depicting the performance of Q-learning
with ϵ-greedy exploration (ϵ = 0.1).
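A rough outline for producing one of the four plots (Q-learning with softmax exploration) is sketched below. The call to active_q_learning and its return value are assumptions about your own rl.py, so adapt the names and arguments to your implementation.

# Sketch only: the signature and return value of active_q_learning are assumptions.
import numpy as np
import matplotlib.pyplot as plt
from wumpus_env import WumpusWorld
from rl import active_q_learning        # hypothetical import; adjust to your rl.py

N, n_episodes, seeds = 50, 50_000, [0, 1, 2]
all_rewards = []
for seed in seeds:
    env = WumpusWorld()                  # default constructor arguments
    env.seed(seed)                       # a different random seed for each run
    # Assumed to return the learned Q-values and a length-50,000 array with the
    # total undiscounted reward of each episode (softmax exploration, T = 0.1).
    Q, rewards = active_q_learning(env, alpha=0.5, gamma=0.99, n_episodes=n_episodes,
                                   exploration="softmax", T=0.1)
    all_rewards.append(rewards)

avg = np.mean(all_rewards, axis=0)                          # average over the 3 runs
smoothed = np.convolve(avg, np.ones(N) / N, mode='valid')   # moving average, window N = 50
plt.plot(smoothed)
plt.xlabel("Episode")
plt.ylabel("Total undiscounted reward (moving average over 50 episodes)")
plt.title("Q-learning with softmax exploration (T = 0.1)")
plt.savefig("q_learning_softmax.png")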
Marking Scheme: (4 marks)
• (1 mark) Plot for Q-learning with softmax exploration
• (1 mark) Plot for SARSA with softmax exploration
• (1 mark) Plot for Q-learning with optimistic utility estimates
• (1 mark) Plot for SARSA with optimistic utility estimates
3. If you were to explore the wumpus world, which learning algorithm and exploration
strategy would you choose among the options below?
Figure 1: Q-learning with ϵ-greedy action selection (ϵ = 0.1) averaged across 3 runs,
smoothed with a moving average (N = 50). Note the poor performance – this is your
baseline.
• Q-learning vs. SARSA
• softmax vs. optimistic utility
Provide your answer and justify it with no more than 3 sentences. You can base your
answer on the previous parts or perform more experiments yourself.
Marking Scheme: (2 marks) A reasonable justification.
2.1 The Wumpus World
The following description is adapted from Artificial Intelligence: A Modern Approach by
Russell and Norvig. Check section 7.2 on page 236 in the 3rd edition or section 7.2 on page
210 in the 4th edition. The only difference between our version and the textbook version
is that the three movement actions (Forward, TurnLeft, TurnRight) have no effect with a
small probability κ.
The wumpus world is a cave consisting of rooms connected by passageways. Lurking somewhere
in the cave is the terrible wumpus, a beast that eats anyone who enters its room.
The agent can shoot the wumpus, but the agent has only one arrow. Some rooms contain
bottomless pits that will trap the agent if the agent wanders into these rooms. The only
redeeming feature of this bleak environment is the possibility of finding a heap of gold.
The agent takes one of the following actions during each time step:
• Forward: This action is noisy with an error probability of κ ∈ [0, 1]. With a probability
of 1− κ, the agent moves to the next square in the direction the agent is facing. If the
agent bumps into a wall, it remains in the same square. With a probability of κ, the
action has no effect and the agent remains in the same square.
• TurnLeft: This action is noisy with an error probability of κ ∈ [0, 1]. With a probabil-
ity of 1−κ, the agent rotates 90 degrees to the left (counterclockwise) and remains
in the same square. With a probability of κ, the action has no effect and the agent
faces the same direction as before.
• TurnRight: This action is noisy with an error probability of κ ∈ [0, 1]. With a proba-
bility of 1− κ, the agent rotates 90 degrees to the right (clockwise) and remains in
the same square. With a probability of κ, the action has no effect and the agent faces
the same direction as before.
• Grab: The agent obtains the gold if the gold is in the same square as the agent.
Otherwise, nothing happens.
• Shoot: The agent fires an arrow in a straight line in the direction the agent is facing.
The arrow continues until it hits and kills a wumpus or hits a wall. The agent has one
arrow. If the agent has already used the arrow, this action has no effect.
• Climb: The agent climbs out of the cave if it is at the entrance. Otherwise, this action
has no effect. Once the agent climbs out of the cave, this episode of the wumpus world
journey ends.
The agent may obtain the following rewards.
• +1000 for climbing out of the cave with the gold.
• −1000 for falling into a pit.
• −1000 for being eaten by the wumpus.
• −1 for each action taken (even if the action has no effect).
• −10 for using up the arrow.
In addition to knowing its current location and orientation, the agent may receive the fol-
lowing observations:
• In the squares directly (not diagonally) adjacent to the wumpus, the agent will perceive
a stench.
• In the squares directly (not diagonally) adjacent to a pit, the agent will perceive a
breeze.
• In the square where the gold is, the agent will perceive a glitter.
• When the agent bumps into a wall, it will perceive a bump.
• When the wumpus is killed, it emits a woeful scream that can be perceived anywhere
in the cave.
2.2 The Active Q-Learning and SARSA Algorithms
Let’s define some quantities that we will use in both algorithms.
• γ: the discount factor in (0, 1].
• R(s): the immediate reward for entering state s.
• Q(s, a): the Q-values for each state-action pair. This is a 2D numpy array as shown
below.
Q = np.zeros((env.n_states, env.n_actions))
• N(s, a): the number of times the agent has tried each state-action pair (s, a).
• α: the learning rate (α > 0).
• E: the number of episodes in which the agent collects experience.
The Q-learning algorithm is in Algorithm 1. The pseudocode differs from that in the lecture
notes because we are only executing the main loop for a fixed number of episodes, rather than
checking for convergence of the Q-values. See Section 2.3 for a description of the exploration
strategies you may use. Note that Q-learning is off-policy; that is, we don’t use the current
policy to estimate the actual Q-value, assuming instead that a greedy policy is followed.
The SARSA algorithm is given in Algorithm 2. SARSA stands for state-action-reward-state-
action. Like Q-learning, it applies a temporal difference update rule to learn the Q-values.
Unlike Q-learning, SARSA is an on-policy algorithm, meaning that the Q-values are estimated
using the agent’s current policy.
2.3 Managing the Trade-off between Exploration and Exploitation
During active reinforcement learning, there is a trade-off between exploration and exploita-
tion. You will implement three different strategies for managing this trade-off.
Algorithm 1 Active Q-learning
Require: α > 0 ▷ Learning rate
Require: γ ∈ (0, 1] ▷ Discount factor
Q(s, a)← 0 ▷ Initialize Q-values
N(s, a)← 0 ▷ Initialize state-action visit counts (for action selection)
for i ∈ {1, 2, . . . , E} do ▷ Conduct E episodes
s← env.reset() ▷ Get initial state
while s is not terminal do ▷ Take actions until the state is terminal.
a← select action(s) ▷ Select action according to action strategy
s′, r ← env.step(a) ▷ Take action a and generate an experience ⟨s, r, a, s′⟩
Q(s, a) ← Q(s, a) + α(r + γ max_{a′} Q(s′, a′) − Q(s, a))  ▷ Update Q(s, a)
N(s, a)← N(s, a) + 1 ▷ Update state-action visit count
s← s′ ▷ Update current state
end while
end for
return Q
Algorithm 2 SARSA
Require: α > 0 ▷ Learning rate
Require: γ ∈ (0, 1] ▷ Discount factor
Q(s, a)← 0 ▷ Initialize Q-values
N(s, a)← 0 ▷ Initialize state-action visit counts (for action selection)
for i ∈ {1, 2, . . . , E} do ▷ Conduct E episodes
s← env.reset() ▷ Get initial state
a← select action(s) ▷ Select initial action according to action strategy
while s is not terminal do ▷ Take actions until the state is terminal.
s′, r ← env.step(a) ▷ Take action a and generate an experience ⟨s, r, a, s′⟩
a′ ← select action(s′) ▷ Select next action according to action strategy
Q(s, a) ← Q(s, a) + α(r + γ Q(s′, a′) − Q(s, a))  ▷ Update Q(s, a)
N(s, a)← N(s, a) + 1 ▷ Update state-action visit count
s← s′ ▷ Update current state
a← a′ ▷ Update current action
end while
end for
return Q
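The only difference between the two algorithms is the bootstrapping target in the temporal-difference update. A minimal sketch of the two update rules is shown below; the function and variable names are illustrative, with Q being the numpy array defined in Section 2.2.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy update from Algorithm 1: bootstrap from the greedy action in s'."""
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy update from Algorithm 2: bootstrap from the action actually selected for s'."""
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])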
Important: To implement these strategies, you will need to select random samples and
apply the argmax operation. To ensure reproducibility for testing purposes, please use the
following functions provided in utils.py.
• sample_integer_from_categorical_distribution(P): samples an integer from a categorical
distribution with probability distribution function P, which is a 1D numpy array
• argmax_with_random_tiebreaking(v): takes the argmax of the 1D array v, breaking
ties with uniformly random selection from the maximal elements
Below are brief descriptions of the three strategies you will implement in this assignment; a
consolidated sketch of all three follows the list.
1. ϵ-greedy strategy: ϵ is the probability of taking a random action. The agent takes
a random action with a probability of ϵ and takes the action with the largest Q value
with a probability of 1− ϵ.
To implement ϵ-greedy, first sample an integer from the categorical distribution ⟨ϵ, 1−
ϵ⟩. If 0 is returned, sample the action from a uniform categorical distribution over
actions. Otherwise if 1 is returned, select the action with the maximum Q-value.
2. Softmax action selection: The agent selects each action with a probability that is
proportional to the Q value for the action. The probability of choosing action a in
state s is given below.
P(a | s) = e^{Q(s,a)/T} / Σ_{a′} e^{Q(s,a′)/T}    (1)
T is a parameter called the temperature. T affects the shape of the distribution. When
T is large, the distribution is close to being uniform. As T decreases, the distribution
becomes better at distinguishing actions with different Q values. As T approaches 0,
the distribution approaches a point mass with a probability of 1 for the action with
the largest Q value.
Hint: You may encounter overflow when Q is very large. Try shifting the Q-values (i.e.
subtracting a scalar) before plugging them into the softmax formula.
3. Optimistic utility estimates: Another strategy for managing the trade-off is to
modify the utility estimates instead of modifying the action selection strategy. Then
select the greedy action with respect to the utility estimates, Q+. Until the state-action
pair (s, a) has been visited Ne times, the utility estimate is the maximum possible
reward obtainable in any state (hence optimistic). Afterwards, the utility estimate is
the expected utility, Q(s, a).
Q+(s, a) ← R(s) + γ max_a f(Q+(s, a), N(s, a))

where f(u, n) = R+ if n < Ne, and f(u, n) = u otherwise.
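For reference, here is a consolidated sketch of the three selection strategies using the two helpers from utils.py. The function names and argument lists are illustrative only: the graded functions in rl.py have their own signatures, select_action() is the provided entry point for sampling actions, and the optimistic variant below shows only the greedy selection over the modified estimates described above.

import numpy as np
from utils import (sample_integer_from_categorical_distribution,
                   argmax_with_random_tiebreaking)

def epsilon_greedy(Q, s, epsilon):
    """Random action with probability epsilon, greedy action otherwise."""
    # Sample from the categorical distribution <epsilon, 1 - epsilon>.
    explore = sample_integer_from_categorical_distribution(np.array([epsilon, 1.0 - epsilon]))
    if explore == 0:
        # Explore: sample uniformly over the actions.
        n_actions = Q.shape[1]
        return sample_integer_from_categorical_distribution(np.ones(n_actions) / n_actions)
    # Exploit: action with the maximum Q-value, ties broken uniformly at random.
    return argmax_with_random_tiebreaking(Q[s])

def softmax(Q, s, T):
    """Sample an action with probability proportional to exp(Q(s, a) / T) (Equation 1)."""
    q = Q[s] - np.max(Q[s])            # shift the Q-values to avoid overflow in exp()
    probs = np.exp(q / T)
    probs /= np.sum(probs)             # normalize into a categorical distribution
    return sample_integer_from_categorical_distribution(probs)

def optimistic(Q, N, s, Ne, R_plus):
    """Greedy action with respect to the optimistic estimates Q+."""
    # Use the optimistic value R+ until (s, a) has been tried Ne times,
    # then fall back to the learned estimate Q(s, a).
    q_plus = np.where(N[s] < Ne, R_plus, Q[s])
    return argmax_with_random_tiebreaking(q_plus)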