
CM50270 Reinforcement Learning
Coursework Part 2: Racetrack
In this exercise, you will implement and compare the performance of three reinforcement
learning algorithms: On-Policy First-Visit Monte-Carlo Control, Sarsa, and Q-Learning.
Total number of marks: 40 marks.
What to submit: Your completed Jupyter notebook (.ipynb file), which should include all of your code, results, and written answers. Please do not include any identifying information on the files you submit. This coursework will be marked anonymously.
If you are asked to use specific variable names, data-types, function signatures and notebook
cells, please ensure that you follow these instructions. Not doing so will cause the automarker
to reject your work, and will assign you a score of zero for that question. If the automarker
rejects your work because you have not followed the instructions, you may not get any credit for your work.
Please do not use any non-standard, third-party libraries apart from numpy and matplotlib. In
this part of the coursework, you should also use the racetrack_env file, which we have provided
for you. If we are unable to run your code because you have used unsupported external
libraries, you may not get any credit for your work.
Please be sure to restart the kernel and run your code from start-to-finish (Kernel → Restart
& Run All) before submitting your notebook. Otherwise, you may not be aware that you are using
variables in memory that you have deleted.
Your total runtime must be less than 8 minutes on the University's computers.
The Racetrack Environment
We have implemented a custom environment called "Racetrack" for you to use during this piece
of coursework. It is inspired by the environment described in the course textbook (Reinforcement
Learning, Sutton & Barto, 2018, Exercise 5.12), but is not exactly the same.
Environment Description
Consider driving a race car around a turn on a racetrack. In order to complete the race as quickly
as possible, you would want to drive as fast as you can but, to avoid running off the track, you
must slow down while turning.
In our simplified racetrack environment, the agent is at one of a discrete set of grid positions. The
agent also has a discrete speed in two directions, $x$ and $y$. So the state is represented as
follows:
$$(\text{position}_y, \text{position}_x, \text{velocity}_y, \text{velocity}_x)$$
The agent collects a reward of -1 at each time step, an additional -10 for leaving the track (i.e.,
ending up on a black grid square in the figure below), and an additional +10 for reaching the finish
line (any of the red grid squares). The agent starts each episode in a randomly selected grid-
square on the starting line (green grid squares) with a speed of zero in both directions. At each
time step, the agent can change its speed in both directions. Each speed can be changed by +1, -1
or 0, giving a total of nine actions. For example, the agent may change its speed in the $x$ direction by -1 and its speed in the $y$ direction by +1. The agent's speed cannot be greater than +10 or less than -10 in either direction.
The agent's next state is determined by its current grid square, its current speed in two directions,
and the changes it makes to its speed in the two directions. This environment is stochastic. When
the agent tries to change its speed, no change occurs (in either direction) with probability 0.2. In
other words, 20% of the time, the agent's action is ignored and the car's speed remains the same
in both directions.
If the agent leaves the track, it is returned to a random start grid-square and has its speed set to
zero in both directions; the episode continues. An episode ends only when the agent transitions
to a goal grid-square.
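The velocity rule described above can be sketched as follows. This is an illustrative stand-in, not the code in racetrack_env.py; the function name apply_action, its signature, and the parameter names ignore_prob and max_speed are my own:

```python
import random

def apply_action(velocity, action_delta, ignore_prob=0.2, max_speed=10):
    """Apply a speed change (dy, dx) to the current velocity (vy, vx).

    With probability `ignore_prob`, the action is ignored and the velocity
    is unchanged in both directions (the environment's 20% stochasticity).
    Each velocity component is clamped to [-max_speed, +max_speed].
    """
    if random.random() < ignore_prob:
        return velocity  # action ignored: speed unchanged in both directions
    vy = max(-max_speed, min(max_speed, velocity[0] + action_delta[0]))
    vx = max(-max_speed, min(max_speed, velocity[1] + action_delta[1]))
    return (vy, vx)
```

For example, a car already at the +10 speed cap in $y$ that tries to accelerate further in $y$ stays at +10.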

Environment Implementation
We have implemented the above environment in the racetrack_env.py file, for you to use in this coursework. Please do not modify the environment.
We provide a RacetrackEnv class for your agents to interact with. The class has the following
methods:
reset() - this method initialises the environment, chooses a random starting state, and
returns it. This method should be called before the start of every episode.
step(action) - this method takes an integer action (more on this later), and executes one
time-step in the environment. It returns a tuple containing the next state, the reward
collected, and whether the next state is a terminal state.
render(sleep_time) - this method renders a matplotlib graph representing the
environment. It takes an optional float parameter giving the number of seconds to display
each time-step. This method is useful for testing and debugging, but should not be used
during training since it is very slow. Do not use this method in your final submission.
get_actions() - a simple method that returns the available actions in the current state.
Always returns a list containing integers in the range [0-8] (more on this later).
In our code, states are represented as Python tuples - specifically a tuple of four integers. For
example, if the agent is in a grid square with coordinates ($Y = 2$, $X = 3$), and is moving zero
cells vertically and one cell horizontally per time-step, the state is represented as (2, 3, 0, 1) .
Tuples of this kind will be returned by the reset() and step(action) methods.
There are nine actions available to the agent in each state, as described above. However, to
simplify your code, we have represented each of the nine actions as an integer in the range [0-8].
The table below shows the index of each action, along with the corresponding changes it will
cause to the agent's speed in each direction.
For example, taking action 8 will increase the agent's speed in the $x$ direction, but decrease its
speed in the $y$ direction.
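If you want to work with the speed changes directly in your own code, one way to enumerate the nine $(dy, dx)$ combinations is sketched below. Note that this ordering is illustrative only: the actual index-to-action mapping is defined in racetrack_env.py and shown in the table above, and may order the combinations differently.

```python
from itertools import product

# One possible enumeration of the nine speed-change actions as (dy, dx)
# pairs. NOTE: illustrative only -- consult the table above / racetrack_env.py
# for the actual index-to-action mapping used by the environment.
ACTION_DELTAS = list(product((-1, 0, 1), repeat=2))

assert len(ACTION_DELTAS) == 9  # 3 choices per direction, 3 * 3 = 9 actions
```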
Racetrack Code Example
Below, we go through a quick example of using the RaceTrackEnv class.
First, we import the class, then create a RaceTrackEnv object called env . We then initialise the environment using the reset() method, and take a look at the initial state variable and the result of render() .

As you can see, reset() has returned a valid initial state as a four-tuple. The render() method uses the same colour-scheme as described above, but also includes a yellow grid-square to indicate the current position of the agent.
Let's make the agent go upward by using step(1) , then inspect the result (recall that action 1
increments the agent's vertical speed while leaving the agent's horizontal speed unchanged).
You can see that the agent has moved one square upwards, and now has a positive vertical speed
(indicated by the yellow arrow). Let's set up a loop to see what happens if we take the action a few
more times, causing it to repeatedly leave the track.
%matplotlib inline
# Set random seed to make example reproducible.
import numpy as np
import random

seed = 5
random.seed(seed)
np.random.seed(seed)

from racetrack_env import RacetrackEnv

# Instantiate environment object.
env = RacetrackEnv()

# Initialise/reset environment.
state = env.reset()
env.render()
print("Initial State: {}".format(state))

# Let us increase the agent's vertical speed (action 1).
next_state, reward, terminal = env.step(1)
env.render()
print("Next State: {}, Reward: {}, Terminal: {}".format(next_state, reward, terminal))

num_steps = 50
for t in range(num_steps):
    next_state, reward, terminal = env.step(1)
    env.render()

Exercise 1: On-Policy MC Control (8 Marks)
In this exercise, you will implement an agent which learns to reach a goal state in the racetrack
task using On-Policy First-Visit MC Control, the pseudocode for which is reproduced below
(Reinforcement Learning, Sutton & Barto, 2018, Section 5.4 p.101).
A tabular On-Policy First-Visit MC Control agent which learns an optimal policy in the
racetrack environment.
An average learning curve. Your learning curve should plot the mean undiscounted
return from many agents as a function of episodes. Please specify how many agents'
performances you are averaging in the title of your plot. This should be a dynamic figure that
we can produce by running your code. If you wish to use any kind of graph smoothing, please ensure that your smoothing method does not cause artifacts near the edges of the plot.
Please use the following parameter settings:
Discount factor $\gamma = 0.9$.
For your $\epsilon$-greedy policy, use exploratory action probability $\epsilon = 0.15$.
Number of training episodes $= 150$.
Number of agents averaged should be at least 20.
If you use incorrect parameters, you may not get any credit for your work.
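As a rough reference (not a required structure), the core of an On-Policy First-Visit MC Control agent is an ε-greedy action selector plus a per-episode first-visit update of the Q-table; the helper names epsilon_greedy and first_visit_mc_update below are my own, and the Q-table is assumed to be a dict keyed by (state, action):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.15):
    """Pick a random action with probability epsilon, else a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def first_visit_mc_update(Q, visit_counts, episode, gamma=0.9):
    """Update Q from one episode of (state, action, reward) triples,
    using an incremental mean over first-visit returns."""
    # Record the first time-step at which each (state, action) pair occurs.
    first_visit = {}
    for t, (s, a, _) in enumerate(episode):
        first_visit.setdefault((s, a), t)
    G = 0.0
    # Sweep backwards, accumulating the discounted return G_t.
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = gamma * G + r
        if first_visit[(s, a)] == t:  # only update on the first visit
            visit_counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / visit_counts[(s, a)]
```

For a two-step episode with rewards -1 then +10 and $\gamma = 0.9$, the first state-action pair's return is $-1 + 0.9 \times 10 = 8$.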
# Please write your code for Exercise 1 in this cell or in as many cells as you want ABOVE this cell.
# You should implement your MC agent and plot your average learning curve here.
# Do NOT delete this cell.

Exercise 2: Sarsa (4 Marks)
In this exercise, you will implement an agent which learns to reach a goal state in the racetrack task using the Sarsa algorithm, the pseudocode for which is reproduced below (Reinforcement Learning, Sutton & Barto, 2018, Section 6.4 p.129).
A tabular Sarsa agent which learns an optimal policy in the racetrack environment.
An average learning curve. Your learning curve should plot the mean undiscounted return from many agents as a function of episodes. Please specify how many agents' performances you are averaging in the title of your plot. This should be a dynamic figure that we can produce by running your code. If you wish to use any kind of graph smoothing, please ensure that your smoothing method does not cause artifacts near the edges of the plot.
Please use the following parameter settings:
Step size parameter $\alpha = 0.2$.
Discount factor $\gamma = 0.9$.
For your $\epsilon$-greedy policy, use exploratory action probability $\epsilon = 0.15$.
Number of training episodes $= 150$.
Number of agents averaged should be at least 20.
If you use incorrect parameters, you may not get any credit for your work.
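For reference, the Sarsa update from Exercise 2 is a one-liner; this sketch (the function name and the terminal-state handling are my own choices) assumes a Q-table stored as a dict keyed by (state, action):

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, terminal, alpha=0.2, gamma=0.9):
    """One on-policy TD(0) (Sarsa) update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a)),
    where a' is the action actually selected in s'. If s' is terminal,
    the bootstrap term is dropped and the target is just r."""
    target = r if terminal else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Note that the next action a' must be the one your ε-greedy policy actually takes in s', which is what makes Sarsa on-policy.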
Exercise 3: Q-Learning (4 Marks)
In this exercise, you will implement an agent which learns to reach a goal state in the racetrack
task using the Q-Learning algorithm, the pseudocode for which is reproduced below
(Reinforcement Learning, Sutton & Barto, 2018, Section 6.5 p.131).
A tabular Q-Learning agent which learns an optimal policy in the racetrack environment.
An average learning curve. Your learning curve should plot the mean undiscounted
return from many agents as a function of episodes. Please specify how many agents'
performances you are averaging in the title of your plot. This should be a dynamic figure that
we can produce by running your code. If you wish to use any kind of graph smoothing, please ensure that your smoothing method does not cause artifacts near the edges of the plot.
Please use the following parameter settings:
Step size parameter $\alpha = 0.2$.
Discount factor $\gamma = 0.9$.
For your $\epsilon$-greedy policy, use exploratory action probability $\epsilon = 0.15$.
Number of training episodes $= 150$.
Number of agents averaged should be at least 20.
If you use incorrect parameters, you may not get any credit for your work.
Hint: Your Q-Learning implementation is likely to be similar to your Sarsa implementation. Think
hard about where these two algorithms differ.
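As a hedged sketch (the helper name and signature are my own, and the Q-table is again assumed to be a dict keyed by (state, action)), the Q-Learning update differs from Sarsa only in its bootstrap term:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, terminal, alpha=0.2, gamma=0.9):
    """One off-policy TD(0) (Q-Learning) update. The only change from Sarsa
    is the target: it bootstraps from max_b Q(s', b) -- the greedy value of
    the next state -- rather than from the action actually taken next."""
    target = r if terminal else r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```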
# Please write your code for Exercise 2 in this cell or in as many cells as you want ABOVE this cell.
# You should implement your Sarsa agent and plot your average learning curve here.
# Do NOT delete this cell.

# Please write your code for Exercise 3 in this cell or in as many cells as you want ABOVE this cell.
# You should implement your Q-Learning agent and plot your average learning curve here.
# Do NOT delete this cell.

Exercise 4: Comparison & Discussion (8 Marks)
Please produce a plot which shows the data from your previous three graphs plotted together on the same set of axes. Be sure to include plot elements such as axis labels, titles, and a legend, so that it is clear which data series corresponds to the performance of which agent. If we are not able to easily interpret your plots, you may not get any credit for your work.
Please note that you should not re-train your agents from scratch. You should re-use your results from the previous exercises.
To improve the visual clarity of your graphs, you may wish to apply some kind of cropping or
smoothing. If you choose to do this, please also include an un-altered version of your graph.
Please ensure that any graph smoothing method you use does not cause artifacts near the edges
of the plot.
In eight sentences or fewer, please discuss the following:
The performance of your different agents.
Why each of your agents performed differently.
Explain the differences you saw, and expected to see, between the performances and policies of your Sarsa and Q-Learning agents.
What could be done to improve the performance of your agents?

# Please write your code for Exercise 4 in this cell or in as many cells as you want ABOVE this cell.
# You should plot your combined graph here, clearly showing each of the average learning curves of your three agents.
# Do NOT delete this cell.

Exercise 5: Modified Q-Learning Agent (16 Marks)
Exercise 5a: Implementation
In this exercise, you will implement an agent which learns to reach a goal state in the racetrack task using the Q-Learning algorithm, the pseudocode for which is reproduced below (Reinforcement Learning, Sutton & Barto, 2018, Section 6.5 p.131).
In order to score high marks in this exercise, you will need to extend your solution beyond a simple Q-Learning agent to achieve more efficient learning (i.e., using fewer interactions with the environment). Ideas for improving your agent will have been discussed in lectures, and can be found in the course textbook (Reinforcement Learning, Sutton & Barto, 2018). However you go about improving your agent, it must still use tabular Q-Learning at its core.
A tabular Q-Learning agent, with whatever modifications you believe are reasonable in order to achieve better performance in the Racetrack domain.
An average learning curve. Your learning curve should plot the mean undiscounted return from many agents as a function of episodes. Please specify how many agents' performances you are averaging in the title of your plot. This should be a dynamic figure that we can produce by running your code. If you wish to use any kind of graph smoothing, please ensure that your smoothing method does not cause artifacts near the edges of the plot.
Please use the following parameter settings:
Number of training episodes $= 150$.
Number of agents averaged should be at least 20.
You may adjust all other parameters as you see fit.
If you use incorrect parameters, you may not get any credit for your work.
Exercise 5b: Comparison & Discussion
Please produce a plot which shows the performance of your original Q-Learning agent and your
modified Q-Learning agent. Be sure to include plot elements such as axis labels, titles, and a
legend, so that it is clear which data series corresponds to the performance of which agent. If we
are not able to easily interpret your plots, you may not get any credit for your work.
Please note that you should not re-train your agents from scratch. You should re-use your
results from the previous exercises.
To improve the visual clarity of your graphs, you may wish to apply some kind of cropping or
smoothing. If you choose to do this, please also include an un-altered version of your graph.
Please ensure that any graph smoothing method you use does not cause artifacts near the edges
of the plot.
In eight sentences or fewer, please discuss the following:
The modifications that you have made to your agent beyond implementing basic Q-Learning.
Whether your expectations about how these modifications would affect your agent's performance were met.
Further modifications you believe may enhance the performance of your agent, or changes you would make if you had more time.
Please note that your implementation and discussion will be assessed jointly. This means
that, in order to score highly, you will need to correctly implement appropriate modifications to
your agent AND discuss them well.

# Please write your code for Exercise 5a in this cell or in as many cells as you
want ABOVE this cell.
# You should implement your modified Q-Learning agent and plot your average learning curve here.
# Do NOT delete this cell.
# Please write your code for Exercise 5b in this cell or in as many cells as you
want ABOVE this cell.
# You should plot your combined graph here, clearly showing the average learning curves of your original and modified Q-Learning agents.
# Do NOT delete this cell.