Advanced Machine Learning Assignment — 2020
Submission
Please submit your solution electronically via vUWS. Submit a report as
PDF and your
code zipped into one file, and please include the signed and completed
cover sheet
that you can find at the end of the document.
Submission is due on 15 Oct 2020, 11:59pm.
Minipong
Figure 1: 4 frames from our data, using +1-valued pixels for the +, and -1 for the paddle.
In this assignment we work with data and a simulation of a simple version of "pong".
Two objects appear on the field: a + object as "ball", and a paddle that can occupy
different positions, but only in the bottom row. Pixels of the two objects are represented
with different values (+1 for the +, -1 for the paddle), while background pixels have the
value 0. The two markers at the top corners are fixed (-1 and +1, respectively) and
appear in every frame.
Preparation Download the minipong.py and sprites.py Python files. The
file minipong.py implements the pong game simulation (the Minipong class). Running
sprites.py will create datasets of pong screenshots for your first task.
A new pong game can be created like this:
from minipong import Minipong
pong = Minipong(level=1, size=5)
Here, level sets the information an RL agent gets from the environment, and size sets
the size of the game (in number of different paddle positions). Both the paddle and the + are 3
paddle and + are 3
pixels wide, and cannot leave the field. A game of size 5 is (15×15)
pixels, and the ball
x- and y-coordinates can be values between 1 and 13. The paddle can be
in 5 different
locations (from 0 to 4).
Task 1: Train a CNN to predict object positions
15 points
The python program sprites.py creates a training and test set of
“minipong” scenes,
trainingpix.csv (676 samples) and testingpix.csv (169 samples). Each
row represents a 15×15 screenshot (flattened in row-major order). Labels
appear in separate files, traininglabels.csv and testlabels.csv. These contain 3 labels
for each example (x, y, z): the x- and y-coordinates of the + marker, with values between 1
and 13, and z between 0 and 4 for the location of the paddle.
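For illustration, the data could be loaded and reshaped for a PyTorch CNN roughly like this (a sketch only; it assumes comma-separated values with one flattened screenshot per row, as described above):

import numpy as np
import torch

# Load the flattened 15x15 screenshots and the (x, y, z) labels.
X_train = np.loadtxt("trainingpix.csv", delimiter=",")
y_train = np.loadtxt("traininglabels.csv", delimiter=",")

# Reshape each row into a single-channel 15x15 image: (N, 1, 15, 15).
X_train = torch.tensor(X_train, dtype=torch.float32).reshape(-1, 1, 15, 15)
y_train = torch.tensor(y_train, dtype=torch.float32)

print(X_train.shape, y_train.shape)  # expected: (676, 1, 15, 15) and (676, 3)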
Steps
1. Create the datasets by running the sprites.py code.
2. Create a CNN that predicts the x-coordinate of the + marker.
• You can (but don’t have to) use an architecture similar to the one we used
for classifying MNIST, but be aware that the input dimensions and outputs are
different, so you will have to make at least some changes.
• You can normalise/standardise the data if it helps improve the
training.
3. Create a CNN that predicts all three outputs (x/y/z) from each input.[1] A sketch
of one possible network follows after this list.
• Compute the accuracy on the test data set.
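For illustration, one possible three-output architecture is sketched below. The layer sizes and the choice of a regression head are assumptions, not requirements; any reasonable CNN is acceptable.

import torch.nn as nn

class PongCNN(nn.Module):
    """Small CNN mapping a 1x15x15 screenshot to three values (x, y, z)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # -> 16 x 15 x 15
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 16 x 7 x 7
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # -> 32 x 7 x 7
            nn.ReLU(),
            nn.MaxPool2d(2),                              # -> 32 x 3 x 3
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 3 * 3, 64),
            nn.ReLU(),
            nn.Linear(64, 3),  # one output each for x, y and z
        )

    def forward(self, x):
        return self.head(self.features(x))

With this regression formulation, nn.MSELoss() on the (x, y, z) targets is one natural choice; alternatively, each output can be treated as a classification over the possible positions.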
What to submit:
• Submit the python code of your solutions (two versions).
• For your report, write a brief description of your steps to create the
models and
your prediction. What did you do? Please also include answers to the
following
questions:
– What loss did you use, and why? What is your loss for the second model?
– For how long did you train your model (number of epochs, time taken)?
What is the performance on the test set?
• For all solutions: the way you try to solve the tasks and your description is more
important than the absolute performance of your code. If things do not work as you
hope, submit your steps and describe what the specific problem is.
[1] As an intermediate step, consider 3 separate networks, one for each
output. Then try to merge these into one network with 3 outputs.
Task 2: Train a convolutional autoencoder
10 points
Instead of predicting positions, create a convolutional autoencoder that compresses the
pong screenshots to a small number of bytes (the encoder), and transforms them back
to the original (the decoder).
Steps
1. Create an (undercomplete) convolutional autoencoder and train it using
the training data set from the first task. A minimal architecture sketch follows
after this list.
2. You can choose the architecture of the network and size of the
representation
h = f(x). The goal is to learn a representation that is smaller than the
original,
and still leads to recognisable reconstructions of the original.
3. For the encoder you can use the same architecture that you used for
the first task,
but you cannot use the labels for training. You can also create a
completely different architecture.
4. (No programming): In theory, what would be the absolute minimal size
of the
hidden layer representation that allows perfect reconstruction of the
original image?
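For illustration, a minimal undercomplete convolutional autoencoder could look like the sketch below. The layer sizes, the size of the code, and the use of ConvTranspose2d are assumptions; any architecture that meets the goals above is acceptable.

import torch.nn as nn

class PongAutoencoder(nn.Module):
    def __init__(self, code_size=8):
        super().__init__()
        # Encoder: 1x15x15 image -> small code vector h = f(x).
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1),   # -> 8 x 8 x 8
            nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2, padding=1),  # -> 16 x 4 x 4
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, code_size),
        )
        # Decoder: code vector -> reconstructed 1x15x15 image.
        self.decoder = nn.Sequential(
            nn.Linear(code_size, 16 * 4 * 4),
            nn.ReLU(),
            nn.Unflatten(1, (16, 4, 4)),
            nn.ConvTranspose2d(16, 8, kernel_size=3, stride=2,
                               padding=1, output_padding=1),        # -> 8 x 8 x 8
            nn.ReLU(),
            nn.ConvTranspose2d(8, 1, kernel_size=3, stride=2,
                               padding=1),                           # -> 1 x 15 x 15
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

Training would then minimise a reconstruction loss such as nn.MSELoss() between the input image and its reconstruction, using only the images (no labels).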
What to submit:
• Submit the python code of your solution.
• For your report, write a brief description of your steps to create the
models and
your prediction. What did you do (e.g., what loss function, how big is
the encoded
image in your architecture, how many steps did the learning take)?
• Include screenshots of 1-2 output images next to the original inputs
(e.g., select a
good and a bad example).
Task 3: Create a RL agent for Minipong (level 1)
15 points
The code in minipong.py provides an environment to create an agent that
can be
trained with reinforcement learning (a complete description is given at the end
of this sheet). It
uses the objects as described above. The following is a description of
the environment
dynamics:
• The + marker moves a diagonal step at each step of the environment.
When it
hits the paddle or a wall (on the top, left, or right) it reflects.
• The agent can control the paddle by moving it one 3-pixel slot every step.
The agent has three actions available: it can choose to do nothing, or it can move
the paddle to the left or right. The paddle cannot be moved outside the
boundaries.
• The agent will receive a positive reward when the + reflects from the
paddle. In
this case, the + may also move by 1 or 2 random pixels to the left or
right.
• An episode is finished when + reaches the bottom row without
reflecting from
the paddle.
In a level 1 version of the game, the observed state (the information made available to
the agent after each step) consists of one number: dz. It is the position of the +
relative to the centre of the paddle: a negative number if the + is on one side, a positive
one if it is on the other.
For this task, you can initialise pong like this:
pong = Minipong(level=1, size=5)
or like this:
pong = Minipong(level=1, size=5, normalise=False)
In the first version, step() returns normalised values of dz (values between -1 and 1)
for the state, while in the second version it returns pixel differences (-13 ... 13).
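As a quick check of the difference, the two variants can be compared like this (a sketch; reset() is one of the functions listed at the end of this sheet):

from minipong import Minipong

pong_norm = Minipong(level=1, size=5)                   # dz normalised to -1 ... 1
pong_raw = Minipong(level=1, size=5, normalise=False)   # dz as a pixel difference, -13 ... 13

print(pong_norm.reset())   # a single normalised dz value
print(pong_raw.reset())    # the same quantity in pixels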
Steps
1. Manually create a policy (no RL) that successfully plays pong, just
selecting actions based on the state information. The minipong.py code
contains a template that you can use and modify.
2. Create a (tabular or deep) TD agent that learns to play pong. For choosing actions
with ε-greedy action selection, set ε = 1 initially, and reduce it during your
training to a minimum of 0.1. A minimal sketch of such an agent follows after
this list.
3. Run your training, resetting after every episode. Store the sum of
rewards. After
or during the training, plot the total sum of rewards per episode. This
plot — the
Training Reward plot — indicates the extent to which your agent is
learning to
improve its cumulative reward. It is your decision when to stop
training. It is not
required to submit a perfectly performing agent, but show how it learns.
4. Once you consider training complete, run 50 test episodes using your
trained policy, but with ε = 0.0 for all 50 episodes. Again, reset the environment
at the beginning of each episode. Calculate the average of the
sum-of-rewards-per-episode (call this the Test-Average), and the
standard deviation (the Test-StandardDeviation). These values indicate
how your trained agent performs.
5. If you had initialised pong with pong = Minipong(level=2, size=5),
the observed state would consist of 2 values: the ball y-coordinate, and
the relative
+- position dz from level 1. Will this additional information help or
hurt the
learning? (No programming required).
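For illustration, a minimal tabular Q-learning (TD) sketch with ε-greedy action selection and a decaying ε is shown below. It assumes the unnormalised level 1 environment (so dz is an integer between -13 and 13) and that the three actions are encoded as 0, 1, 2; the learning rate, discount factor, decay schedule, and episode count are assumed values, not requirements.

import numpy as np
from minipong import Minipong

pong = Minipong(level=1, size=5, normalise=False)

n_actions = 3                     # assumed encoding: 0 = stay, 1/2 = move left/right
Q = np.zeros((27, n_actions))     # one row per dz value (-13 ... 13)
alpha, gamma = 0.1, 0.95          # learning rate and discount (assumed values)
epsilon, eps_min, eps_decay = 1.0, 0.1, 0.995

def idx(dz):
    """Map a pixel difference dz in [-13, 13] to a table index in [0, 26]."""
    return int(dz) + 13

episode_rewards = []
for episode in range(2000):       # assumed number of episodes
    s = pong.reset()
    done, total = False, 0
    while not done:
        if np.random.rand() < epsilon:
            a = np.random.randint(n_actions)   # explore
        else:
            a = int(np.argmax(Q[idx(s)]))      # exploit
        s2, r, done = pong.step(a)
        # TD (Q-learning) update towards the bootstrapped target.
        target = r + (0 if done else gamma * np.max(Q[idx(s2)]))
        Q[idx(s), a] += alpha * (target - Q[idx(s), a])
        s, total = s2, total + r
    episode_rewards.append(total)                # data for the Training Reward plot
    epsilon = max(eps_min, epsilon * eps_decay)  # decay epsilon towards 0.1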
What to submit:
• Submit the python code of your solutions (both the manual strategy,
and the code
of your RL learner).
• For your report, describe the solution, mention the Test-Average and
Test-StandardDeviation, and include the Training Reward plot described
above. After how
many episodes did you decide to stop training, and how long did it take?
• Please don’t forget to include your answer to the question about the level 2
version.
Task 4: Create a RL agent for Minipong (level 3)
10 points
In a level 3 version of the game, the observed state (the information made available to
the agent after each step) consists of three numbers: y, dx, dz. These are y, the ball
y-coordinate; dx, the change in the ball x-coordinate from the last step to now; and dz
(the same as in previous levels).
For this task, you can initialise pong in two ways:
pong = Minipong(level=3, size=5)
pong = Minipong(level=3, size=5, normalise=False)
In the first version, step() returns normalised values of y and dz (values between
-1 and 1), while in the second version these values are unnormalised. The dx values are
dx values are
always unnormalised (but should be -1 or 1 in most cases, except after
the paddle has
been hit).
Steps
1. Create a (neural-network-based) RL agent that finds a policy using (all) level 3
state information. Use a discount factor γ = 0.95. A minimal network sketch
follows after this list.
2. You can choose the algorithm (deep TD or deep policy gradient).
3. Try to train an agent that achieves a running reward > 300 (the
minipong.py
file has an example for how to calculate this).
4. Don’t go overboard with the number of hidden layers as this will
significantly
increase training time. Try one hidden layer.
5. Write a description explaining how your approach works, and how it
performs. If
some (or all) of your attempts are unsuccessful, also describe some of
the things
that did not work, and which changes made a difference.
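For illustration, the function approximator for a deep TD agent could be as small as the sketch below (one hidden layer; the hidden size and the choice of a Q-network rather than a policy network are assumptions):

import torch.nn as nn

class QNet(nn.Module):
    """Maps the level 3 state (y, dx, dz) to one Q-value per action."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden),   # 3 state inputs: y, dx, dz
            nn.ReLU(),
            nn.Linear(hidden, 3),   # 3 actions
        )

    def forward(self, state):
        return self.net(state)

Training then follows the usual TD target with γ = 0.95, using, for example, the Adam learning rates suggested in the Tips section.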
What to submit:
• Submit the python code of your solutions.
• For your report, describe the solution, mention the Test-Average and
Test-StandardDeviation, and include the Training Reward plot described
above.
Tips
1. For the RL-tasks, it often takes some time until the learning picks
up, but they
should not take hours. If the agent doesn’t learn, explore different
learning rates.
For Adam, try values between 5e-3 (faster) and 1e-4 (slower).
2. Even if the learning does not work, remember that we would like to
see that you
understood the ideas behind the code. So describe the ideas that you
tried, and
still submit your code but say what the problem was.
Bonus questions
I can do it Neural Networks: Train a neural network that predicts the dz
variable.
Bring it on Pong level 3: Modify the learning or the reward from the
environment so
the agent avoids moving the paddle unnecessarily. Compare the learned
policies.
Hardcore Train an autoencoder where you can use the encoded image as
input to an
RL agent that successfully plays pong.
Nightmare Solve Minipong(level=0) in PyTorch. In this level, the state is a
difference image (pixels) between the current state and the previous
state. Check
http://karpathy.github.io/2016/05/31/rl/ for tips.
Minipong.py
If you put the minipong.py file into your working directory, you can
import the class
like this:
from minipong import Minipong
big = 7
pong = Minipong(level=1, size=big)
The Minipong class has several functions that you will have to use. The
file contains
an example and an explanation for many of the functions (check it out),
but here is a
brief list:
pong = Minipong(level=level, size=size)
n = pong.observationspace()
state = pong.state()
state = pong.transition(action)
done = pong.terminal()
r = pong.reward()
state, r, done = pong.step(action)
state = pong.reset()
action = pong.sampleaction()
pong.render(text=False, reward=r)
pix = pong.to_pix(pong.s1)
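For example, a random-action episode might look like this (a sketch using only the calls listed above):

from minipong import Minipong

pong = Minipong(level=1, size=5)
state = pong.reset()
done, total_reward = False, 0

while not done:
    action = pong.sampleaction()        # pick a random valid action
    state, r, done = pong.step(action)  # advance the environment one step
    total_reward += r
    pong.render(text=False, reward=r)   # optional visualisation

print("episode reward:", total_reward)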
You can ask or answer questions about how to use the files provided with
this assignment on Discord, as long as they are general Python / programming
questions, for example if the code provided does not work for you as expected.
You must not ask or answer questions about the machine learning content of this
assignment anywhere, including Discord. If in doubt, ask your friendly lecturers
or tutor first.
Assignment Cover Sheet
School of Computer, Data, and Mathematical Sciences
Student Name:
Student Number:
Unit Name and Number: 301119 Advanced Machine Learning
Title of Assignment: Assignment 1
Due Date: 15 Oct 2020
Date Submitted:
DECLARATION
I hold a copy of this assignment that I can produce if the original is
lost or damaged.
I hereby certify that no part of this assignment/product has been copied
from any
other student's work or from any other source except where due
acknowledgement
is made in the assignment. No part of this assignment/product has been
written/produced for me by another person except where such
collaboration has been authorised
by the subject lecturer/tutor concerned.
Signature: .......................................
(Note: An examiner or lecturer/tutor has the right not to mark this
assignment if the
above declaration has not been signed)
            Task 1   Task 2   Task 3   Task 4   Bonus   Total
Mark
Possible      15       10       15       10       ?       50
The maximum points possible for this assignment is 50 (including any
bonus points).