xuebaunion@vip.163.com

3551 Trousdale Rkwy, University Park, Los Angeles, CA

留学生论文指导和课程辅导

无忧GPA：https://www.essaygpa.com

工作时间：全年无休-早上8点到凌晨3点

微信客服：xiaoxionga100

微信客服：ITCS521

Java代写 - Reinforcement learning

时间：2020-10-26

1 General idea
In this assignment, you will write a program that uses reinforcement learning to learn the correct
policy in a Markov Decision Process.
The input to the program is a Markov Decision Process. Rather than solve this, however, the
program will learn what to do through experimentation. It will play many rounds. In each round it
starts in a random state. It then goes through the process: at each state it chooses an action, and
then the stochastic process puts it in a new state. When it reaches a terminal state, that round is
over. It now records, for each state and choice of action that it went through, how well that choice
of action worked out for it.
Initially it chooses the action at random, but as it plays more and more round and gets more and
more experience, it increasingly chooses the actions that in the past have paid o↵ for it.
Conceptually, there is an agent and a world model. The world model randomly chooses a starting
state and executes the Markov decision process each time the agent chooses an action. The agent
chooses actions and gradually learns what actions are best. You are not required to implement these
as separate modules, though you might wish to.
2 Markov decision process
There are n non-terminal states, numbered 0 to n!1 and t terminal states, numbered n to n+t!1.
In each state, there is at least one action, numbered consecutively from 0.
The transition probabilities for any given action in any given situation are given in the program
input.
In each round, the agent starts at a random starting state. The agent repeatedly chooses an action
according to a formula discussed in section 6, and the world model moves them to a new state, until
the agent reaches a terminal state, when the round ends. When the round ends, the agent updates
the information about the states traversed and the actions chosen as discussed below in section 4.
3 Input
The input is from a text file or standard input.
The first line consists of five parameters:
1. The number of non-terminal states.
2. The number of terminal states.
3. The number of rounds to play,
4. The frequency with which to print output (section 5).
1
5. The parameter M (an integer) controlling the “explore/exploit” trade-o↵ discussed in section 6
The second line specifies the rewards in the terminal states, alternating state number with its reward
(an integer).
The remaining line specify the transition probabilities for actions in non-terminal states. Each line
begins with the number of a state, colon, the number of a action followed by alternating a number
of a state number and the probability of transitioning to that state.
3.1 Example Input
For instance, here is the input for the simple process above (similar to the one in problem set 5, but
with numbers on the state, and with di↵erent rewards):
2 2 3 2 10
2436
0:0 1 0.1 2 0.4 3 0.5
0:1 1 0.6 2 0.1 3 0.3
1:0 0 0.8 2 0.1 3 0.1
1:1 0 0.1 2 0.4 3 0.5
You may assume that the input is correctly formatted and describes a valid MDP. You may also
assume that from any starting state, any policy has probability 1 of eventually reaching a terminal
state, so you don’t have to worry about a round going into an infinite loop.
Self-loops are possible; that is, there may be an action that has some chance of leaving the agent in
the same state.
2
4 Learning
Set up a data structure that records for each non-terminal state S and action A: • Count[S,A] The number of rounds so far that the agent has been in state S and chosen A. • Total[S,A] The total reward of those rounds.
During each round, keep a data structure that tracks the states that have been encountered and
the actions taken. (Do not count duplicates.) At the end of the round, for each state-action S,A
encountered, increment Count[S,A] by 1 and increase Total[S,A] by the reward for the round.
For instance, suppose that we play three rounds.
The first round is the sequence 0:1, 1:0, 3 (that is, the agent starts at state 0; it take action 1;
that takes it to 1; it takes action 0; that takes it to the terminal state 3 where it gets a reward of 6.)
Count[0,1] and Count[1,0] will be set to 1. Total[0,1] and Total[1,0] will be set to 6.
The second round is the sequence 1:1, 0:1, 1:1, 2 Count[1,1] will be incremented by 1 (not 2,
even though it appears twice in the round. Count[0,1] will be incremented by 1. Total[0,1] and
Total[1,1] will be increased by 4, the payo↵ for the round.
The third round is the sequence 1:0, 0:1, 1:0, 0:0, 3 Count[1,0], Count[0,1], and Count[0,0]
will be incremented by 1. Total[1,0], Total[0,1], and Total[0,0] will be increased by 6.
So at the end the three rounds,
Count[0,0]=1; Count[0,1]=3; Count[1,0]=2; Count[1,1]=1.
Total[0,0]=6; Total[0,1]=16; Total[1,0]=12; Total[1,1]=4.
5 Output
The amount of output is determined by the frequency parameter v in the input. If v = 0, then the
output should be printed only after all the rounds are complete; otherwise, output should be printed
after every v rounds.
When output is printed, it should show:
The current values of the Count and Total matrices. The best move in each state S, as determined
by the largest value of Total[S,A] / Count[S,A]. If Count[S,A] is 0 for any action A, output U,
for unknown.
5.1 Example Output
In the example above, the output would be:
After 2 rounds
Count:
[0,0]=0. [0,1]=2.
[1,0]=1. [1,1]=1.
Total:
[0,0]=0. [0,1]=10.
3
[1,0]=6. [1,1]=4.
Best action: 0:U. 1:0.
After 3 rounds
Count:
[0,0]=1. [0,1]=3.
[1,0]=2. [1,1]=1.
Total:
[0,0]=6. [0,1]=16.
[1,0]=12. [1,1]=4.
Best action: 0:0. 1:0.
6 Choosing the Action
When one is combining actions with learning in an unknown environment, there is generally an
“explore/exploit” trade-o↵ to consider. That is, how much e↵ort should you spend exploring all the
possibilities, versus exploiting actions that have been found to be successful. Generally, when you
are starting in a new environment, you want to favor exploring; as you get to know the environment,
you increasingly favor exploiting it. For instance, if you move to a new city and enjoy eating out,
then at first, you try lots of possibilities; after you’ve been there for a while, you increasingly go to
old favorites.
In this assignment, the way in which the action is chosen in a state goes from exploring (choosing
an action at random) to exploiting (choosing the best action) gradually, at a speed controlled by
the hyperparameter M; the larger M, the longer it takes to change from one regime to the other.
Your program should choose the action according to the following algorithm:
chooseAction(State s; int[][] count; int[][] total; int M) { % M is the hyperparameter in the input;
n = number of actions feasible in s;
if (there are any actions u such that count[a,u]==0)
return u; % choose an untried action arbitrarily;
for (i=0 to n-1) avg[i] = total[s,i]/count[s,i]; % average reward
bottom = the smallest reward of any terminal node;
top = the largest reward of any terminal node;
for (i=0 to n-1) savg(i) = 0.25+0.75*(avg[i]-bottom)/(top-bottom)
% average scaled to range [0.25,1];
c = Pn!1 i=0 count[s,i];
for (i=0 to n-1) up[i] = savg[i]**(c/M) % unnormalized probability
norm = Pn!1 i=0 up[i];
for (i=0 to n-1) p[i] = up[i]/norm;
return (randomly choose action i with probability p[i]);
}
So, for instance suppose that the program has gone through the three rounds in the above example,
so that
Count[0,0]=1; Count[0,1]=3; Count[1,0]=2; Count[1,1]=1.
Total[0,0]=6; Total[0,1]=16; Total[1,0]=12; Total[1,1]=4.
4
Let M=10. You are now playing a fourth round and the current state is 0. Then:
avg=[6,5.3333]. top=6. bottom=4. savg=[1,0.75].
c=4. up=[10.4, 0.750.4] = [1.0, 0.89]. p = [0.53, 0.47].
So there is a 53% chance of choosing action 0 and a 47% chance of choosing action 1.
The details of the above are not particularly significant, or especially well adapted to this problem;
in fact, I pulled them out of a hat. But the key points are this:
• In the early rounds, c will be much less than M so the exponent c/M will be much less than 1.
When we raise the numbers in savg, which are all values between 0.25 and 1, to the power
c/M, they will all be close to 1. So you are choosing nearly randomly.
• In the late rounds, c will be much greater than M so the exponent c/M will be much greater
than 1. If i is the best move and j is the value of the second-best move, then up[i]/up[j] will
be (savg[ii]/savg[j])**(c/M), which will be very large unless j is nearly as good as i. So with
high probability, we choose the best move, or some move that is almost as good as the best
move (as compared to the range in reward values at the terminal nodes).
Note. If c is much greater than M, then you run into the danger of underflow. If you want to be
really fancy, you can program up an arithmetic exception handler, and change any value that is in
underflow to 0 in up. Or you can write code to catch these some other way. But you are not required
to bother with this; you can assume that the input parameters will be chosen so that it will not run
into underflow.
7 Choosing randomly from a distribution
If you are coding in Python, you can use the method random.choices() described here: https:
//docs.python.org/3/library/random.html
In other languages: If you are given a distribution p0 ...pk, then the way to sample from the numbers
0 ...k with that distribution is to execute the following code: (The function rand() here generates a
random floating point number uniformly distributed between 0 and 1. Any half-decent mathematical
library will supply this with some name.)
u0 = p0
for (i = 1 ...k) ui = ui!1 + pi. x = rand(); % Uniformly distributed between 0 and 1.
for (i = 0 ...k ! 1)
if x