Reinforcement Learning - Part 1
Blake VanBerlo
Lecture 20
Readings: RN 22.1 - 22.3. PM 12.1, 12.5, 12.8.
CS 486/686: Intro to Artificial Intelligence. Slides by: Alice Gao.
Outline
Learning Goals
Introduction to Reinforcement Learning
Passive Adaptive Dynamic Programming
Active Adaptive Dynamic Programming
Revisiting the Learning Goals
Learning Goals
By the end of the lecture, you should be able to
▶ Trace and implement the passive adaptive dynamic
programming algorithm.
▶ Explain the trade-off between exploration and exploitation.
▶ Trace and implement the active adaptive dynamic
programming algorithm.
Introduction to Reinforcement Learning
A Reinforcement Learning Agent
Let’s consider fully observable, single-agent reinforcement learning.
We will formalize this problem as a Markov decision process. The agent:
▶ is given the set of possible states S and the set of actions A,
▶ observes the current state and the reward received,
▶ carries out an action, and
▶ aims to maximize its discounted reward (i.e., the return).
Why is reinforcement learning challenging?
▶ Which action was responsible for this reward/punishment?
▶ How will this action impact my utility?
▶ Should I explore or exploit?
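To make this setup concrete, here is a minimal sketch of the agent-environment interaction loop in Python. The env and agent interfaces (reset, step, choose_action, observe) are illustrative assumptions, not something defined in the slides.

    # Minimal episodic interaction loop (illustrative interfaces only).
    # env.reset() -> initial state; env.step(a) -> (next_state, reward, done).
    def run_episode(env, agent, gamma=0.9):
        s = env.reset()
        ret, discount = 0.0, 1.0
        done = False
        while not done:
            a = agent.choose_action(s)        # exploit or explore
            s_next, r, done = env.step(a)     # environment transition
            agent.observe(s, a, s_next, r)    # learn from the experience
            ret += discount * r               # accumulate the discounted return
            discount *= gamma
            s = s_next
        return ret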
Passive Adaptive Dynamic Programming
Passive Learning Agent
▶ Fix a policy π.
▶ Goal is to learn V^π(s), the expected value of following π from state s.
▶ Similar to policy evaluation.
▶ Does not know the transition model P(s′|s, a) or the reward function R(s).
▶ Solution: Adaptive Dynamic Programming
▶ Learn P(s′|s, a) and R(s) from the observed transitions and rewards.
▶ Learn V^π(s) by solving the Bellman equations (exactly or iteratively):

V^π(s) = R(s) + γ ∑_{s′} P(s′|s, π(s)) V^π(s′)

▶ A model-based approach: it uses a learned model of the environment.
Passive ADP Algorithm
1. Repeat steps 2 to 5.
2. Follow policy π and generate an experience ⟨s, a, s′, r′⟩.
3. Update the reward function: R(s′) ← r′.
4. Update the transition probability.
   N(s, a) = N(s, a) + 1
   N(s, a, s′) = N(s, a, s′) + 1
   P(s′|s, a) = N(s, a, s′)/N(s, a)
5. Derive V^π(s) using the Bellman equations.
   V^π(s) = R(s) + γ ∑_{s′} P(s′|s, π(s)) V^π(s′)
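The algorithm above can be sketched in Python as follows. This is a minimal illustration, not the course's reference implementation; the policy is assumed to be a dict mapping each non-terminal state to an action, and states are assumed to be hashable.

    from collections import defaultdict

    # Sketch of a passive ADP learner for a fixed policy pi.
    class PassiveADP:
        def __init__(self, pi, gamma=0.9):
            self.pi = pi                   # dict: state -> action
            self.gamma = gamma
            self.R = {}                    # learned reward function R(s)
            self.N_sa = defaultdict(int)   # N(s, a)
            self.N_sas = defaultdict(int)  # N(s, a, s')
            self.V = {}                    # current estimate of V^pi(s)

        def observe(self, s, a, s_next, r_next):
            # Steps 3-4: record the reward and update the transition counts.
            self.R[s_next] = r_next
            self.N_sa[(s, a)] += 1
            self.N_sas[(s, a, s_next)] += 1

        def P(self, s_next, s, a):
            # Maximum-likelihood estimate P(s'|s, a) = N(s, a, s') / N(s, a).
            n = self.N_sa[(s, a)]
            return self.N_sas[(s, a, s_next)] / n if n else 0.0

        def evaluate(self, states, iterations=100):
            # Step 5: solve the Bellman equations by iterative policy evaluation.
            # Terminal states (not in pi) have no outgoing counts, so V(s) = R(s).
            for s in states:
                self.V.setdefault(s, 0.0)
            for _ in range(iterations):
                for s in states:
                    a = self.pi.get(s)
                    self.V[s] = self.R.get(s, 0.0) + self.gamma * sum(
                        self.P(s2, s, a) * self.V[s2] for s2 in states)
            return self.V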
Passive ADP Example
[Figure: 2×2 grid world. Top row: s11 and s12 (reward +1); bottom row: s21 and s22 (reward −1).]
▶ π(s11) = down, π(s21) = right
▶ γ = 0.9
▶ R(s11) = −0.04, R(s21) = −0.04, R(s12) = 1, R(s22) = −1
▶ N(s, a) = 5, ∀s, a.
▶ N(s, a, s′) = 3 for the intended direction.
▶ N(s, a, s′) = 1 for a direction to the left or right of the
intended direction.
▶ The current state is s11.
Passive ADP Example continued
[Figure: 2×2 grid world. Top row: s11 and s12 (reward +1); bottom row: s21 and s22 (reward −1).]
1. No need to update the reward function.
2. Update the counts.
N(s11, down) = 6 and N(s11, down, s21) = 4.
3. Solve the Bellman equations (substituting the terminal values V(s12) = 1 and V(s22) = −1):
   V(s11) = −0.04 + 0.9(0.667 V(s21) + 0.167(1) + 0.167 V(s11))
   V(s21) = −0.04 + 0.9(0.6(−1) + 0.2 V(s11) + 0.2 V(s21))
   The solutions are V(s11) = −0.4378 and V(s21) = −0.8034.
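As a sanity check, the two linear equations above can be solved directly. A short numpy sketch (using the rounded transition probabilities from the slide):

    import numpy as np

    # Rearranged into A @ [V11, V21] = b:
    #   (1 - 0.9*0.167) V11 - 0.9*0.667 V21 = -0.04 + 0.9*0.167*1
    #   -0.9*0.2 V11 + (1 - 0.9*0.2) V21    = -0.04 + 0.9*0.6*(-1)
    A = np.array([[1 - 0.9 * 0.167, -0.9 * 0.667],
                  [-0.9 * 0.2,       1 - 0.9 * 0.2]])
    b = np.array([-0.04 + 0.9 * 0.167 * 1,
                  -0.04 + 0.9 * 0.6 * (-1)])
    V11, V21 = np.linalg.solve(A, b)
    print(V11, V21)   # approximately -0.4378 and -0.8034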
Q: Updating the transition probabilities
Q #1: Suppose the agent is in state s23. Presently,
N(s23, down) = 23 and N(s23, down, s24) = 2.
The agent tries to move down, but accidentally moves right. After this experience, what is P(s24|s23, down)?

[Figure: 3×4 grid world with rows 1-3 and columns 1-4; row 2 contains a blocked cell (X) and the −1 state, and row 3 contains the +1 state.]
(A) 0.087
(B) 0.1
(C) 0.125
(D) 0.13
(E) 0.667
→ Correct answer is (C). After the update, N(s23, down) = 24 and N(s23, down, s24) = 3, so P(s24|s23, down) = 3/24 = 0.125.
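A quick check of the count update in plain Python (values taken from the question):

    n_sa, n_sas = 23, 2                  # N(s23, down), N(s23, down, s24) before the move
    n_sa, n_sas = n_sa + 1, n_sas + 1    # the observed transition updates both counts
    print(n_sas / n_sa)                  # 0.125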
Active Adaptive Dynamic Programming
Active ADP
The passive ADP agent learns the expected value of a fixed policy.
What action should the agent take at each step?
Two things are useful for the agent to do:
1. Exploit: take the action that looks best under the current estimate of V(s).
2. Explore: take an action other than the currently optimal one.
→ Actions serve two purposes: (1) they provide rewards now, and (2) they gather more data, which leads to a better model and long-term benefits.
The greedy agent seldom converges to the optimal policy and
sometimes converges to horrible policies because the learned model
is not the same as the true environment.
There is a trade-off between exploitation and exploration.
→ Check out Multi-Armed Bandit Problems
Tradeoff Between Exploitation and Exploration
1. ϵ-greedy exploration strategy:
▶ Select random action with probability ϵ, and
▶ Select the best action with probability 1− ϵ.
▶ We can decrease ϵ over time.
2. Softmax selection using Gibbs/Boltzmann distribution.
▶ Choose action a with probability e^{Q(s,a)/T} / ∑_{a′} e^{Q(s,a′)/T}.
▶ T > 0 is the temperature. When T is high, the distribution is
close to uniform. When T is low, the higher-valued actions
have higher probabilities.
3. Initialize the values optimistically to encourage exploration.
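The first two strategies above can be sketched in Python as follows. The value estimates are assumed to be a dict mapping each action to its current estimated value for the state; the function names are illustrative.

    import math
    import random

    def epsilon_greedy(values, epsilon=0.1):
        # With probability epsilon pick a random action, otherwise the best one.
        if random.random() < epsilon:
            return random.choice(list(values))
        return max(values, key=values.get)

    def softmax_selection(values, temperature=1.0):
        # Gibbs/Boltzmann distribution: P(a) is proportional to exp(value(a)/T).
        # High T -> nearly uniform; low T -> concentrates on high-valued actions.
        weights = [math.exp(values[a] / temperature) for a in values]
        return random.choices(list(values), weights=weights, k=1)[0]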
Optimistic Utility Values to Encourage Exploration
We will learn V+(s), the optimistic estimate of V(s).

V+(s) ← R(s) + γ max_a f( ∑_{s′} P(s′|s, a) V+(s′), N(s, a) )

f(u, n) = R+ if n < Ne, and f(u, n) = u otherwise.
f(u, n) trades off exploitation and exploration.
▶ R+ is the optimistic estimate of the best possible reward
obtainable in any state.
▶ If we haven’t visited (s, a) at least Ne times, assume its expected value is R+. → This makes the state-action pair look attractive, encouraging exploration initially.
▶ Otherwise, use the current utility estimate u (computed from V+).
→ f should be increasing in u and decreasing in n.
Active ADP Algorithm
1. Initialize R(s), V+(s), N(s, a), and N(s, a, s′).
2. Repeat steps 3 to 7 until each (s, a) has been visited at least Ne times and the V+(s) values have converged.
3. Determine the best action a for the current state s using V+(s):
   a = argmax_a f( ∑_{s′} P(s′|s, a) V+(s′), N(s, a) ),
   where f(u, n) = R+ if n < Ne and f(u, n) = u otherwise.
4. Take action a and generate an experience ⟨s, a, s′, r′⟩.
5. Update the reward function: R(s′) ← r′.
6. Update the transition probability.
   N(s, a) = N(s, a) + 1, N(s, a, s′) = N(s, a, s′) + 1
   P(s′|s, a) = N(s, a, s′)/N(s, a)
7. Update V+(s) using the Bellman updates.
   V+(s) ← R(s) + γ max_a f( ∑_{s′} P(s′|s, a) V+(s′), N(s, a) )
An Active ADP Example
[Figure: 2×2 grid world. Top row: s11 and s12 (reward +1); bottom row: s21 and s22 (reward −1).]
▶ π(s11) = down, π(s21) = right
▶ γ = 0.9
▶ Ne = 10, R+ = 5.
▶ R(s11) = −0.04, R(s21) = −0.04, R(s12) = 1, R(s22) = −1
▶ N(s, a) = 5, ∀s, a.
▶ N(s, a, s′) = 3 for the intended direction.
▶ N(s, a, s′) = 1 for any other direction with positive transition
probability.
The current estimates:
V+(s11) = −0.6573, V+(s21) = −0.9002
Revisiting the Learning Goals
By the end of the lecture, you should be able to
▶ Trace and implement the passive adaptive dynamic
programming algorithm.
▶ Explain the trade-off between exploration and exploitation.
▶ Trace and implement the active adaptive dynamic
programming algorithm.

