Reinforcement Learning for Intelligent Traffic Light Control
COMP9417 Machine Learning Project
Martin Yang Hu, z3463114
Introduction
As some of the most populous cities in today's world continue to grow, they will need to either expand their existing road networks or improve road infrastructure to accommodate more traffic. One of the most effective ways to alleviate congestion and accommodate more road users is intelligent traffic light control. Constant-phase control implementations may be effective enough when configured correctly for a given environment, but they are rarely optimal. More modern implementations may incorporate physical actuators, such as weight sensors or induction loops beneath road surfaces, which allow basic logic to increase the efficiency of the traffic light control. However, these physical installations are time-consuming and expensive to deploy, and they cannot dynamically adapt to changing environments or traffic situations. The purpose of this study is to design an intelligent traffic light control system and evaluate its performance in a simulated traffic environment. The specific learning method used in this study is the Q-learning algorithm with epsilon-greedy exploration.
Video of Simulation at https://youtu.be/pVjGYQ03euk
Related Work
Over recent years there has been consistent ongoing research into adaptive traffic control, especially using temporal difference reinforcement learning methods such as Q-learning. A study conducted by Steingrover demonstrated the practicality of deploying reinforcement learning traffic control systems in networks of sixteen intersections which share a common knowledge base (Steingrover 2005). When each local agent is provided with full, perfect information about the entire network and the other agents' decision processes, Q-learning has been shown to perform especially well. Many other recent studies have used simulations to examine the effectiveness of Q-learning and other reinforcement learning algorithms for traffic light control, but it is not clear whether these theoretical concepts can be deployed or tested in the real world any time soon (Abdulhai 2003). Mannion discusses the inherent difficulties associated with migrating these reinforcement learning traffic control systems from a virtual simulated world to the real physical world, along with other practical differences and concerns (Mannion 2015). Providing learning agents with perfect real-time information about the world around them is relatively simple within a simulation, but attempting to do the same in the real world with today's technology would require a complex array of sensors and other hardware systems that would be very costly to calibrate, operate and maintain.
Implementation
The traffic control simulation software used in this study was written from scratch in Python 3.5. The source code has been provided along with this study, and the only external module required to run the program is Pygame. The simulated environment consists of numerous four-way intersections with traffic moving in both directions in multiple lanes. Each intersection allows traffic to flow in either the horizontal or the vertical direction, but not both. Traffic can flow when the respective light signal is green, and is stopped when the light signal is red. When switching from green to red, an amber signal is temporarily shown to allow cars currently travelling within the intersection to complete their trip to the other side. The number of intersections and the number of lanes can both be changed in order to evaluate the performance of the reinforcement learning algorithm. The function which generates new cars for each lane is given by:
g(t) = [ t mod ( rand[0, 10) + 10 − i ) ] == 0
where i is the intensity factor ranging from 0 to 10, t is the timestep, and a new car is generated when g(t) evaluates to true.
For a video of the simulation in action, please go to https://youtu.be/pVjGYQ03euk.
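As a rough illustration, the following Python sketch shows one way this generating function could be implemented; the function name `should_generate_car` and the use of a modulo test are my own assumptions rather than the exact code provided with this study.

```python
import random

def should_generate_car(timestep, intensity):
    """Return True when a new car should be added to a lane.

    Illustrative sketch of the generating function described above:
    higher intensity -> smaller modulus -> more frequent car generation.
    """
    modulus = random.randrange(0, 10) + 10 - intensity  # rand[0, 10) + 10 - i
    if modulus <= 0:
        return True  # guard: at maximum intensity the modulus can reach 0
    return timestep % modulus == 0
```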
Q-learning is a model-free temporal difference control algorithm developed by Watkins in 1989, which attempts to learn 'quality' values Q(s, a) for each action a given state s within the state-space (Watkins 1989). By exploring states, actions, and rewards, it is able to estimate the true optimal action-selection function for any finite Markov decision process. It has been proven to converge, under certain conditions, to the optimal action-selection policy once each state-action pair (s, a) has been visited an infinite number of times. The one-step update of the implemented Q-learning algorithm is:
Q(s, a) = Q(s, a) + α ( r + γ max_a′ Q(s′, a′) − Q(s, a) )
where s is the current state, s′ is the future state reached by performing action a in the current state s, Q(s, a) is the value of taking action a in the current state s, max_a′ Q(s′, a′) is the maximum of the values of taking all possible actions a′ in the future state s′, α is the learning rate, γ is the discount factor, and r is the calculated reward.
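The following minimal Python sketch illustrates this one-step update; the helper name `q_update` is illustrative, and the default alpha and gamma simply reflect the best-performing values reported later in the Results section.

```python
from collections import defaultdict

# Q-table mapping (state, action) pairs to their estimated values, all starting at 0.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, actions, alpha=0.3, gamma=0.7):
    """One-step Q-learning update as given above.

    `actions` is the collection of actions available in `next_state`.
    """
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
```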
In this study, the Q-learning algorithm operates as follows. First, a table Q which represents all of the state-action values Q(s, a) is initialised with all values set to 0. At each iteration, to encourage exploration, the algorithm decides with probability ελ^t to completely randomise the action a which is chosen given the current state s. With probability 1 − ελ^t, the action with the highest corresponding Q value is chosen; if multiple actions share the highest Q value, one of them is chosen at random. Here ε is the initial probability to randomise or 'explore', λ is the probability decay factor slightly less than 1 which allows for increased exploitation of the acquired knowledge over time, and t is the current timestep. The chosen action is then performed and the reward r is measured. We then update Q(s, a) for the state we had just considered and the action we had just performed, using the update formula described above. We then use the new state s′ as the current state and repeat the process.
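A minimal sketch of this decayed epsilon-greedy selection is shown below; it reuses the Q table from the previous sketch, and the default epsilon and decay values match those used in the experiments, although the function name itself is an assumption.

```python
import random

def choose_action(state, actions, timestep, epsilon=0.1, decay=0.9998):
    """Epsilon-greedy selection with a decaying exploration probability.

    `decay` corresponds to lambda in the report; the exploration probability
    at timestep t is epsilon * decay**t.
    """
    if random.random() < epsilon * (decay ** timestep):
        return random.choice(actions)                   # explore
    best = max(Q[(state, a)] for a in actions)          # exploit
    ties = [a for a in actions if Q[(state, a)] == best]
    return random.choice(ties)                          # break ties randomly
```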
Each state within the Q-learning algorithm was defined using the following features (a sketch of the encoding is given after this list):
- Sum over all horizontal lanes heading East of the closest car position from the intersection in each lane (0-8, or 9 if the lane has no cars)
- Sum over all vertical lanes heading South of the closest car position from the intersection in each lane (0-8, or 9 if the lane has no cars)
- Sum over all horizontal lanes heading West of the closest car position from the intersection in each lane (0-8, or 9 if the lane has no cars)
- Sum over all vertical lanes heading North of the closest car position from the intersection in each lane (0-8, or 9 if the lane has no cars)
- Light setting (0 green horizontal, 1 green vertical, 2 amber)
- Delay setting (0 to 3 + the time needed for cars to cross from one side of the intersection to the other)
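The following sketch shows one way such a state tuple could be assembled; the argument names and data layout are illustrative assumptions rather than the actual structures used in the provided source code.

```python
def encode_state(lane_distances, light_setting, delay_setting):
    """Build the discrete state tuple described above for one intersection.

    `lane_distances` maps each direction ("east", "south", "west", "north")
    to a list of per-lane distances of the closest car from the intersection,
    already clipped to 0-8, with 9 meaning the lane is empty.
    """
    return (
        sum(lane_distances["east"]),   # horizontal lanes heading East
        sum(lane_distances["south"]),  # vertical lanes heading South
        sum(lane_distances["west"]),   # horizontal lanes heading West
        sum(lane_distances["north"]),  # vertical lanes heading North
        light_setting,                 # 0 green horizontal, 1 green vertical, 2 amber
        delay_setting,                 # 0 .. 3 + crossing time
    )
```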
I initially considered adding a new state dimension per lane, but determined it to be infeasible without some kind of dimensionality reduction, for example using function approximation, as the resulting table would be far too large to process.
Results
The performance measure for all of my experiments is the sum of the number of queued (stopped) cars in all lanes within each timestep. I then evaluate this measure as an exponential moving average with a smoothing factor α = 1 − e^(−1/N), where N is the scaling constant, set to 300 iterations. The exponential moving average is calculated using the following equation:
S_t = α x_t + (1 − α) S_{t−1}
where S_t is the new exponential moving average at timestep t, x_t is the number of stopped cars during this timestep, and S_{t−1} is the exponential moving average at the previous timestep t − 1.
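A short sketch of this smoothing calculation, assuming the 1 − e^(−1/N) form of the smoothing factor given above:

```python
import math

N = 300                        # scaling constant, in iterations
ALPHA = 1 - math.exp(-1 / N)   # smoothing factor alpha = 1 - e^(-1/N)

def update_ema(previous_ema, stopped_cars):
    """One step of the exponential moving average of stopped cars."""
    return ALPHA * stopped_cars + (1 - ALPHA) * previous_ema
```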
I began my experimentation with the Q-learning agent by tuning the learning parameters. By first finding the best learning parameters for this particular use case, namely traffic signal control, the subsequent experiments can all be run with a single, consistent set of parameters. To test these parameters in the simulation, I used 2 lanes to and from each intersection in every direction and a total of 4 intersections (2 in each row and column), and averaged all results over 5 runs. The generating function intensity is 5, and the reward function chosen is the negative of the total number of cars stopped on roads leading to the intersection in any direction or lane. The results below were obtained by running the simulation for 10000 iterations.
The learning rate alpha determines how much weight new information is given compared to older information. A learning rate of 0 means the agent cannot learn anything at all, and the results confirm that it performs poorly in this case. As the learning rate approaches 1, the agent weights recent information more and more heavily. From the results shown, a learning rate of 0.3 gives the best outcome on average.
[Figure: Average Cars Stopped vs Learning Rate — exponential average of cars stopped for learning rates 0 to 1]
[Figure: Average Cars Stopped vs Discount Factor — exponential average of cars stopped for discount factors 0 to 1]
The discount factor gamma adjusts the significance of future rewards. A discount factor of 0 causes the learning agent to consider only the current reward and ignore all future rewards, while the opposite is true with a discount factor of 1, which results in rather suboptimal performance. The discount factor which leads to the best performance is 0.7.
The exploration factor epsilon enables state exploration by occasionally allowing the learning agent to choose a random next action instead of the highest-valued one. An exploration factor of 0 means that no exploration is conducted, so reasonable action paths found early on are likely to be repeated again and again even if they are not part of the optimal policy. As shown by the results, this clearly does not lead to optimal outcomes. Neither does an exploration factor of 1, which causes a random exploratory action to be selected every time. It is evident from the results that an exploration factor of 0.1 should be chosen for the further experiments. Note that in the simulation this exploration factor is decayed by a factor of 0.9998 raised to the power of the number of timesteps undertaken so far. This factor corresponds to lambda in the code, and the reason for its use is explained in the implementation section above.
Using these learning parameters as the defaults, I proceeded to evaluate the performance of different reward functions. The reward (punishment) functions tested were (sketched in code after this list):
A: -1 for any stopped car on roads leading to the intersection, in any direction or lane
B: -1 × total number of cars stopped on roads leading to the intersection, in any direction or lane
C: -1 × sum of the squared number of cars stopped in each lane of the roads leading to the intersection
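The sketch below gives one possible reading of these three reward functions, assuming `stopped_per_lane` is a list of stopped-car counts for each lane leading to the intersection; in particular, A is interpreted here as -1 whenever at least one car is stopped.

```python
def reward_a(stopped_per_lane):
    # -1 if any car is stopped on a road leading to the intersection (one reading of A)
    return -1 if any(n > 0 for n in stopped_per_lane) else 0

def reward_b(stopped_per_lane):
    # -1 * total number of stopped cars
    return -sum(stopped_per_lane)

def reward_c(stopped_per_lane):
    # -1 * sum of squared per-lane stopped-car counts
    return -sum(n * n for n in stopped_per_lane)
```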
I plotted these reward functions against a baseline consisting of a non-adaptive algorithm which simply alternated the light signals every 10 timesteps. The generating function intensity was kept at 5 during all of the runs, and each simulation run lasted 10000 iterations.
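For comparison, a fixed-phase baseline of this kind could look roughly like the following; the function name and return values are illustrative only.

```python
def baseline_action(timestep, period=10):
    """Non-adaptive baseline: request a phase switch every `period` timesteps."""
    return "switch" if timestep % period == 0 else "hold"
```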
[Figure: Average Cars Stopped vs Exploration Factor — exponential average of cars stopped for exploration factors 0 to 1]
[Figure: 1 Lane 1 Intersection — exponential average of cars stopped over 10000 timesteps for Baseline, A, B and C]
From the 1 lane results, it is clear that all of the Q-learning configurations eventually performed better than the naïve constant light-switching approach. Reward function B resulted in a consistently lower number of stopped cars than A, and was for the most part also better than C. Using a reinforcement learning algorithm to adaptively control traffic lights appears to have a clear benefit.
The 2 lane results tell a similar story, with B consistently outperforming A and C, and by extension the baseline too. Incorporating 2 lanes instead of 1 has also made the graphs more jittery, reflecting additional volatility in the number of cars stopped.
[Figure: 1 Lane 4 Intersections — exponential average of cars stopped over 10000 timesteps for Baseline, A, B and C]
[Figure: 2 Lanes 1 Intersection — exponential average of cars stopped over 10000 timesteps for Baseline, A, B and C]
[Figure: 2 Lanes 4 Intersections — exponential average of cars stopped over 10000 timesteps for Baseline, A, B and C]
These results indicate that using a temporal difference reinforcement learning algorithm such as Q-learning is genuinely valuable in reducing traffic in a road network, at the very least in a simulated environment.
After running the simulation with varying intensities of the generating function, using the various reward functions compared to the baseline, it becomes clear that the Q-learning traffic control system scales with increased traffic in a similar fashion to a non-adaptive implementation. This perhaps indicates that even if a learning system manages to approach the optimal action-selection function, the fundamental capacity limit may still be an issue: no matter how intelligent a traffic system is, if there is an extremely large amount of traffic on the roads, then extended delays will be unavoidable. This suggests that intelligent traffic signals will only be a temporary solution to the real-world problem of ever increasing traffic on our roads.
Conclusion
This study was completed in order to determine the feasibility and value of applying a model-free temporal difference reinforcement learning algorithm to traffic light control. The traffic lights at each intersection within the coded simulation were controlled by an independent Q-learning agent, which was able to work its way towards an optimal light-switching action policy. This study further confirms that Q-learning shows promise for solving traffic light issues and improving overall efficiency in the real world. However, such an application is extremely sensitive to parameter changes, and this, together with the inherent difficulties of moving such an algorithm out of a simulation into the real world, makes it a non-trivial and costly solution to implement. Even after it has eventually been deployed in the real world, such a solution does not solve the traffic problem indefinitely, as there is still a fundamental limit to how much traffic can be alleviated.
Future Work
This study has tested a generic reinforcement learning algorithm on a simulation which can be expanded in a number of areas. With further work, the simulation could be extended to model lane changes and turns at intersections, as well as variable speeds and car spacing, perhaps at a continuous rather than discrete level. More stochastic processes should also be implemented, including pedestrian crossings, weather effects, visibility changes, and/or the aftermath of accidents. Various other temporal difference reinforcement learning algorithms, such as SARSA (Rummery 1994), should be experimented with, together with eligibility traces to speed up the learning process. In order to process more complex, higher-dimensional or even continuous state-spaces, function approximation methods such as k-means clustering or neural networks would later be required to keep computation feasible.
Conducting experiments within a simulation can also be fairly limiting. Future work will need to be done in controlled real-world experiments before such algorithms can be safely deployed in the field. However, the physical instruments required to give intersections access to the real-time data needed to operate a machine learning traffic light system would be costly and difficult to set up, making it uncertain whether we will be able to evaluate such systems in the near future.
[Figure: Average Stopped Cars for Varying Intensities — exponential average of cars stopped for Baseline, A, B and C at generating function intensities 1 through 8]
References
Abdulhai, B., Pringle, R., Karakoulas, G.J. 2003, 'Reinforcement learning for true adaptive traffic signal control', Journal of Transportation Engineering, vol. 129, no. 3, pp. 278-285.
Mannion, P., Duggan, J., Howley, E. 2015, 'An experimental review of reinforcement learning algorithms for adaptive traffic control', Autonomic Road Transport Support Systems, Autonomic Systems, Birkhauser/Springer.
Rummery, G.A., Niranjan, M. 1994, 'Online Q-learning using connectionist systems', Cambridge University Engineering Department.
Steingrover, M., Schouten, R., Peelen, S., Nijhuis, E., Bakker, B. 2005, 'Reinforcement learning of traffic light controllers adapting to traffic congestion', BNAIC, pp. 216-223.
Watkins, C.J.C.H. 1989, 'Learning from delayed rewards', PhD thesis, University of Cambridge, England.