Proposal for selecting a cooperation partner in distributed control of traffic signals using deep reinforcement learning

Traffic signal control is one way to alleviate traffic congestion on road networks. The main approach to traffic signal control is distributed control, in which signals cooperate locally. To realize more effective control in a distributed system, we propose a guideline for selecting the cooperation partner of each traffic signal and verify its effectiveness. The traffic signals are controlled by applying deep reinforcement learning, a machine-learning technique.


Introduction
One way to alleviate frequently occurring traffic congestion is to control the traffic signals. In various existing studies, reinforcement learning, which is a machine learning algorithm, has been applied for traffic signal control (1) .
There are two main methods of traffic signal control: centralized control and distributed control. Centralized control collectively controls all traffic signals in the entire target road network at a control center. However, the amount of calculation required for control increases significantly as the number of intersections increases. Furthermore, an equipment failure at the control center can put the entire target road network at risk.
In distributed control, the traffic signals cooperate locally to obtain the optimum phase of each traffic signal.
This study focuses on the distributed traffic signal control method. In distributed control, the cooperation partners typically comprise all the adjacent traffic signals. However, because traffic volume often differs from road to road, it may be possible to learn more effective control by limiting the cooperation partners rather than cooperating with all adjacent traffic signals.
We have been studying the cooperation control of traffic signals using deep reinforcement learning (2) . Our previous study (2) shows the effectiveness of cooperating only with the adjacent traffic light with the largest inflow traffic in the mesh-shaped road network.
In this study, we first introduce the method proposed in our previous study (2) . We then propose a method for selecting a cooperation partner more flexibly and verify its effectiveness. The cooperation considered here is not direct, such as sharing the indication state of an adjacent traffic signal; rather, each signal cooperates indirectly with adjacent signals by recognizing the traffic conditions on the roads flowing into its own intersection.
In this study, the control of traffic signals is learned using Deep Q-Network (DQN) (3) , a deep reinforcement learning method. The shape of the road network used in this study is a mesh.

Related works
Hongwei et al. (4) proposed QT-CDQN (cooperative deep Q-network with Q-value transfer) for cooperative traffic signal control; its Q function incorporates not only the Q value of a signal's own intersection but also the Q values of adjacent signals. Misawa et al. (5) performed cooperative control in an environment where the traffic volume was not constant on all routes and the ratio of vehicles going straight through intersections was increased. Suzuki et al. (6) alleviated traffic congestion at intersections where the traffic volume was equal on each road, considering that the degree of time loss and economic loss differs depending on the vehicle type. Diamantis et al. (7) performed adaptive fine-tuning (AFT) with real traffic and demonstrated its effectiveness. Tian et al. (8) applied Coder, a deep reinforcement learning algorithm, to traffic signal control. Min et al. (9) applied the A2C algorithm to traffic signal control on an existing road network. Daeho et al. (10) predicted future traffic conditions and performed cooperative control. Deepak et al. (11) proposed cooperative multi-agent reinforcement learning-based models for traffic control optimization.
However, a large number of existing studies focus on cooperating with all adjacent traffic signals. Few studies have shown the effectiveness of selecting partners.

Deep Q-Network
Reinforcement learning seeks a policy to maximize the rewards obtained by a learner (agent) who acts on the environment. In reinforcement learning, the optimal policy is obtained by repeating the following three steps: (1) the agent takes action, (2) the state of the environment is updated by the action of the agent, and in some cases, a reward is obtained for the action; and (3) the agent modifies the policy based on the reward. Reward is an index showing the goodness of an agent's action and environmental state. A policy is a rule that serves as an index when an agent acts.
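The three-step loop described above can be sketched as follows. The environment and agent here are hypothetical toy stand-ins chosen only to illustrate the act/update cycle; they are not the traffic-signal setup used later in the paper.

```python
# Minimal sketch of the reinforcement-learning loop: (1) the agent acts,
# (2) the environment updates and may emit a reward, (3) the agent
# modifies its policy based on the reward.  All names are illustrative.

class ToyEnv:
    """Toy environment: the agent is rewarded for choosing action 1."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += 1
        reward = 1.0 if action == 1 else 0.0
        return self.state, reward

class GreedyAgent:
    def __init__(self):
        self.preference = 0  # crude "policy": which action looks best

    def act(self, state):
        return self.preference          # (1) the agent takes an action

    def update(self, reward, action):
        # (3) modify the policy: keep a rewarded action, otherwise switch
        self.preference = action if reward > 0 else 1 - action

env = ToyEnv()
agent = GreedyAgent()
total = 0.0
for _ in range(5):
    a = agent.act(env.state)
    s, r = env.step(a)                  # (2) the environment is updated
    agent.update(r, a)
    total += r
print(total)  # first action fails, the agent switches, then earns 4.0
```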
In Q-learning (12) , the value of an agent's action is managed in a Q table, and the Q value is updated each time an action is taken. The Q value indicates a value that takes future rewards into account. Deep Q-Network (DQN) (3) extends Q-learning by incorporating a neural network: the network approximates the Q value of each action with respect to the state of the environment. If the Q value of each action can be estimated in a given state, the action with the best Q value, that is, the best action to take, is known. The Q value is calculated as the expected value of the discounted sum of the rewards that can be obtained in the future. The update formula for the Q value is given by Eq. (1):

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a; \theta) - Q(s_t, a_t) \right]  (1)

where Q is the action value, s is a state, \gamma is the discount rate that determines the current value of a future reward, a is an action, r is a reward, \alpha is the learning rate, and \theta is a learning parameter of the neural network. The experience replay and target network methods are used in the DQN to stabilize learning. Experience replay accumulates past experiences in an array called replay memory, randomly selects from the accumulated experiences, and updates the Q value using them as supervised data. The target network separates the parameters of the target Q function from those of the Q function being updated. This stabilizes learning by fixing the target for some time.
It is difficult for Q-learning to express continuous values. However, because DQN is based on the idea of obtaining an approximate function of the Q value by using a neural network, it is possible to deal with continuous values.
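The two stabilization techniques above, experience replay and a separate target network, can be sketched as follows. The network weights are stood in for by simple numpy arrays with a linear Q function; this is an illustrative assumption, not the paper's implementation.

```python
import random
from collections import deque

import numpy as np

GAMMA = 0.99  # discount rate for future rewards

class ReplayMemory:
    """Fixed-size buffer of past (state, action, reward, next_state) tuples."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        # Random sampling breaks the correlation between consecutive steps.
        return random.sample(self.memory, batch_size)

def q_values(weights, state):
    """Linear stand-in for the Q-network: one Q value per action."""
    return weights @ state

def td_targets(target_weights, batch):
    """Compute r + gamma * max_a' Q_target(s', a') for a sampled batch."""
    targets = []
    for state, action, reward, next_state in batch:
        targets.append(reward + GAMMA * np.max(q_values(target_weights, next_state)))
    return np.array(targets)

# Example usage with a 4-dimensional state and 2 actions (the two phases).
memory = ReplayMemory(capacity=1000)
weights = np.zeros((2, 4))          # online network
target_weights = weights.copy()     # frozen copy, synced periodically
for _ in range(8):
    s = np.random.rand(4)
    memory.push((s, 0, 1.0, np.random.rand(4)))
batch = memory.sample(4)
print(td_targets(target_weights, batch))  # four TD targets
```

Periodically copying the online weights into `target_weights` corresponds to fixing the target for some time, as described above.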

Applying DQN
We describe the cooperation method proposed in our previous study (2) . We then propose a method for selecting a cooperation partner more flexibly. The "action selection", "state recognition", and "reward design" are the same as in Reference (2) .

Agent
The traffic signal at each intersection is treated as an agent. Each traffic signal agent selects an action after perceiving the state and receives a reward from the environment depending on the result of the action performed. One cycle is defined as a step.

Action selection
The actions that the traffic signal agent can select are two types of phases: "north-south is red light + east-west is green light" and "north-south is green light + east-west is red light". When switching phases, the yellow signal phase is inserted regardless of the intention of the traffic signal agent.

State recognition
The traffic signal agent recognizes five items: the traffic signal phase within the range of one step, the duration of the phase [s], the traffic density [vehicles/km], the number of waiting vehicles [vehicles], and the average speed [km/h]. The traffic signal itself has two phase patterns: "green light for north and south + red light for east and west" and "red light for north and south + green light for east and west". The duration of the phase indicates the time during which "north-south is green light + east-west is red light" or "north-south is red light + east-west is green light" has continued. The traffic density, number of waiting vehicles, and average speed indicate the density of the roads, the number of stopped vehicles, and the average speed in each of the four directions. The recognition ranges are shown in Figure 1.
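Collecting the five items above into a feature vector might look like the following sketch. The field names, ordering, and flattening are illustrative assumptions; the paper's actual input encoding (reported later as four-dimensional) is not detailed, so this is not its exact representation.

```python
# phase: 0 = "NS green + EW red", 1 = "NS red + EW green"

def build_state(phase, phase_duration_s, density_by_dir,
                waiting_by_dir, avg_speed_by_dir):
    """Flatten one intersection's observation into a feature vector.

    density_by_dir, waiting_by_dir, and avg_speed_by_dir are each
    length-4 sequences ordered (north, east, south, west).
    """
    state = [float(phase), float(phase_duration_s)]
    for direction in range(4):
        state.append(density_by_dir[direction])    # [vehicles/km]
        state.append(waiting_by_dir[direction])    # stopped vehicles
        state.append(avg_speed_by_dir[direction])  # [km/h]
    return state

s = build_state(0, 12, [30, 5, 28, 4], [6, 0, 5, 1], [20, 55, 22, 50])
print(len(s))  # 2 scalar items + 3 values x 4 directions = 14
```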

Reward design
The reward was set using v′, the average speed of all vehicles normalized to a value of approximately 0 to 1.
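A minimal sketch of such a reward is shown below. The normalization constant is an assumption (the 90 km/h speed limit described in the experimental environment); the paper's exact formula for the normalized speed is not reproduced here.

```python
V_MAX_KMH = 90.0  # assumed upper bound used for normalization

def reward(avg_speed_kmh):
    """Map an average speed in [0, V_MAX_KMH] into roughly [0, 1]."""
    return max(0.0, min(avg_speed_kmh / V_MAX_KMH, 1.0))

print(reward(45.0))   # 0.5
print(reward(120.0))  # clipped to 1.0
```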

Selection of cooperation partner
The degree of variation in traffic volume varies across the road network. In other words, the degree of variation in the amount of traffic flowing into each intersection varies. However, fixing the cooperation partner to only the adjacent intersection of the lane with the largest inflow, as in our previous study (2) , does not take this variation into consideration. Therefore, to select the cooperation partner more flexibly, the selection is made based on the ratio of the traffic volume flowing into a traffic signal. First, we find the ratio of the traffic volume flowing in from each of the four directions. Then, the traffic signal cooperates only with the traffic signals of the lanes whose traffic volume exceeds a preset threshold value. If there is no lane with traffic exceeding the threshold, it cooperates with the traffic signals in all four directions.
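The selection rule above can be sketched as follows. The threshold value of 0.4 is a hypothetical placeholder; the paper only states that the threshold is preset.

```python
THRESHOLD = 0.4  # hypothetical ratio threshold; the paper's value is preset

def select_partners(inflow_by_dir):
    """Pick cooperation partners from inflow traffic ratios.

    inflow_by_dir maps a direction ('N', 'E', 'S', 'W') to the traffic
    volume flowing in from that direction's adjacent signal.
    """
    total = sum(inflow_by_dir.values())
    if total == 0:
        return list(inflow_by_dir)  # no traffic: keep all four partners
    partners = [d for d, vol in inflow_by_dir.items() if vol / total > THRESHOLD]
    # If no direction exceeds the threshold, cooperate with all four.
    return partners if partners else list(inflow_by_dir)

print(select_partners({'N': 50, 'E': 10, 'S': 30, 'W': 10}))  # ['N']
print(select_partners({'N': 25, 'E': 25, 'S': 25, 'W': 25}))  # all four
```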

Experimental environment
An open-source micro traffic flow simulator, Simulation of Urban MObility (SUMO) (13) , which provides API and GUI, was used as the simulation environment.
As shown in Fig. 2, the shape of the road network is a 4 × 4 mesh, and the length of one side is 400 [m]; therefore, the length of one continuous road is 2 [km].
To congest some routes, vehicles appeared with a probability of 0.9 in places with heavy traffic. The number of vehicles appearing with a probability of 0.9 was limited to at most 100 in each step. Other vehicles appear with a probability of 0.0011 at each step. All vehicles enter from an edge of the network and disappear when they reach an edge.
When traveling to its destination, a vehicle follows the shortest route; it is also possible for a vehicle to detour and still arrive at its destination.
In Figure 2, A to F indicate columns, and 0 to 5 indicate rows. Each intersection is written as A1, A2, A3, ... using its column and row symbols. The arrows indicate roads with heavy traffic.
The following three roads were set as roads with heavy traffic. (1) The road that first passes through the traffic signal at intersection B4 and passes through intersections C4, C3, C2, and C1. (2) The road that first passes through the traffic signal at intersection B4 and passes through intersections C4, C3, D3, and E3. (3) The road that first passes through the traffic light at intersection B4 and passes through intersections C4, C3, D3, D2, and D1. In addition, 90 [%] of the vehicles passing through traffic light D3 proceed to E3, and 10 [%] proceed to D2 and pass through D1.
The roads are intended as general roads; considering application to international road networks, the speed limit was set to 90 [km/h], the upper limit for general roads.
At each intersection, the vehicle is set so that it can go straight, turn left, turn right, and make a U-turn.
Agents are set for all traffic signals, resulting in a total of 16 traffic signal agents. The maximum time that a traffic signal can continue in the red and green phases is 50 [s], and the minimum time is 5 [s]. When switching phases, a yellow signal phase of 2 [s] is always inserted regardless of the intention of the traffic signal agent.

Experimental scenario
Two experimental scenarios were set.

Parameter setting
The neural network that calculates the Q value has three hidden layers of 256 nodes each, with the nodes between layers fully connected. The input is four-dimensional, whereas the output is two-dimensional. We used the Adam optimizer (14) as the gradient-descent method to minimize the error of the Q value, with the parameters following the Adam reference (14) . In addition, the ε-greedy method was adopted for action selection; ε was linearly reduced from its initial value to 0.02 over 10000 steps. The remaining parameters follow the DQN reference (3) .
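The linear ε schedule described above might look like the following sketch. The initial value of ε is not stated in the text, so 1.0 is an assumption.

```python
EPS_START = 1.0    # assumed initial exploration rate
EPS_END = 0.02
DECAY_STEPS = 10000

def epsilon(step):
    """Linearly anneal epsilon from EPS_START to EPS_END over DECAY_STEPS."""
    if step >= DECAY_STEPS:
        return EPS_END
    frac = step / DECAY_STEPS
    return EPS_START + frac * (EPS_END - EPS_START)

print(epsilon(0))       # 1.0
print(epsilon(5000))    # about 0.51
print(epsilon(20000))   # 0.02
```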
One cycle is defined as a step, and the action in one step is assumed to be executed continuously for 5 [s].
The simulation time was set to 10000 [s]. We waited for 500 [s] before starting learning so as to let enough vehicles enter the road network. This procedure was repeated 1000 times.

Experimental results and discussion
Scenarios 1 and 2 were each executed 10 [times]. Table 1 shows the cooperation partners in one of the ten executions of Scenario 1. Table 2 shows the execution results of Scenario 1, and Table 3 shows those of Scenario 2. These two tables show the average speed and reward of the final result of each run. Figure 3 shows the learning curves of the average speed in Scenarios 1 and 2. The average speed is the average speed of the entire road network over the 10 executions. The vertical axis shows the average speed [km/h], and the horizontal axis shows the number of learning iterations. The blue line shows the learning results of Scenario 1, and the orange line shows those of Scenario 2. As shown in Tables 2 and 3 and Figure 3, the learning result of Scenario 1 is superior. Furthermore, it can be seen that in the early stages of learning, Scenario 1 learns faster than Scenario 2. From these experimental results, it can be seen that it is important not to cooperate by considering the conditions of all adjacent roads, but to select the cooperation partner appropriately.

Table 1. Cooperation destination of each intersection
Table 2. Execution results of scenario 1
Table 3. Execution results of scenario 2
Fig. 3. Average speed
In this experiment, the average speed of the vehicle could be improved by selecting a cooperation partner. This can be attributed to the increase in the average speed of roads with heavy traffic by recognizing the state of only the road with heavy traffic. Congestion on high-traffic roads has a greater impact on the entire road network than on low-traffic roads. Therefore, the average speed of the entire network is increased by smoothing the traffic flow on the road with heavy traffic.

Conclusion
In this study, we proposed a method for selecting a cooperation partner in distributed control of traffic signals using deep reinforcement learning. The proposed method cooperated only with the adjacent traffic signals in the lanes with traffic volumes whose ratio to the total traffic flowing into the traffic signals exceeded a preset threshold. To evaluate the effectiveness of the proposed method, a simulation experiment was conducted in a 4×4 signal network with increasing traffic volume on a specific road. The experimental results revealed that the proposed method improved the average speed of the road network by approximately 1.7 [%]. This result can be attributed to effective traffic control on a busy road. Thus, the effectiveness of our proposed method was verified.
Although a fixed threshold value was used in the proposed method, a method for setting the threshold dynamically depending on the traffic volume should be established in future work.