A Reinforcement Learning System with Multi-Layered Fuzzy Neural Network

To define the states of an unknown environment for an intelligent agent or autonomous robot, many methods such as neural networks, linear models, and decision trees have been proposed in the field of reinforcement learning. In this paper, a self-organized fuzzy neural network with multiple fuzzy inference layers (MLFNN) is proposed. The deeper fuzzy inference layer includes fuzzy membership functions and fuzzy rules, as in conventional fuzzy neural networks; however, it extracts abstract states from the input data. A goal-navigated exploration problem was used in the experiments to confirm the effectiveness of the proposed reinforcement learning system.


Introduction
Intelligent agents and autonomous robots are defined as artifacts with self-learning abilities that allow them to adapt to unknown or unstable environments. To design such artificial intelligence models, reinforcement learning (RL) (1), a branch of machine learning, is a powerful tool that has shown attractive achievements recently (2)(3). In May 2017, the artificial intelligence (AI) software AlphaGo (3), developed by Google's DeepMind and combining RL with tree-search methods, completely defeated the strongest human player and professional teams of the game of Go in China.
RL realizes adaptive action acquisition through exploration and exploitation trials. There are four principal elements in RL: state, action, policy, and reward. A learner observes the state of the environment, outputs an action according to its (generally stochastic) policy, and obtains a reward or punishment from the environment during the transition between states. In this paper, we concentrate on the problem of state identification. The raw information observed by the learner needs to be categorized into RL states, and many methods have been proposed for this, such as neural networks (1)-(3), linear models (1), decision trees (3), fuzzy inference systems (1)-(10), and random tiling (11). In this paper, we propose a novel multi-layered fuzzy neural network (MLFNN) based on the conventional self-organized fuzzy neural network (SOFNN) proposed in our previous works (5)-(10). The main aim of the proposed MLFNN is to reduce the dimensionality of the state inference performed by the SOFNN. In MLFNN, abstract states, which categorize the states given by the SOFNN, are specified by the designer. An experiment on a goal-navigated exploration problem was performed, and the effectiveness of the proposed MLFNN was confirmed.
DOI: 10.12792/icisip2017.081

Fig. 1. A RL system with FNN (10)

The membership functions of the fuzzy net are Gaussian:

μ_i^k(x_k) = exp( -(x_k - c_i^k)^2 / (σ_i^k)^2 )    (1)

Here c_i^k and σ_i^k denote the mean and the deviation of the ith membership function for the kth input dimension, which can be interpreted as the center and the width of an RBF node, respectively. To decide the number of membership functions and fuzzy rules, the following self-organizing algorithm was proposed (5).
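As a concrete illustration, a Gaussian node in the style of Eq. (1) can be evaluated as follows (a minimal Python sketch; the function name and the exact normalization of the exponent are our assumptions, not the paper's notation):

```python
import math

def gaussian_membership(x, c, sigma):
    """Membership degree of input x for a Gaussian (RBF-style) node
    with center c (the mean) and width sigma (the deviation)."""
    return math.exp(-((x - c) ** 2) / (sigma ** 2))
```

An input lying exactly at the center fires the node fully (degree 1.0), and the degree decays smoothly toward 0 as the input moves away from the center.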

Step 1. For each dimension in the input space, only one membership function is generated by the first input data, the value of its membership's mean equals to the value of input data, and the value of deviation of all Gaussian function units is fixed to an empirical value.
Step 2. For each subsequent input, if the maximum membership degree over the existing membership functions in a dimension is smaller than a threshold F, a new membership function centered on that input is generated (Eq. (2)). F is a threshold value used to judge whether the input state is identified precisely enough by the existing membership functions.
Step 3. A new rule is generated automatically when a new membership function is added according to Eq. (2). The connections between all membership functions and the new rule are added.
Iteratively, the fuzzy net adapts to all input states and can be reconstructed whenever a new input appears.
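Steps 1-3 can be sketched as follows (an illustrative Python reconstruction; the class and variable names are ours, and the winner-take-all rule creation is a simplification of the paper's connection scheme):

```python
import math

def membership(x, c, sigma):
    # Gaussian membership degree, as in Eq. (1).
    return math.exp(-((x - c) ** 2) / (sigma ** 2))

class SelfOrganizingFuzzyNet:
    """Grow membership functions and rules on demand (Steps 1-3)."""

    def __init__(self, dim, sigma=1.0, threshold=0.5):
        self.dim = dim
        self.sigma = sigma          # fixed empirical deviation (Step 1)
        self.threshold = threshold  # F in Step 2
        self.centers = [[] for _ in range(dim)]  # per-dimension means
        self.rules = []             # each rule: a tuple of center indices

    def observe(self, x):
        idx = []
        for k in range(self.dim):
            fits = [membership(x[k], c, self.sigma) for c in self.centers[k]]
            if not fits or max(fits) < self.threshold:
                # Steps 1-2: new membership function centered on the input.
                self.centers[k].append(x[k])
                idx.append(len(self.centers[k]) - 1)
            else:
                idx.append(fits.index(max(fits)))
        rule = tuple(idx)
        if rule not in self.rules:
            self.rules.append(rule)  # Step 3: add the new rule.
        return rule
```

The first observation seeds one membership function per dimension; later observations that no existing function covers (maximum degree below F) spawn new functions and, with them, new rules.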

MLFNN
In MLFNN, the output of the fuzzy rules of the first fuzzy net is passed, after inference, to the second fuzzy net as its input. The fuzzy rules of the second fuzzy net are defined analogously, where R denotes the output of the first fuzzy net.
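The two-stage inference can be sketched as follows (a hypothetical Python reconstruction: the first net is abbreviated to the identified position, the second net fires distance-to-goal membership functions over the middle-goal centers; all names and the winner-take-all readout are our assumptions):

```python
import math

def membership(x, c, sigma):
    # Gaussian membership degree, as in Eq. (1).
    return math.exp(-((x - c) ** 2) / (sigma ** 2))

def mlfnn_forward(position, goal, abstract_centers, sigma=2.0):
    """Map a concrete state (the agent's position) to an abstract state.

    abstract_centers holds the distances from the middle-goal areas
    R_1, ..., R_P to the goal; the second fuzzy net categorizes the
    current distance-to-goal against them."""
    x, y = position
    gx, gy = goal
    distance = math.hypot(x - gx, y - gy)  # input to the second net
    degrees = [membership(distance, c, sigma) for c in abstract_centers]
    return degrees.index(max(degrees))     # winning abstract state p
```

With the centers 4.95, 13.52, 17.68 from the experiment section, a position near the goal maps to the first abstract state and a position near the start maps to the last one.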

A RL system with MLFNN
As with the reinforcement learning (RL) system with SOFNN (see Fig. 1), an RL system with MLFNN (see Fig. 2) is composed of the fuzzy net and a temporal difference (TD) learning algorithm, e.g., Q-learning or Sarsa learning.
The outputs of the fuzzy net are connected to the state-action value function Q(s, a_q), where q = 1, 2, …, Q indexes the available actions. The action selection policy is a Boltzmann distribution:

π(a_q | s) = exp( Q(s, a_q) / T ) / Σ_j exp( Q(s, a_j) / T )

where T is a constant affecting the entropy of the stochastic policy. The learning rule of MLFNN in the case of Sarsa learning is given as follows (proved in reference (10)):

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) ]

where α, r_{t+1}, and γ are the learning rate, the reward after action a_{t+1}, and the discount factor, respectively. In references (8)-(10), a suitable distance between multiple agents was used as a kind of swarm intelligence (12) to improve the learning performance. It is realized by adding a positive/negative reward to the update above when the distance between the independent learners is suitable/unsuitable. When this distance-based reward is included, the learning method is called "swarm learning"; otherwise it is called "no swarm learning". We also confirmed the efficiency of swarm learning for RL with MLFNN in the experiments. Furthermore, in references (7) and (10), the learning rate was adjusted adaptively according to the number of visits to each state: a more frequently visited state uses a smaller learning rate.
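The Boltzmann policy and the Sarsa update can be sketched as follows (a minimal Python illustration with a tabular Q function; the data layout and parameter defaults are our assumptions):

```python
import math
import random

def boltzmann_action(q_values, T=1.0):
    """Sample an action with probability proportional to exp(Q/T)."""
    weights = [math.exp(q / T) for q in q_values]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for a, w in enumerate(weights):
        acc += w
        if r <= acc:
            return a
    return len(weights) - 1  # numerical-edge fallback

def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    """One Sarsa step: Q(s,a) += alpha * (r + gamma*Q(s',a') - Q(s,a))."""
    td_error = reward + gamma * Q[s_next][a_next] - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error
```

A larger temperature T flattens the policy toward uniform exploration; a smaller T makes it greedier. Swarm learning, as described above, would simply add the distance-based term to `reward` before calling `sarsa_update`.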

Fig. 2. A RL system with MLFNN (input → Fuzzy Net 1 → Fuzzy Net 2)

Experiments
A goal-navigated problem (Fig. 3) was used in the computer simulation experiment. In Fig. 3(a), two agents a and b start from the start positions. The 2-D exploration space has a size of 16x16 cells, with a 4x4 obstacle area in the center of the square and a 4x4 goal area in the corner opposite the start position. The observed data (input to the RL system) were the coordinate positions of the agents, and the actions were movements in 4 directions with a step size of 1 (i.e., the output of the RL system is discrete). In Fig. 3(b), three middle goal areas are designed for the MLFNN's second fuzzy layer. The distances from the centers of R_1, R_2, R_3 to the goal are c_1 = 4.95, c_2 = 13.52, and c_3 = 17.68, respectively. Three kinds of RL methods were compared in the experiments: Q-table (1), FNN (10), and MLFNN (proposed here).
Parameters used in the experiments are shown in Table 1.
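The environment described above can be sketched as follows (an illustrative Python reconstruction of the 16x16 grid of Fig. 3; the exact obstacle and goal coordinates are our assumptions, since the paper gives only the region sizes):

```python
def make_environment(size=16):
    """16x16 grid with a 4x4 obstacle block in the center and a 4x4
    goal region in the corner opposite the start (assumed layout)."""
    obstacle = {(x, y) for x in range(6, 10) for y in range(6, 10)}
    goal = {(x, y) for x in range(size - 4, size)
                   for y in range(size - 4, size)}
    return obstacle, goal

def step(pos, action, obstacle, size=16):
    """Move one cell in one of 4 directions; blocked or out-of-bounds
    moves leave the agent in place."""
    dx, dy = [(0, 1), (0, -1), (1, 0), (-1, 0)][action]
    nx, ny = pos[0] + dx, pos[1] + dy
    if (nx, ny) in obstacle or not (0 <= nx < size and 0 <= ny < size):
        return pos
    return (nx, ny)
```

An RL learner would observe `pos`, pick one of the 4 actions, and receive a positive reward on entering the goal region.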
Fig. 4 shows the final paths learned by the agents under RL with MLFNN. The trajectories of the two agents stay close together in the case of swarm learning, as shown in Fig. 4(a), and separate in the case of no swarm learning (Fig. 4(b)). Fig. 5 shows the change of the agents' path lengths over trials. As the iterations proceed (the horizontal axis indicates trials), the path lengths shortened in both the swarm-learning and no-swarm-learning cases; however, swarm learning showed better convergence. The change of the number of membership functions in the first layer of MLFNN is shown in Fig. 6, and the number of fuzzy rules of the first layer in Fig. 7. As shown in Fig. 6, both agents had explored the maximum height and width of the environment (16x16) after 400 trials. From Fig. 7, it can be seen that agent 2 visited fewer positions than agent 1. The learning performance comparison between the Q-table, RL with FNN, and RL with MLFNN is given in Fig. 8. The proposed method, RL with MLFNN, showed its superiority in both the swarm-learning and no-swarm-learning cases. The average learning costs over 10 experiments (1000 trials per experiment) for these methods are shown in Fig. 9. Swarm learning showed its superiority for all methods, and swarm learning with RL with MLFNN had the best performance.

Conclusions
To alleviate the problem of state explosion (the curse of dimensionality) in reinforcement learning (RL) for multi-agent systems (MAS), a multi-layered fuzzy neural network (MLFNN) was proposed in this paper. Based on the self-organizing fuzzy neural network (SOFNN), MLFNN constructs a first fuzzy inference layer that categorizes input data into states and, by inserting a second fuzzy inference layer for abstract states (e.g., middle goals), outputs fewer states to the state-action value function (Q function), which improves the learning performance. Goal-navigated problem experiments showed the outstanding efficiency of the proposed method compared to conventional Sarsa learning with a Q-table or SOFNN. Future work includes designing an automatic mechanism to find the abstract states, which may itself be formulated as an RL problem.
The output of the first fuzzy net serves as the input to the second fuzzy net. Additionally, a set of membership functions for the abstracted states is designed. Similar to Eq. (1), these membership functions are defined over the distance between the goal and the current state, where p = 1, 2, …, P indexes the fuzzy membership functions of the second fuzzy net. For example, in the goal exploration problem, c_p can be the distance from the goal to positions in areas near the goal or near the middle goals.

Fig. 3. Environment used in the goal-navigated problem experiment.

Fig. 4. Paths after learning by the proposed method.

Fig. 5. The learning performance comparison between swarm learning and independent (no swarm) learning.

Fig. 6. The change of the number of membership functions during learning.

Fig. 7. The change of the number of fuzzy rules in the first layer of MLFNN during learning.