A New Reinforcement Learning Robot Walking Algorithm Based On Neural Circuits! Realization Of A Bipedal Walking Framework Based On The Control Mechanism Of Neurobiological Movement Patterns!
3 main points
✔️ Utilizes central pattern generators (CPGs) within reinforcement learning (RL) to achieve bipedal walking that adapts to the environment
✔️ Proposes the "CPG-actor-critic" method as an autonomous learning framework for CPG controllers
✔️ Fewer falls and more successful walks on various ground surfaces
Reinforcement learning for a biped robot based on a CPG-actor-critic method
(Submitted on Aug 2007)
Comments: Accepted by Neural Networks.
Subjects: Reinforcement Learning (cs.RL); Machine Learning (stat.ML)
Is it possible to achieve stable bipedal walking in robots through neural circuit-based reinforcement learning?
The purpose of this research is to stabilize the bipedal walking of a robot by introducing the structure of central pattern generators (CPGs), a kind of neural circuit mechanism, into the Actor-Critic method of reinforcement learning, and thereby to derive parameters that realize stable walking. Robotic bipedal walking has been studied for more than half a century, but many problems remain and human-like bipedal walking has not yet been realized. Two open tasks are walking that is driven by a nervous-system-like mechanism and improved stability. For these tasks, it is necessary to reproduce rhythmic behaviors that take into account the interaction between the robot and the environment, such as the state of the ground surface, and to set parameters that adapt to the environment. To address these issues, we focus on reinforcement learning (RL), which learns parameters through the interaction between the agent and the environment. In particular, by adopting the RL algorithm called Actor-Critic and introducing the CPG mechanism that generates rhythmic motion, we aim to derive parameters that cope with environmental changes and achieve stable walking.
Bipedal walking in robots
First of all, we will briefly explain robot bipedal walking, which is the research subject of this paper.
Humans unconsciously walk on two legs, but stable bipedal walking is considered to be a relatively difficult task for robots. Since the beginning of bipedal walking research about half a century ago, no robot has been developed that can outperform human ability, and many studies on the mechanism of human walking have been reported.
It has been pointed out that human bipedal walking is characterized by three features: stability in the structure of the legs; mechanisms in the nervous system, such as the spinal cord, that generate rhythmic movements; and higher-order motor control by the brain. Humans use these mechanisms to achieve a stable and energy-efficient bipedal gait. For example, on an uneven surface, the muscles that move the foot forward and the neural circuits that control this movement operate autonomously. The movement of the muscles is controlled by a group of neurons called central pattern generators (CPGs): the oscillation of the CPGs outputs electrical signals with a certain rhythm, causing the leg muscles to contract. In addition, walking on a gentle slope requires a function that prevents falling and stabilizes the body against changes in the road surface, and this function is also provided by the CPG. In this research, we aim to realize stable bipedal walking that adapts to the environment by introducing into RL the mechanism of CPGs that controls such rhythmic movements, including bipedal walking.
Previous research and issues on central pattern generators (CPGs)
As mentioned above, the CPG is a neural mechanism that controls rhythmic movement patterns, such as bipedal walking, and is located in the spinal cord of vertebrates. It has also been shown that sensory feedback signals play a role in coordinating the physical system with the CPG, which stabilizes rhythmic movements. In a study on lampreys, swimming movements were simulated using a model consisting of 10 rigid links, and the CPG of the live fish was observed. In addition, a simulation of human-like bipedal walking has been reported in which a model of the human lower body and a CPG controller were developed.
In this context, the design of the controllers that drive CPGs (CPG controllers) has been identified as a challenge: the controller depends on both the robot (physical system) and the environment, which makes its parameters complex to set. Genetic Algorithms (GAs) have been used in many studies to determine these parameters, for example for efficiently "walking" salamanders and for simulations of humanoid bipedal walking aimed at energy saving and long-distance walking. GAs model the evolutionary process of an animal species: they select the individuals in the population that are better adapted to the environment and update the population based on the parameters of the best-fit individuals.
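The select-and-update loop described above can be sketched roughly as follows. This is only a minimal illustration, not the setup of the cited studies: `toy_fitness` is a hypothetical stand-in for a walking simulation that scores a CPG parameter vector, and the population size, elite count, and mutation scale are assumed values.

```python
import random

def toy_fitness(params):
    # Hypothetical stand-in for a walking simulation: higher is better,
    # with the best score at params == [0.5, -0.3].
    target = [0.5, -0.3]
    return -sum((p - t) ** 2 for p, t in zip(params, target))

def evolve(fitness, dim=2, pop_size=20, n_elite=5, generations=100,
           mutation_std=0.1, seed=0):
    """Keep the best-adapted individuals and refill the population
    with mutated copies of them (all settings are assumptions)."""
    rng = random.Random(seed)
    population = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: rank individuals by fitness and keep the elite.
        population.sort(key=fitness, reverse=True)
        elite = population[:n_elite]
        # Update: children are Gaussian mutations of elite parents.
        children = []
        while len(elite) + len(children) < pop_size:
            parent = rng.choice(elite)
            children.append([p + rng.gauss(0.0, mutation_std) for p in parent])
        population = elite + children
    return max(population, key=fitness)

best = evolve(toy_fitness)
```

Note that every fitness evaluation here would correspond to a full walking trial in a real study, which is why the population size directly drives the cost of a GA.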
Purpose of this research
In this study, we focus on reinforcement learning, especially Actor-Critic, for modeling the interaction between the agent and its environment, and propose a new algorithm, the CPG-actor-critic algorithm. This algorithm is based on reinforcement learning (the Actor-Critic method), which learns action selection (a policy) through the interaction between an agent and its environment. By introducing the aforementioned CPG, it incorporates a mechanism to control rhythmic motion, with the aim of realizing stable bipedal walking.
Overview of the proposed method (CPG-Actor-Critic method)
In this section, we describe the CPG-Actor-Critic method, which incorporates CPG, a neural control mechanism, into the Actor-Critic method.
The Actor-Critic (AC) method combines value-based and policy-based reinforcement learning. It is characterized by two separate components, the Actor, which determines actions, and the Critic, which evaluates states; the two are updated together and learn from each other. When the AC method is applied to a robot, the Actor is implemented as a controller that provides control signals to the physical system, and the physical system returns signals to the controller according to the observed state. The Critic evaluates the current state of the physical system and quantifies the effectiveness of the control signals provided by the Actor, and the Actor updates its control signals accordingly.
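To make the division of labor between Actor and Critic concrete, here is a minimal sketch of one update step in a generic textbook form (one-step temporal-difference learning with a linear Critic and a Gaussian Actor). It is not the update rule of the paper; the function name, the feature vectors, and all step sizes are assumptions chosen for illustration.

```python
def actor_critic_step(theta, w, features, next_features, action, reward,
                      sigma=0.5, gamma=0.9, lr_actor=0.01, lr_critic=0.1):
    """One generic TD(0) actor-critic update (illustrative assumption).

    theta: Actor parameters; the policy mean is theta . features
    w:     Critic parameters; the state value is w . features
    """
    value = sum(wi * fi for wi, fi in zip(w, features))
    next_value = sum(wi * fi for wi, fi in zip(w, next_features))
    # Critic's evaluation: was the outcome better or worse than expected?
    td_error = reward + gamma * next_value - value
    # Critic update: move the value estimate toward the TD target.
    new_w = [wi + lr_critic * td_error * fi for wi, fi in zip(w, features)]
    # Actor update: follow the gradient of log pi(action | state),
    # scaled by the Critic's evaluation.
    mean = sum(ti * fi for ti, fi in zip(theta, features))
    grad_log_pi = [(action - mean) / sigma ** 2 * fi for fi in features]
    new_theta = [ti + lr_actor * td_error * gi
                 for ti, gi in zip(theta, grad_log_pi)]
    return new_theta, new_w, td_error

# Example: a single-feature state, an action above the current mean,
# and a positive reward.
new_theta, new_w, td_error = actor_critic_step(
    theta=[0.0], w=[0.0], features=[1.0], next_features=[1.0],
    action=1.0, reward=2.0)
```

In this example the action turned out better than the Critic expected, so the TD error is positive and the Actor's mean moves toward the action that was taken.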
A CPG controller, however, behaves as a non-stationary policy, whereas RL algorithms aim at learning stationary policies, so this discrepancy must be resolved when RL is introduced. In addition, since CPGs include a temporal component, many studies have adopted recurrent neural networks (RNNs) and backpropagation through time (BPTT). Because these algorithms have to keep the past history, they require a lot of computation time and storage.
The proposed method (Fig. 1) addresses these issues by dividing the CPG controller into two parts, the basic CPG and the Actor. The basic CPG has fixed connection weights and is treated together with the physical system as a single dynamical system (the CPG-coupled system). The Actor is defined so that it controls the CPG-coupled system and outputs indirect control signals. With this arrangement, the internal state (context) of the CPG lies outside the controller, so the Actor is context-independent and can be viewed as a static controller of the CPG-coupled system with no recurrent connections. This allows RL algorithms that derive stationary policies to be applied. Moreover, because the intrinsic parameters of the system, such as the frequency and amplitude of the oscillatory output, are defined statically, the oscillatory output does not have to be learned from scratch and the learning cost is reduced. Specifically, restricting the control space so that the controller output is periodic reduces the amount of history that has to be retained. Restricting the search space and the parameters in this way makes it possible to introduce an RL algorithm and to learn efficiently.
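The split into a fixed basic CPG and a static Actor can be sketched as below. Everything here is a simplified assumption rather than the model of the paper: the basic CPG is a two-neuron mutually inhibiting oscillator with adaptation (Matsuoka-style), the "physical system" is a single hypothetical joint angle `q` driven by the difference of the neuron outputs, and the time constants and weights are illustrative values. The point is structural: all internal state lives in the CPG-coupled system, and `actor` is a memoryless function of the current observation.

```python
def relu(x):
    return max(0.0, x)

def actor(state, theta):
    """Static (memoryless) Actor: maps the current observation of the
    CPG-coupled system to an indirect control signal for the CPG."""
    x1, x2, v1, v2, q = state
    observation = (relu(x1), relu(x2), q)  # CPG outputs + sensory feedback
    u = sum(t * o for t, o in zip(theta, observation))
    return u, -u  # antagonistic drive to the two neurons (assumption)

def step_coupled_system(state, control, dt=0.001, tau=0.05, T=0.6,
                        beta=2.5, a=2.0, tonic=1.0):
    """One Euler step of the basic CPG (fixed weights) plus a toy body."""
    x1, x2, v1, v2, q = state
    u1, u2 = control
    y1, y2 = relu(x1), relu(x2)
    dx1 = (-x1 - beta * v1 - a * y2 + tonic + u1) / tau
    dx2 = (-x2 - beta * v2 - a * y1 + tonic + u2) / tau
    dv1 = (-v1 + y1) / T      # slow adaptation of neuron 1
    dv2 = (-v2 + y2) / T      # slow adaptation of neuron 2
    dq = -q + (y1 - y2)       # hypothetical one-joint physical system
    return (x1 + dt * dx1, x2 + dt * dx2,
            v1 + dt * dv1, v2 + dt * dv2, q + dt * dq)

def simulate(theta, steps=10000):
    state = (0.1, 0.0, 0.0, 0.0, 0.0)  # small asymmetry starts the rhythm
    torques = []
    for _ in range(steps):
        state = step_coupled_system(state, actor(state, theta))
        torques.append(relu(state[0]) - relu(state[1]))
    return torques

def count_sign_changes(values):
    changes, last = 0, 0
    for v in values:
        sign = (v > 0) - (v < 0)
        if sign != 0:
            if last != 0 and sign != last:
                changes += 1
            last = sign
    return changes

# With all Actor weights at zero, the basic CPG already produces an
# alternating output, so the rhythm itself need not be learned.
torques = simulate(theta=(0.0, 0.0, 0.0))
sign_changes = count_sign_changes(torques)
```

Because the Actor only shapes the input to an oscillator that already runs on its own, learning adjusts how the rhythm responds to feedback rather than creating the rhythm.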
In the simulation environment, the stochastic policy is defined as a normal distribution. Its mean is calculated as a weighted sum of the CPG output and the sensory feedback signal, and the weights are used as the policy parameters in RL. The standard deviation is fixed at 0.01 for the simulation: a standard deviation that is too large makes the motion unstable because of noise in the controller, while one that is too small reduces the ability to explore unknown policies. In addition, the Critic's basis functions are chosen heuristically, and their number is kept as small as possible, which still allows the state-value function to be approximated while improving the reliability of the least-squares estimation.
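The stochastic policy described above can be sketched as follows. The fixed standard deviation of 0.01 and the weighted-sum mean come from the description; the function names, the one-dimensional action, and the example numbers are assumptions made for illustration.

```python
import random

SIGMA = 0.01  # fixed standard deviation used in the simulation

def policy_mean(weights, cpg_output, sensory_feedback):
    """Mean of the policy: weighted sum of CPG output and sensory feedback."""
    inputs = list(cpg_output) + list(sensory_feedback)
    return sum(w * x for w, x in zip(weights, inputs))

def sample_action(weights, cpg_output, sensory_feedback, rng):
    """Draw an action from the normal distribution N(mean, SIGMA^2)."""
    mean = policy_mean(weights, cpg_output, sensory_feedback)
    return rng.gauss(mean, SIGMA)

def grad_log_policy(weights, cpg_output, sensory_feedback, action):
    """Gradient of log N(action; mean, SIGMA^2) with respect to the weights,
    the quantity a policy-gradient method would use."""
    inputs = list(cpg_output) + list(sensory_feedback)
    mean = policy_mean(weights, cpg_output, sensory_feedback)
    return [(action - mean) / SIGMA ** 2 * x for x in inputs]

# Hypothetical example: two CPG outputs and one feedback signal.
weights = [0.5, -0.2, 0.1]
cpg_output = [1.0, 0.0]
sensory_feedback = [2.0]
rng = random.Random(0)
mean = policy_mean(weights, cpg_output, sensory_feedback)
action = sample_action(weights, cpg_output, sensory_feedback, rng)
```

Since the gradient is divided by the variance, a very small standard deviation makes each sampled deviation from the mean weigh heavily, which is one way to see why the noise level has to be balanced against exploration.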
Simulation of bipedal walking
The purpose of this evaluation is to clarify whether our method is able to obtain a CPG controller that makes a biped walk stably. In the evaluation (see the figure below), before training we set the parameters of the Actor to a random vector with small values, so that the robot, initialized with both legs slightly open, cannot yet walk. The results show that after about 4000 training episodes, the number of falls within the first 5 seconds decreases and the average reward per step increases.
Simulation on various ground surfaces
In this evaluation, we investigate whether the CPG controller can acquire a policy for stable gait under various ground conditions. The simulations are conducted on three types of ground: a descent of -0.15 rad, an ascent of 0.1 rad, and a rough terrain where the slope of each straight section is randomly set in the range [-0.1, 0.1] rad (see figure below). First, the parameters of the Actor were initialized with hand-tuned values plus small random perturbations. We changed the criterion for falling because the vertical velocity is likely to decrease (increase) on downhill (uphill) slopes. The proposed method, CPG-actor-critic, obtained a good controller that walks stably without falling on the downhill, uphill, and uneven terrain after about 3000, 1000, and 4000 training episodes, respectively. In addition, to compare the performance of the obtained actors (see the table below), each set of actor parameters was used for 100 one-minute control runs under various ground conditions (a flat plane, three downhills, two uphills, and two kinds of rough surface), and the number of successes is shown in Fig. 1. The results show that, depending on which learned parameters are used, the proposed method tends to achieve more successes than the manually set parameters on downhill, uphill, and rough surfaces. The authors attribute this improvement to the intrinsic noise added by the stochastic actors, which makes the policy robust to external noise from the environment.
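As a rough illustration of the rough-terrain condition, the sketch below builds a piecewise-linear ground profile whose straight sections each get a slope drawn uniformly from [-0.1, 0.1] rad. Only the slope range comes from the description above; the section length, the number of sections, and the function name are assumptions.

```python
import math
import random

def rough_terrain(n_sections, section_length=0.5, max_slope=0.1, seed=0):
    """Return (x, height) vertices of a piecewise-linear ground profile.

    Each straight section has a slope (in rad) drawn uniformly from
    [-max_slope, max_slope]; section_length is an assumed value.
    """
    rng = random.Random(seed)
    points = [(0.0, 0.0)]
    for _ in range(n_sections):
        slope = rng.uniform(-max_slope, max_slope)
        x, height = points[-1]
        points.append((x + section_length * math.cos(slope),
                       height + section_length * math.sin(slope)))
    return points

terrain = rough_terrain(n_sections=20)
```

A new profile per episode (by changing `seed`) would expose the controller to a different sequence of slopes each time, which is one plausible way to set up such a task.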
In this study, we introduce a CPG structure into the Actor-Critic method of RL to stabilize a robot's bipedal walking and to learn parameters that take rhythmic motion into account. In the proposed method, the CPG structure is introduced into the Actor part, which outputs control signals that take the CPG into account, in order to derive the parameters that realize stable walking. The simulations confirm that learning reduces the number of falls and improves the robustness of the system, realizing stable walking on various ground surfaces. These results are expected to provide an alternative to genetic algorithms (GAs), since the method does not need to define a population from which the best individuals are derived, as GAs often do.
As an additional evaluation, the study profiles the learning of the Actor when its initial parameters are changed. The paper shows that in some cases the performance of the controller improves significantly after a few hundred episodes, while in other cases it decreases after a few hundred episodes: for example, the average reward fails to increase over as many as 10,000 episodes because of repeated failures, or the robot falls over after only a few learning episodes. To address these issues, the following improvements are considered: employing additional basis functions in the Critic, and generating the policy parameters probabilistically from a normal distribution, which turns the method into a kind of ensemble learning. With gradient methods, the local optimum that is reached can depend on the initial training conditions, but ensemble learning removes this dependence on the initial conditions, presumably resulting in a more robust controller. The paper reports that these improvements resulted in 22 successes out of 50 runs.
In another analysis, the GA and the proposed method are compared, and the proposed method is shown to be comparable to simulations using the parameters obtained by the GA after 100 generations. The evaluation points out that the GA showed a gradual improvement of the population as the evolution progressed, while the RL of the proposed method showed a more rapid change in the average reward. These results indicate that the RL method is suited to improving a promising individual, because it trains one individual many times, while the GA is suited to searching for the best individuals within a population.
On the other hand, applying the method to real robots raises issues, especially breakage due to falling over. Since the proposed method requires several thousand training episodes, there is a high possibility that the control unit will be damaged if the robot falls over repeatedly. As a solution to this problem, the paper proposes training the controller in a simulator first and then retraining it to control the real robot.