Robot Successfully Runs Hiking Trail
3 main points
✔️ Two-stage policy learning: a teacher policy trained with privileged information and a student policy distilled from it
✔️ Exteroceptive information is integrated while accounting for sensor uncertainty
✔️ The robot completes a hiking trail at the same pace as a human
Learning robust perceptive locomotion for quadrupedal robots in the wild
written by Takahiro Miki, Joonho Lee, Jemin Hwangbo, Lorenz Wellhausen, Vladlen Koltun, Marco Hutter
(Submitted on 20 Jan 2022)
Comments: Science Robotics, 19 Jan 2022, Vol 7, Issue 62.
Subjects: Robotics (cs.RO)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
In this article, we introduce a paper that combines exteroceptive information obtained from sensors with proprioceptive information, such as the position of each joint, to build a quadruped robot that walks steadily and quickly enough to complete a hiking course at the same pace as a human. This paper is a follow-up to the one introduced in a previous article; the main difference is that the robot can now adapt to the terrain by using exteroceptive information.
There is a video of the robot in action on the website of the authors of the original paper, so please check it out for a better understanding.
Results
First, let us look at the performance of the resulting controller. The paper evaluates it from several viewpoints, including indoor and underground experiments, but here we introduce the most striking result: a hike in the Alps.
The robot was tested on a hiking course on Mount Etzel in Switzerland. The course is 2.2 km long with an elevation difference of 120 m, and completing it requires crossing steep slopes, high steps, rocky surfaces, slippery ground, and tree roots. The hiking planner's difficulty index, which takes the intensity of the exercise into account, rates it as the most difficult of its three levels.
The robot running the proposed method completed this difficult course in 78 minutes without falling once. The hiking planner's suggested time for the round trip is 76 minutes, so the robot is roughly as fast as a human. Furthermore, on the ascent it reached the halfway point in 31 minutes, four minutes faster than the expected human time.
Method
This section describes the method used to build the controller. It consists of the following three stages:
- Learning the teacher policy
- Learning the student policy
- Deployment in the real world
We'll look at each of these in more detail next.
Learning the teacher policy
First, a teacher policy that uses privileged information to select good actions is trained; it later serves as the reference when training the student policy. The policy learns to follow a commanded velocity on randomly generated terrain. The learning algorithm is PPO, and the policy is represented by a multilayer perceptron (MLP) trained to output the mean and variance of each action.
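As a rough illustration of how such a PPO actor can be set up, here is a minimal PyTorch sketch of an MLP that outputs the mean of each action together with a learned standard deviation. The layer sizes are placeholders, not the paper's values.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Minimal PPO-style actor: an MLP produces the action mean,
    and a learned log-std gives the per-action variance."""
    def __init__(self, obs_dim, act_dim=16, hidden=(256, 160)):
        super().__init__()
        layers, last = [], obs_dim
        for h in hidden:
            layers += [nn.Linear(last, h), nn.LeakyReLU()]
            last = h
        self.body = nn.Sequential(*layers)
        self.mean_head = nn.Linear(last, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.mean_head(self.body(obs))
        return torch.distributions.Normal(mean, self.log_std.exp())
```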
Observations and actions
The observation is composed of three parts: the proprioceptive (internal) observation, the exteroceptive (external) observation, and privileged information. The proprioceptive observation includes the position and velocity of the body and of each joint, and the exteroceptive observation consists of terrain height samples taken at five different radii around each foot. Privileged information covers quantities that are useful for walking but cannot be observed in the real world, such as the friction coefficient of the ground.
Central pattern generators (CPGs) are used to construct the action space. A nominal foot-tip trajectory realizing a basic periodic stepping motion is prepared in advance, so that specifying a leg and its phase determines its foot-tip position. The corresponding joint positions are then obtained by solving the inverse kinematics analytically. This alone is enough to walk on flat ground; to gain the flexibility needed for harder terrain, the policy fine-tunes the phase of each leg and the position of each joint. The action space therefore has 16 dimensions: 4 phase corrections for the 4 legs and 12 position corrections for the 12 joints.
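The following is a minimal sketch of how such a 16-dimensional action could be turned into joint targets under a CPG scheme of this kind. The foot-trajectory shape and the stand-in for the analytic inverse kinematics are our own simplifications, not the paper's.

```python
import numpy as np

def foot_height(phase):
    # Crude periodic foot-tip trajectory: an arc during swing
    # (first half of the cycle) and ground contact during stance.
    return np.where(np.sin(phase) > 0.0, 0.08 * np.sin(phase), 0.0)

def apply_action(action, leg_phases, base_freq=1.25, dt=0.02):
    """Map the 16-D action (4 phase offsets + 12 joint residuals)
    to joint-position targets. Purely illustrative."""
    phase_offset, joint_residual = action[:4], action[4:]
    leg_phases = (leg_phases + 2 * np.pi * base_freq * dt + phase_offset) % (2 * np.pi)
    foot_z = foot_height(leg_phases)               # target foot-tip heights
    # Stand-in for the analytic inverse kinematics: broadcast each
    # foot target to its 3 joints, then add the learned residuals.
    joint_targets = np.repeat(foot_z, 3) + joint_residual
    return leg_phases, joint_targets
```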
Policy Architecture
The teacher policy consists of three MLPs: an encoder for the exteroceptive observation, an encoder for the privileged information, and the main network. The two encoders first compress the exteroceptive observation and the privileged information into compact latent representations, which are then fed to the main MLP together with the proprioceptive observation. Since the encoders are reused when training the student policy, they also help transfer the knowledge acquired by the teacher.
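A minimal sketch of this three-MLP arrangement might look as follows; the observation and latent dimensions are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    layers = []
    for i in range(len(sizes) - 1):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.LeakyReLU()]
    return nn.Sequential(*layers[:-1])  # no activation after the last layer

class TeacherPolicy(nn.Module):
    """Exteroceptive encoder + privileged-information encoder + main MLP."""
    def __init__(self, proprio_dim=48, extero_dim=208, priv_dim=50, act_dim=16):
        super().__init__()
        self.extero_enc = mlp([extero_dim, 128, 64])  # height samples -> latent
        self.priv_enc = mlp([priv_dim, 64, 24])       # privileged info -> latent
        self.main = mlp([proprio_dim + 64 + 24, 256, 160, act_dim])

    def forward(self, proprio, extero, priv):
        latent = torch.cat(
            [proprio, self.extero_enc(extero), self.priv_enc(priv)], dim=-1)
        return self.main(latent)  # action mean; variance handled as in PPO
```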
Reward
The reward combines positive terms for following the commanded velocity and negative terms for violating constraints. For the positive term, if the projection of the body's velocity onto the commanded direction reaches the target speed, the reward is +1; below that, the reward decays exponentially with the shortfall. The negative terms encourage smoother motion, for example constraints on body orientation and penalties on quantities such as joint velocities and collisions.
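As a concrete, simplified example, the positive velocity-tracking term could look like the sketch below; the exact decay constant is an assumption on our part.

```python
import numpy as np

def velocity_tracking_reward(base_vel_xy, command_dir, target_speed=0.6):
    """+1 once the velocity projected onto the commanded direction
    reaches the target speed; exponential decay of the reward below it."""
    v_proj = float(np.dot(base_vel_xy, command_dir))
    if v_proj >= target_speed:
        return 1.0
    return float(np.exp(-(v_proj - target_speed) ** 2 / 0.25))  # decay width assumed
```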
Curriculum
To gradually improve the performance of the policy, two curricula are used: one adaptively generates terrain, and the other varies the noise and reward coefficients during training. For the first, a particle-filter-like method generates the terrain that is most effective for the agent to learn from at its current skill level. For the second, the coefficients that weight the noise and the penalty terms of the reward are varied with the training iteration. Unlike the adaptive terrain generation, this schedule is determined simply by the iteration count.
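For the second curriculum, an iteration-dependent schedule suffices. One common choice, used here purely as an illustration and not taken from the paper, is a logistic ramp of the penalty weight.

```python
import math

def penalty_weight(iteration, start=0.05, growth=0.002):
    """Illustrative logistic curriculum: the weight of the penalty terms
    starts small, grows smoothly with the iteration, and saturates at 1."""
    return 1.0 / (1.0 + (1.0 - start) / start * math.exp(-growth * iteration))
```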
Learning the student policy
Once a teacher policy that can traverse a variety of terrains using privileged information is available, its knowledge is distilled into a student policy that acts using only information also available in the real world. The same environment as for teacher training is used, but noise is added to the exteroceptive observation. This noise is designed to simulate the various errors in exteroception that might be encountered in the real world.
Of course, no sensor can directly observe the privileged information, so the method uses the sequence of exteroceptive and proprioceptive observations to capture this unobservable information as a belief state.
The student policy consists of a recurrent belief state encoder and an MLP. The belief state encoder takes the proprioceptive observation, a latent representation of the noisy exteroceptive observation, and the hidden state of the recurrent network, and returns a belief state. The action is decided by feeding this belief state and the proprioceptive observation into the policy MLP. The belief state is trained to match the latent representations of the noise-free exteroceptive observation and the privileged information, so that it carries information about the environment that helps the robot walk.
The MLP has the same architecture as in the teacher policy, so the teacher's trained network can be reused without modification, which is useful for speeding up the early stage of training.
Training is done by supervised learning that minimizes two losses: a behavior-cloning loss and a reconstruction loss. The behavior-cloning loss is the squared error between the actions of the teacher and student policies and pushes the student to behave like the teacher. The reconstruction loss is the squared error between the noise-free height samples and privileged information on the one hand, and their reconstructions decoded from the belief state on the other.
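A minimal sketch of the student training loss under these two terms might look like this; `student`, `teacher`, and `decoder` are assumed to expose the interfaces used below.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, decoder, proprio, extero_noisy,
                      extero_clean, priv, hidden):
    """One step's loss: behavior cloning + reconstruction from the belief state."""
    with torch.no_grad():
        teacher_action = teacher(proprio, extero_clean, priv)
    student_action, belief, hidden = student(proprio, extero_noisy, hidden)
    # Behavior-cloning loss: imitate the teacher's action.
    loss_bc = F.mse_loss(student_action, teacher_action)
    # Reconstruction loss: decode noise-free exteroception and
    # privileged information from the belief state.
    extero_hat, priv_hat = decoder(belief)
    loss_rec = F.mse_loss(extero_hat, extero_clean) + F.mse_loss(priv_hat, priv)
    return loss_bc + loss_rec, hidden
```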
Height sample randomization
During training of the student policy, a parameterized noise model injects random noise into the sampled height information, i.e. the exteroceptive observation. Two types of measurement error are modeled:
- Horizontal shift of the measurement points
- Vertical offset of the measured heights
Both are sampled from normal distributions, whose variances depend on the mapping reliability discussed below.
The two noise types are applied at three scopes, each with its own variance: per scan point, per foot, and per episode. The per-scan-point and per-foot noise is resampled at every time step, whereas the per-episode noise is fixed for the duration of an episode.
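A minimal sketch of such a three-scope noise model is shown below. Only the vertical offsets are modeled here; the horizontal shifts would be handled analogously by perturbing the sample coordinates, and the standard deviations are placeholders, not the paper's values.

```python
import numpy as np

def noisy_height_scan(heights, sigma_point, sigma_foot, episode_offset, rng):
    """heights: (4, n_points) clean height samples, one row per foot.
    Per-point and per-foot noise are resampled at every time step,
    while episode_offset stays fixed for the whole episode."""
    per_point = rng.normal(0.0, sigma_point, size=heights.shape)
    per_foot = rng.normal(0.0, sigma_foot, size=(heights.shape[0], 1))
    return heights + per_point + per_foot + episode_offset

rng = np.random.default_rng(0)
heights = np.zeros((4, 52))             # flat ground, 52 samples per foot (assumed)
episode_offset = rng.normal(0.0, 0.1)   # sampled once per episode
scan = noisy_height_scan(heights, 0.01, 0.03, episode_offset, rng)
```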
In addition, to account for different real-world conditions, the mapping reliability is sampled at the start of each episode and the noise magnitude is set accordingly. Three regimes are used: little or no noise, large offsets, and large noise, with probabilities of 60%, 30%, and 10%, respectively.
Finally, to learn smooth transitions between different terrains, the simulated terrain is divided into a grid, and a different offset, set per cell, is added to the height scan.
Belief State Encoder
To integrate the proprioceptive and exteroceptive information, a gated encoder is introduced, an idea inspired by gated RNNs and multimodal information fusion.
The encoder learns an adaptive gating factor α that controls how much of the exteroceptive information is allowed to pass through.
First, the proprioceptive observation and the latent representation of the noisy exteroceptive observation, together with the hidden state, are processed by the RNN to obtain an intermediate belief state. This intermediate belief state is then used to compute the gating vector α, which determines how much exteroceptive information is incorporated into the final belief state.
Similar gates are used in the decoder that reconstructs the privileged information and the exteroceptive observation. The RNN is implemented as a gated recurrent unit (GRU).
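A compact sketch of such a gated belief encoder, with a GRU and a learned gate α scaling the exteroceptive latent, could look like this; dimensions and wiring details are simplifications rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class GatedBeliefEncoder(nn.Module):
    """Fuse proprioception with a noisy exteroceptive latent through a GRU
    and a learned gate alpha that scales how much exteroception gets in."""
    def __init__(self, proprio_dim=48, extero_latent=64, hidden=100, belief=120):
        super().__init__()
        self.gru = nn.GRU(proprio_dim + extero_latent, hidden, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(hidden, extero_latent), nn.Sigmoid())
        self.out = nn.Linear(hidden + extero_latent, belief)

    def forward(self, proprio, extero_latent, h):
        x = torch.cat([proprio, extero_latent], dim=-1).unsqueeze(1)
        y, h = self.gru(x, h)                 # intermediate belief state
        y = y.squeeze(1)
        alpha = self.gate(y)                  # adaptive gate in [0, 1]
        belief = self.out(torch.cat([y, alpha * extero_latent], dim=-1))
        return belief, alpha, h
```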
Deployment in the real world
Finally, we describe the real-world deployment. ANYmal C is used as the robot platform, with two different sensor configurations: a dome LiDAR and Intel RealSense D435 depth cameras. The policy is trained with PyTorch and deployed zero-shot, without any fine-tuning. The elevation map is updated at 20 Hz and the policy runs at 50 Hz.
The mapping is parallelized on the GPU to keep it fast, which plays an important role in maintaining stable processing and a quick walking speed.
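To make the two rates concrete, here is an illustrative 50 Hz control loop that simply reads the most recent elevation map, which is produced asynchronously at roughly 20 Hz. The callables passed in are hypothetical stand-ins, not an actual ANYmal interface.

```python
import time

def control_loop(policy, get_latest_elevation_map, read_proprioception,
                 send_joint_targets, rate_hz=50.0):
    """Illustrative fixed-rate control loop for the deployed student policy."""
    period = 1.0 / rate_hz
    hidden = None
    while True:
        t0 = time.monotonic()
        proprio = read_proprioception()
        extero = get_latest_elevation_map()   # updated at ~20 Hz elsewhere
        action, hidden = policy(proprio, extero, hidden)
        send_joint_targets(action)
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
```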
Summary
Exteroception enables fast and graceful locomotion by anticipating the terrain, allowing the robot to adjust its gait before contact occurs. When exteroception is incomplete or wrong, the controller smoothly falls back to relying on proprioception alone. Because the integration of exteroception and proprioception is learned end to end, the method remains tractable without heuristics or hand-crafted rules.
As for future extensions, three points are mentioned in the paper. The first is to explicitly exploit the uncertainty of the information. Currently, when no reliable exteroceptive information is available, the controller relies only on proprioception and assumes the terrain is flat. A more practical controller could instead act to gather information when it is uncertain, for example by probing the ground with one leg to check its state. The second point concerns the handling of exteroceptive data. The sensor data are not used directly; instead, an intermediate representation called an elevation map makes the model independent of the sensor type. However, some information is lost when building the elevation map, so using the raw sensor data directly for control could be more effective. Finally, the method cannot realize motions far from normal walking; for example, it cannot recover when a leg is stuck in a narrow hole or climb over a very high step.