Success In Generating Various Robot Motions With LLM
3 main points
✔️ A low-level robot controller driven by an LLM, with no additional training
✔️ LLM outputs a reward function rather than control commands to the robot
✔️ Two stages: description of the desired behavior and output of the reward function
Language to Rewards for Robotic Skill Synthesis
written by Wenhao Yu, Nimrod Gileadi, Chuyuan Fu, Sean Kirmani, Kuang-Huei Lee, Montse Gonzalez Arenas, Hao-Tien Lewis Chiang, Tom Erez, Leonard Hasenclever, Jan Humplik, Brian Ichter, Ted Xiao, Peng Xu, Andy Zeng, Tingnan Zhang, Nicolas Heess, Dorsa Sadigh, Jie Tan, Yuval Tassa, Fei Xia
(Submitted on 14 Jun 2023 (v1), last revised 16 Jun 2023 (this version, v2))
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Large language models (LLMs) such as ChatGPT have attracted enormous attention recently, and their applications in many fields are being explored. Robotics is one of them: LLMs have been actively studied as a way to generate robot behaviors, typically by combining a library of pre-defined basic motions into newly constructed ones. On the other hand, it is difficult for an LLM to directly output robot control inputs, i.e., to generate the basic motions themselves, even when it is given knowledge about the robot. Since designing a robot's basic behaviors requires specialized expertise and tedious manual work, generating them with an LLM is highly desirable. The paper presented in this article introduces a clever mechanism that allows an off-the-shelf LLM such as ChatGPT to skillfully control a robot without any additional training.
The authors of the original paper have a website that shows the robot in action and a conceptual diagram, so please check it out as well.
There are two key points to the methodology of the paper.
The first point is that the LLM outputs a reward function rather than control inputs. As noted above, it has been difficult for LLMs to directly generate the control commands that realize a robot's basic behaviors. The authors therefore have the LLM generate a reward function, which is the source of the behavior, instead of the control commands themselves. The idea is that the reward function can bridge the gap between the "instruction of a movement" and the "generation of control commands." The figure below illustrates this concept clearly.
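As a minimal sketch of this idea (all names below are hypothetical illustrations, not from the paper), the LLM returns reward-function source code rather than motor commands, and the compiled reward is what a downstream optimizer consumes:

```python
def reward_from_instruction(instruction, llm):
    """Ask an LLM (any callable str -> str) for reward-function code,
    then compile it into a Python callable r(state)."""
    src = llm(instruction)   # the LLM emits code, not torques
    namespace = {}
    exec(src, namespace)     # assumes the emitted code defines `reward`
    return namespace["reward"]

# Stand-in "LLM" returning canned reward code, for illustration only.
def fake_llm(instruction):
    return "def reward(state): return -abs(state['height'] - 0.3)"

r = reward_from_instruction("keep the torso 0.3 m high", fake_llm)
# r(state) can now be handed to an optimizer such as MPC.
```

The point of the indirection is that the LLM's output lives in the space of objectives, where language maps naturally, rather than in the space of motor commands, where it does not.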
The second point is how the reward function is produced. Simply instructing the behavior in natural language is not enough to obtain a good reward function; the more complex the behavior, the harder the reward function is to generate. This study therefore adopts a two-stage framework: the LLM first produces a detailed description of the desired behavior, and then outputs a reward function corresponding to that description. This design rests on two findings: an LLM can easily output a reward function for simple motions, and a complex motion can be described as a combination of simple ones. The reward function is also well suited to such decomposition, since, as shown below, it is a weighted linear sum of individual reward terms.
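The linear-sum structure can be sketched as follows (a toy illustration; the term functions, state keys, and weights are made up, not taken from the paper):

```python
def total_reward(state, weighted_terms):
    # weighted_terms: list of (weight, term_fn) pairs, where each
    # term_fn scores one simple aspect of the motion on its own.
    return sum(w * term(state) for w, term in weighted_terms)

# Example: a complex behavior expressed as simple, independent terms.
terms = [
    (1.0, lambda s: -abs(s["torso_height"] - 0.3)),   # keep torso at 0.3 m
    (0.5, lambda s: -abs(s["forward_speed"] - 1.0)),  # move at 1 m/s
]
```

Because each term captures one simple sub-behavior, the LLM only ever has to emit the kind of pieces it handles well, and the linear sum composes them into the complex behavior.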
Model predictive control (MPC) is an optimization method that predicts the system's behavior over a finite future horizon and determines the control inputs that maximize (or minimize) the sum of the reward function over that horizon. Compared with reinforcement learning methods, MPC requires no training, which makes results easy to inspect, and it is relatively robust; this is why it is used here.
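A toy, sampling-based sketch of the receding-horizon loop on a 1-D point mass may make this concrete (the paper uses a full MuJoCo-based MPC; everything below is a simplified stand-in with made-up dynamics and constants):

```python
import numpy as np

def rollout_reward(x, v, actions, dt=0.1, target=1.0):
    # Simulate a 1-D point mass over the horizon and sum the reward
    # (negative squared distance to the target) at each step.
    total = 0.0
    for a in actions:
        v += a * dt
        x += v * dt
        total += -(x - target) ** 2
    return total

def mpc_step(x, v, horizon=10, samples=256, rng=None):
    # Sampling-based MPC: draw random bounded action sequences, score
    # each by its predicted cumulative reward, and execute only the
    # first action of the best sequence (receding horizon).
    rng = rng if rng is not None else np.random.default_rng(0)
    candidates = rng.uniform(-1.0, 1.0, size=(samples, horizon))
    scores = [rollout_reward(x, v, seq) for seq in candidates]
    return candidates[int(np.argmax(scores))][0]
```

Run in closed loop, this drives the mass toward the target using nothing but the reward function, which is exactly the property that lets an LLM-written reward translate directly into motion without any learning.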
In the simulation experiments, two quadruped robots and a manipulator perform a variety of tasks, as shown in the figure below. On these tasks, the proposed method is compared against two baselines: Reward Coder, which outputs the reward function directly without first describing the motion, and Code-as-Policies, which generates motions by combining simple pre-acquired skills. The results, shown in the figure below, indicate that the proposed system generated an overwhelmingly wider variety of behaviors than the baselines.
As you can see in the video, the most striking example is probably the quadruped robot performing a moonwalk: in response to the verbal instruction "Robot dog, do a moonwalk," the robot actually generates a moonwalking motion.
The authors also conduct real-robot experiments on object manipulation with a robotic arm. The physical hardware has limited performance, and speeds achievable in the simulator are sometimes impossible on the real arm. For this reason, a penalty term on the arm's speed is added to the reward function. In addition, by combining camera images with depth information from LiDAR to accurately estimate object positions, motions such as lifting an apple or a Rubik's cube are achieved on the real robot.
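Such a penalty term might look like the following (a hedged sketch; the function name, speed limit, and weight are illustrative assumptions, not values from the paper):

```python
def reward_with_speed_penalty(base_reward, ee_speed,
                              max_speed=0.5, weight=2.0):
    # Penalize end-effector speeds above a hardware-safe limit, so
    # motions optimized in simulation stay executable on the real arm.
    excess = max(0.0, ee_speed - max_speed)
    return base_reward - weight * excess ** 2
```

Speeds at or below the limit are untouched, while the quadratic penalty grows rapidly beyond it, steering the optimizer toward plans the real hardware can track.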
The framework of first describing the target behavior and then outputting the reward function has made it possible to obtain a low-level controller from an LLM. Reward function design is normally a step that humans grind through using expertise and trial and error, so being able to do it automatically through natural language is an attractive prospect.
As for future extensions, three points are mentioned in the paper. First, behavior descriptions rely on templates: at present, good descriptions are not generated automatically by the LLM but come from carefully designed templates (prompts). Second, it is difficult to generate behaviors that cannot easily be described in language (e.g., "walk gracefully"); a possible solution is a system that accepts multimodal input, such as a video of the desired behavior. Finally, while the LLM automatically determines the weights and parameters of the reward terms, the reward terms themselves are pre-designed by humans. This keeps the system stable but sacrifices some flexibility. The authors state that being able to design the reward function from scratch while ensuring stability is an important research direction.