EUREKA: Automated Reward Design With LLMs

RLHF 04/12/2023

3 main points
✔️ Proposed EUREKA, a method for autonomous reward design
✔️ Combines LLM's code generation capabilities with evolutionary optimization
✔️ Demonstrates better performance than manually designed reward functions, and is shown to be applicable to curriculum learning and gradient-free RLHF

Eureka: Human-Level Reward Design via Coding Large Language Models
written by Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, Anima Anandkumar
(Submitted on 19 Oct 2023)
Comments: Project website and open-source code: this https URL
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

Reinforcement learning has achieved excellent results in a variety of domains, but in real-world applications it is often difficult to design an appropriate reward function.

The research introduced here attempts to address this problem with large language models (LLMs), which have recently become popular.

Let's take a closer look at how large language models and reinforcement learning are connected.

The Reward Design Problem

Reinforcement learning is a framework in which an agent receives an observation $O$ and learns a policy $\pi$ that outputs an action $A$ so as to maximize an objective called the reward function $R$.

Reinforcement learning is commonly used when there is some task (objective) the agent needs to accomplish, but it is not obvious by what process the objective can be achieved. For example, when balancing a stick on the palm of a robot hand, it is difficult for a human to write out the control commands step by step, so the agent is instead made to learn the control method autonomously through the exploratory capabilities of reinforcement learning.

However, the more complex the task, the larger the gap between the reward function $R$ that the agent optimizes and the measure of performance that we humans actually care about (here called the fitness function $F$).

Training the agent to maximize $F$ directly is possible in principle, but it can take an enormous amount of time for the agent to find a solution.

Designing a reward function $R$ such that an agent trained on it maximizes the performance measure $F$ is therefore known to be difficult. This is called the Reward Design Problem (RDP) [Singh et al. 2010].

The paper presented here addresses the Reward Generation Problem: given a string describing a task, output a reward function $R$, written as code, that maximizes the fitness function $F$.
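The relationship between $R$ and $F$ can be written compactly. The following is a sketch under assumptions: the symbols $\mathcal{R}$ (the space of candidate reward functions) and $\mathcal{A}$ (the learning algorithm that turns a reward function into a trained policy) are introduced here for illustration and do not appear in the text above.

$$
R^{*} = \arg\max_{R \in \mathcal{R}} F\big(\mathcal{A}(R)\big), \qquad \pi = \mathcal{A}(R)
$$

In words: the agent never optimizes $F$ directly. It optimizes $R$, and the quality of $R$ is judged by the fitness $F$ of the policy $\pi$ that results.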

Methodology: Details of EUREKA

To address the reward design problem above, the paper proposes a method called EUREKA, which consists of three components: presenting information about the environment, evolutionary optimization, and reflection on the reward function.

Presentation of information about the environment

Since the LLM accepts text as input, the question is how to convert the structure of the environment into text. EUREKA takes the simple approach of feeding the raw source code of the environment directly to the LLM.

Because the LLM itself was trained on code, it can put that ability to good use. The expectation is that, given the environment code, it can extract what the environment means and which variables are available for use in a reward function.
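As a rough illustration of this idea, the sketch below assembles a prompt from environment source code and a task description. Everything in it is an assumption made for illustration: the environment `ToyBalanceEnv`, the variable names, the function `build_reward_prompt`, and the prompt wording are hypothetical and are not taken from the paper.

```python
# Hypothetical environment source, standing in for a real simulator task file.
ENV_SOURCE = '''\
class ToyBalanceEnv:
    def compute_observations(self):
        self.pole_angle = self.sim.get_pole_angle()
        self.pole_velocity = self.sim.get_pole_velocity()
'''


def build_reward_prompt(env_source: str, task_description: str) -> str:
    """Assemble a text prompt containing the task and the raw environment code."""
    return "\n".join([
        "You are designing a reward function for a reinforcement learning task.",
        f"Task: {task_description}",
        "Environment source code:",
        env_source,
        "Write a Python function compute_reward(...) using only variables defined above.",
    ])


prompt = build_reward_prompt(ENV_SOURCE, "keep the pole upright")
print(prompt)
```

The point of passing the code verbatim is that variable names such as `pole_angle` reach the LLM unchanged, so a generated reward function can refer to them directly.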

Evolutionary optimization

In each iteration, 16 candidate reward functions are generated and the best-performing one is selected as the target for improvement, from which 16 new candidates are generated. This is repeated for 5 iterations. In addition, to reduce dependence on the initial candidates, the whole search is run 5 times.
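The loop described above can be sketched as follows. This is a minimal toy version, not the authors' implementation: in EUREKA a candidate is reward function code proposed by an LLM and the score comes from training a policy, whereas here a candidate is a single number and both `toy_propose` and `toy_evaluate` are hypothetical stand-ins.

```python
import random


def evolutionary_search(propose, evaluate, n_samples=16, n_iterations=5, n_restarts=5):
    """Return the best (candidate, score) pair found across all restarts."""
    best_overall = None
    for _ in range(n_restarts):
        best = None  # best (candidate, score) within this run
        for _ in range(n_iterations):
            parent = best[0] if best is not None else None
            candidates = propose(parent, n_samples)
            scored = [(c, evaluate(c)) for c in candidates]
            top = max(scored, key=lambda pair: pair[1])
            # Keep the previous best if no new candidate improves on it.
            if best is None or top[1] > best[1]:
                best = top
        if best_overall is None or best[1] > best_overall[1]:
            best_overall = best
    return best_overall


rng = random.Random(0)


def toy_propose(parent, n):
    """Stand-in for the LLM: sample fresh candidates, or mutate the parent."""
    if parent is None:
        return [rng.uniform(-10.0, 10.0) for _ in range(n)]
    return [parent + rng.gauss(0.0, 1.0) for _ in range(n)]


def toy_evaluate(w):
    """Stand-in for the fitness F: highest when w is 3."""
    return -(w - 3.0) ** 2


best_candidate, best_score = evolutionary_search(toy_propose, toy_evaluate)
print(best_candidate, best_score)
```

With the default settings this evaluates 16 x 5 x 5 = 400 candidates in total, which makes clear why each evaluation (a full reinforcement learning run in the real method) dominates the cost.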

Reflection on the reward function

To evaluate a reward function accurately, it is necessary to be able to explain which parts of it are working and how. For this purpose, the value of each term of the reward function is recorded during training so that it can be referred to later.
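One way to picture this is a reward function that returns its individual terms alongside the total, plus a helper that turns the logged terms into text. The sketch below is illustrative only: `toy_reward`, its two terms, and the summary format are assumptions, not the paper's actual reward functions or feedback template.

```python
from statistics import mean


def toy_reward(distance, velocity):
    """Hypothetical reward: returns the total and each named term."""
    components = {
        "distance_reward": -distance,
        "velocity_penalty": -0.1 * velocity ** 2,
    }
    return sum(components.values()), components


def summarize_components(history):
    """Turn a list of per-step component dicts into a short text report."""
    lines = []
    for name in history[0]:
        values = [step[name] for step in history]
        lines.append(
            f"{name}: mean={mean(values):.2f}, "
            f"min={min(values):.2f}, max={max(values):.2f}"
        )
    return "\n".join(lines)


history = []
for distance, velocity in [(2.0, 1.0), (1.0, 2.0)]:
    total, components = toy_reward(distance, velocity)
    history.append(components)

report = summarize_components(history)
print(report)
```

A text report of this kind can be given back to the LLM together with the task score, so that it can see, for example, that one term barely changes during training and may need rescaling.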

Experiment

The experiments evaluate EUREKA's performance in various robot environments and tasks.

GPT-4 is mainly used as the LLM.

Environment

Ten different robots are used as agents, and 29 different tasks are evaluated on the Isaac Gym simulator.

The tasks include robot control tasks with quadrupeds, bipeds, and arms (Isaac), and tasks requiring dexterous hand control such as object delivery and cup rotation (Dexterity).

Baseline

The following three baselines are used to evaluate EUREKA's performance.

Sparse: A function that indicates the success or failure of a task. Identical to the fitness function $F$.

H: The original reward function defined by the reinforcement learning researchers who designed the task.

L2R: A reward function design method using LLMs proposed by [Yu et al., 2023]. The environment and task are described in natural language, which is input to the first LLM to generate text describing the agent's behavior. This is then input into a second LLM, which designs the reward function code using pre-prepared reward function primitives.

Experimental results

The following graph compares the performance of agents trained using reward functions from various methods.

EUREKA performed above human level on all Isaac tasks and on 15 of the 20 Dexterity tasks.

The following graph shows how performance changes when the best policy at each iteration of evolutionary optimization is used. Performance increases consistently as evolutionary optimization proceeds.

The graph below plots, for each task, the correlation between the rewards generated by EUREKA (Eureka Rewards) and those designed by humans (Human Rewards) on the vertical axis, against the relative performance of Eureka Rewards versus Human Rewards on the horizontal axis.

Overall, a weak positive correlation was found between Eureka Rewards and Human Rewards. For some tasks, however, the correlation was even weaker, and some generated reward functions were negatively correlated with the human ones.

This suggests that EUREKA can design reward functions that humans would be unlikely to find.
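To make the correlation measure concrete, the sketch below computes a correlation coefficient between two reward signals evaluated on the same states. It assumes a Pearson coefficient, which the text above does not specify, and the sample values are made up purely for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Made-up rewards that two reward functions assign to the same four states.
human_rewards = [0.1, 0.4, 0.5, 0.9]
eureka_rewards = [0.8, 0.5, 0.6, 0.2]

print(pearson(human_rewards, eureka_rewards))
```

A value near 1 means the two reward functions rank states similarly, while a negative value means the generated reward favors states that the human reward penalizes, which is the interesting case discussed above.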

The graph below shows EUREKA's performance on the difficult task of pen spinning.

Pre-trained is a policy that was first trained with EUREKA to reorient the pen in the hand, while Fine-Tuned is that policy fine-tuned with EUREKA for pen spinning.

Scratch, by contrast, is a policy trained on pen spinning from the beginning, without going through this two-step learning process.

The results show that pen spinning is achieved successfully only with the two-step learning process (curriculum learning) using EUREKA.

These results show that EUREKA can be combined with existing learning techniques, such as curriculum learning, to autonomously design rewards for highly difficult tasks.

For some tasks, the fitness function $F$ may not be available. For such cases, the paper's experiments show that it is possible to receive human feedback as text and use it to improve the reward function.

For example, on a humanoid walking task, a variant called EUREKA-HF, which improves the reward function based on textual human feedback rather than $F$, produced behavior that was more aligned with human preferences than plain EUREKA.

EUREKA-HF is notable as a new RLHF method that does not require gradient computation.
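The difference between the two variants can be pictured as a change in what feedback is sent back to the LLM between iterations. The sketch below is an assumption-laden illustration: the function `build_feedback`, its parameters, and the message wording are hypothetical and are not the paper's actual prompts.

```python
from typing import Optional


def build_feedback(
    component_summary: str,
    fitness: Optional[float] = None,
    human_feedback: Optional[str] = None,
) -> str:
    """Compose the text sent back to the LLM after one round of training."""
    if human_feedback is not None:
        # Human-feedback variant: free-form text replaces the fitness value.
        signal = f"Human feedback on the learned behavior: {human_feedback}"
    elif fitness is not None:
        signal = f"Task fitness F: {fitness:.3f}"
    else:
        raise ValueError("need either a fitness value or human feedback")
    return f"{component_summary}\n{signal}\nPlease propose an improved reward function."


auto_message = build_feedback("speed_reward: mean=0.80", fitness=0.42)
hf_message = build_feedback(
    "speed_reward: mean=0.80",
    human_feedback="The robot leans too far forward while walking.",
)
print(hf_message)
```

Because the feedback only changes the text the LLM sees, no gradient flows from the human signal to any model, which is the sense in which the method is gradient-free.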

Summary

The paper presented in this article proposed EUREKA, which combines an LLM with evolutionary optimization to enable autonomous reward design.

It is a versatile method that achieves high performance without task-specific prompt engineering or human intervention.

It can be expected to be applied to a variety of problems in the future.

Abe