[RL-GPT] A Framework That Acquires Diamonds in Minecraft Several Times Faster Than Usual Is Now Available

Machine Learning 18/04/2024

3 main points
✔️ RL-GPT is a new framework that combines large language models (LLMs) and reinforcement learning (RL).
✔️ In RL-GPT, two agents, one fast and one slow, work together to perform a task.
✔️ RL-GPT outperforms traditional methods and can obtain diamonds in Minecraft in less than a day.

RL-GPT: Integrating Reinforcement Learning and Code-as-policy
written by Shaoteng Liu, Haoqi Yuan, Minda Hu, Yanwei Li, Yukang Chen, Shu Liu, Zongqing Lu, Jiaya Jia
(Submitted on 29 Feb 2024)
Comments: Published on arXiv.
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Summary

RL-GPT is a new framework that combines large language models (LLMs) and reinforcement learning (RL).

LLMs can master a variety of programming tools but struggle with complex logic and precise control. In RL-GPT, two agents, one fast and one slow, work together to perform tasks.

The slow agent makes a plan, and the fast agent does the actual coding based on that plan. This division of labor lets them accomplish tasks efficiently.

RL-GPT outperforms traditional methods and can obtain diamonds in Minecraft in less than a day. By comparison, many players report that finding diamonds typically takes 3 to 7 days, depending on playing style and luck.

Introduction

This paper is about building AI agents that master tasks in open-world environments, one of the longstanding goals of AI research. The advent of large language models (LLMs) has made this goal more attainable. LLMs are skilled at using computer tools and operating search engines, but they remain limited in specific open-world environments. For example, LLMs still fall short on tasks that even a child can handle, such as fighting in video games. For this reason, reinforcement learning (RL) is gaining attention: RL is an effective way to learn from interaction and shows promise as a way for LLMs to "practice."

However, using RL to train LLM agents typically requires large amounts of data, expert demonstrations, and access to LLM parameters, which makes it inefficient. To address this, the paper proposes a new way of integrating LLMs and RL in which the LLM agent uses the RL training pipeline as a tool. The resulting framework, called RL-GPT, is designed to enhance LLMs by letting RL and the LLM work together to solve tasks.

The above figure shows a schematic of RL-GPT. After optimization in an environment, an LLM agent obtains optimized coded actions, RL obtains an optimized neural network, and RL-GPT obtains both optimized coded actions and an optimized neural network.

Related Research

First, Minecraft is an open-world game, and building efficient, general-purpose agents within it is an important problem. Early efforts used hierarchical reinforcement learning and often relied on human demonstrations. However, these approaches required many environment steps, for short-horizon and long-horizon tasks alike. Later approaches used LLMs, which enabled task decomposition and high-level planning. However, these methods rely on manually designed controllers and code interfaces and sidestep the challenge of learning low-level policies. RL-GPT was proposed to fill this gap. It extends the capabilities of LLMs by equipping them with RL, enabling automatic and efficient task learning in Minecraft. Because RL and LLMs have complementary strengths, integrating them is expected to yield efficient task learning.

Work on this integration began with using LLM domain knowledge to improve RL, and later research has combined LLMs and RL to decompose subtasks and generate reward functions. RL-GPT equips the LLM with RL as a tool so that the LLM's skills keep improving and its competence is maintained. It is one of the first studies to

Proposed Method

RL-GPT consists of three major components.

(1) The slow agent decomposes a given task into multiple subactions and determines which actions can be coded directly.
(2) The fast agent writes the code and sets up the RL.
(3) The iterative mechanism coordinates both the slow and fast agents to improve the overall performance of RL-GPT.
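The interplay of the three components can be sketched as a small loop. This is a minimal illustration, not the authors' implementation: `slow_agent` and `fast_agent` are hypothetical stand-ins for LLM calls, and the sub-action names and the `toy_evaluate` function are invented for the example. The sketch assumes that any sub-action whose coded version fails is handed over to RL on the next iteration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Plan:
    """Output of the (hypothetical) slow agent: what to code and what to learn."""
    coded: List[str] = field(default_factory=list)
    learned: List[str] = field(default_factory=list)


def slow_agent(sub_actions: List[str], too_hard_to_code: Set[str]) -> Plan:
    # Stand-in for an LLM call: anything flagged as too hard is left to RL.
    plan = Plan()
    for name in sub_actions:
        (plan.learned if name in too_hard_to_code else plan.coded).append(name)
    return plan


def fast_agent(plan: Plan) -> Dict[str, str]:
    # Stand-in for code generation: record how each sub-action is realized.
    impl = {name: "code" for name in plan.coded}
    impl.update({name: "rl" for name in plan.learned})
    return impl


def iterate(
    sub_actions: List[str],
    evaluate: Callable[[Dict[str, str]], List[str]],
    max_iters: int = 5,
) -> List[Tuple[Dict[str, str], List[str]]]:
    """Re-plan until the evaluation reports no failing coded sub-action."""
    too_hard: Set[str] = set()
    history = []
    for _ in range(max_iters):
        impl = fast_agent(slow_agent(sub_actions, too_hard))
        failures = evaluate(impl)  # names of coded sub-actions that failed
        history.append((impl, failures))
        if not failures:
            break
        too_hard |= set(failures)
    return history


def toy_evaluate(impl: Dict[str, str]) -> List[str]:
    # Assumption for the example: coding "find_tree" never works.
    return [n for n, how in impl.items() if n == "find_tree" and how == "code"]


history = iterate(["navigate", "find_tree", "attack"], toy_evaluate)
final_impl, final_failures = history[-1]
```

In this toy run the first iteration codes everything and fails on `find_tree`, and the second iteration assigns `find_tree` to RL while keeping the other sub-actions coded.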

Within RL-GPT, the RL interface exposes the following components: learning task, environment reset, observation space, action space, and reward function. This makes it possible to integrate RL and Code-as-policy. The slow agent uses GPT-4 to decompose a given task into subactions and decide whether each can be coded. The fast agent also uses GPT-4: it translates instructions from the slow agent into Python code and revises that code based on feedback from the environment. A two-loop iteration mechanism is used to optimize the fast and slow agents. A task planner is also introduced to handle complex tasks. Together, these components allow RL-GPT to handle complex tasks and learn them efficiently.
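To make the five interface components concrete, here is a hedged sketch of how they might be grouped into a single configuration object. The class name `RLTaskConfig`, its field names, and the log-harvesting values are all assumptions made for illustration; the article does not specify the actual data structures used in RL-GPT.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Observation = Dict[str, float]


@dataclass
class RLTaskConfig:
    """Hypothetical container for the five RL interface components."""
    learning_task: str                      # what the RL policy should learn
    reset: Callable[[], Observation]        # environment reset
    observation_keys: List[str]             # observation space (by key)
    actions: List[str]                      # action space
    reward: Callable[[Observation], float]  # reward function

    def validate(self) -> Observation:
        """Check that reset() yields every declared observation key."""
        obs = self.reset()
        missing = [k for k in self.observation_keys if k not in obs]
        if missing:
            raise ValueError(f"reset() is missing observation keys: {missing}")
        if not self.actions:
            raise ValueError("action space must not be empty")
        return obs


# Illustrative configuration for a log-harvesting sub-task (values invented).
harvest_log = RLTaskConfig(
    learning_task="harvest a log",
    reset=lambda: {"distance_to_tree": 5.0, "logs": 0.0},
    observation_keys=["distance_to_tree", "logs"],
    actions=["forward", "turn_left", "turn_right", "attack"],
    reward=lambda obs: obs["logs"],
)
initial_obs = harvest_log.validate()
```

Keeping the components in one object is a design convenience here: a code-writing agent would only need to fill in one structure per sub-task.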

The overall framework consists of a slow agent (orange) and a fast agent (green). The slow agent decomposes tasks and determines which actions should be learned. The fast agent writes the code and the RL configurations for low-level execution.

Experiment

The environment used in the study is MineDojo, a pioneering framework for setting up a variety of tasks within Minecraft, including long-horizon tasks such as cutting trees and crafting items. As for the method, RL-GPT uses GPT-4 as its LLM. For RL, it employs Proximal Policy Optimization (PPO), which samples data from interactions with the environment and uses stochastic gradient ascent to optimize the agent's policy.
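For readers unfamiliar with PPO, the quantity that gradient ascent maximizes is the clipped surrogate objective. The sketch below computes it in plain Python for a batch of samples. The clipping value of 0.2 is a commonly used default and is an assumption here; the article does not state the hyperparameters used in RL-GPT.

```python
import math
from typing import Sequence


def ppo_clip_objective(
    new_logp: Sequence[float],
    old_logp: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Mean clipped surrogate objective over a batch of sampled actions."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the updated and the data-collecting policy.
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Ratio 2 with a positive advantage is capped at 1 + clip_eps.
capped = ppo_clip_objective([math.log(2.0)], [0.0], [1.0])
# Ratio 2 with a negative advantage is not capped, so the penalty stays large.
uncapped = ppo_clip_objective([math.log(2.0)], [0.0], [-1.0])
```

The clipping removes the incentive to move the policy far from the one that collected the data, which is what lets PPO reuse each batch for several gradient steps.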

The main results show that RL-GPT performed better than the baseline methods.

RL-GPT achieved the highest success rate in the MineDojo task.

This is the main result for the Get Diamonds task in Minecraft. Existing strong baselines for the task require expert data (VPT, DEPS), handcrafted policies for subtasks (DEPS-Oracle), or a huge number of environment steps for training (DreamerV3, VPT). RL-GPT can automatically decompose and train subtasks with only a small amount of human prior knowledge, and it acquires diamonds with excellent sample efficiency.

RL-GPT was compared with existing methods such as DreamerV3, VPT, DEPS, and Plan4MC, and achieved a success rate of 8% or higher.

This is a demonstration of how different agents learn to collect logs. In the first iteration, RL-GPT attempts to code every action involved in log collection, and the success rate is 0%. It then selects "aim at the tree and attack 20 times" as the action to code and executes it. However, finding the tree proves too difficult for the LLM, so the agent is instructed to choose finer-grained actions. Ultimately, RL-GPT finds the correct solution by coding navigation and attacks, and performance improves in subsequent iterations. This illustrates how RL-GPT learns a task and raises its success rate.
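A coded action such as "attack 20 times" is simply a short program that issues a fixed sequence of primitive actions. The sketch below shows the idea against a toy environment. `ToyTree`, its `step` method, and the number of hits needed per log are hypothetical and do not reflect the MineDojo API or Minecraft's actual mechanics.

```python
class ToyTree:
    """Hypothetical stand-in environment: a log drops after enough hits."""

    def __init__(self, hits_needed: int = 8) -> None:
        self.hits_needed = hits_needed
        self.hits = 0
        self.logs = 0

    def step(self, action: str) -> int:
        """Apply one primitive action and return the logs collected so far."""
        if action == "attack":
            self.hits += 1
            if self.hits == self.hits_needed:
                self.logs += 1
                self.hits = 0
        return self.logs


def attack_n_times(env: ToyTree, n: int = 20) -> int:
    """Coded action: repeat the primitive 'attack' n times."""
    logs = 0
    for _ in range(n):
        logs = env.step("attack")
    return logs


logs_collected = attack_n_times(ToyTree())
```

A fixed repetition like this is easy to write as code, whereas locating and facing the tree depends on what the agent observes, which is the kind of sub-action the article describes as harder to code.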

In addition, the study discusses why RL-GPT performed well on a variety of tasks within Minecraft, as well as future applications. This suggests that this research could be applied not only to in-game AI training, but also to real-world problems.

Conclusion

This study proposes RL-GPT, a novel method that combines large language models (LLMs) and reinforcement learning (RL). It strengthens agents working on difficult tasks in complex games such as Minecraft. RL-GPT splits a task into high-level coded actions and low-level RL-based actions, and achieves better efficiency than traditional RL methods and existing GPT agents. This results in good performance on difficult Minecraft tasks.

Looking to the future, RL-GPT is expected to have a wider range of applications. For example, the method can be used to address other games and real-world problems. In addition, as RL-GPT is improved and new application methods are developed, it may be possible to address more advanced tasks.