
GLAM: LLM As A Reinforcement Learning Agent


Large Language Models

3 main points
✔️ Treats a large language model as the policy of a reinforcement learning agent and trains it further
✔️ Develops an environment and prompts that represent reinforcement learning tasks in language
✔️ Shows that using large language models for reinforcement learning improves sample efficiency and generalization performance

Grounding Large Language Models in Interactive Environments with Online Reinforcement Learning
written by Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, Olivier Sigaud, Pierre-Yves Oudeyer
(Submitted on 6 Feb 2023 (v1), revised 12 May 2023 (this version, v2), latest version 6 Sep 2023 (v3))
Comments: Published on arxiv.

Subjects:  Machine Learning (cs.LG)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Recently, it has become clear that Transformer-based large language models (LLMs) exhibit a variety of capabilities. Among them, LLMs have been shown to capture some of the physical regularities of the world we humans live in. For example, they have been shown to possess prior knowledge of space and of the affordances of bodies and objects.

However, LLMs are also said to suffer from a lack of grounding in the real world, which prevents them from understanding the meaning of concepts and applying their knowledge appropriately in an environment.

There are three possible causes for this:

(1) The LLM training objective of next-word prediction is not directly related to solving problems in an environment
(2) A lack of ability to interact with the environment to identify causal structure
(3) A lack of ability to learn from data gathered through interaction with the environment

This study investigates whether an LLM can be used as the policy in reinforcement learning: can an agent (LLM) appropriately ground its knowledge using the new observations produced by its actions as it interacts with the environment?

Specifically, the experiments investigate the following questions:

Q1: Sample efficiency
How quickly can an LLM adapt and learn in spatial navigation problems specified in natural language?

Q2: Generalization to new objects
Can the agent generalize to objects it has not seen in a trained task?

Q3: Generalization to a new task
Can the agent generalize zero-shot to a new task?


In this paper, we propose a method called GLAM (Grounded LAnguage Models).

It uses an LLM as the policy of a reinforcement learning agent and functionally grounds it through interaction with the environment using online reinforcement learning (i.e., the model's internal symbolic processing models, predicts, and controls external physical processes), aiming to achieve goals described in language based on observations and reward information. The overall picture of the method is shown in the figure below.


We use an environment called BabyAI-Text, a modification of the BabyAI platform [Chevalier-Boisvert et al., 2019] that can be represented using only text. It is a mini-grid world (the black rectangle in the figure above) in which the agent can move and interact with objects using six commands: turn left, turn right, go forward, pick up, drop, and toggle.
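As a concrete illustration, the six commands can be exposed through a gym-style text interface. The class below is a toy mock written for this article, not the actual BabyAI-Text API; only the list of six commands comes from the paper.

```python
# The six commands supported by BabyAI-Text, per the paper.
ACTIONS = ["turn left", "turn right", "go forward", "pick up", "drop", "toggle"]

class MockTextEnv:
    """Toy stand-in for a text-only grid environment (hypothetical API)."""
    def __init__(self):
        self.steps = 0

    def reset(self):
        self.steps = 0
        return "You see a red ball 2 steps forward."  # textual observation

    def step(self, action):
        assert action in ACTIONS, f"unknown command: {action}"
        self.steps += 1
        obs = f"After '{action}': you see a wall 1 step forward."
        reward, done = 0.0, self.steps >= 3  # toy termination rule
        return obs, reward, done

env = MockTextEnv()
obs = env.reset()
for cmd in ["go forward", "turn left", "go forward"]:
    obs, reward, done = env.step(cmd)
print(done)  # True — this toy episode ends after three steps
```

The point is only that both observations and actions are plain strings, which is what lets an LLM act as the policy.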

How to calculate the probability of selecting an action

Given a prompt $p$, the probability of the word sequence $a_i = \{w_0, ..., w_{|a_i|}\}$ representing an action is computed as

$$\mathbb{P}(a_i \mid p) = \prod_{j=0}^{|a_i|} \mathbb{P}(w_j \mid p, w_{<j})$$

The logarithm of this probability is computed for each action, and

applying a softmax over these log-probabilities yields the selection probability of each action.
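Concretely, the selection probability can be computed by summing each candidate action's token log-probabilities and taking a softmax across actions. The numbers below are toy scores standing in for real LLM outputs.

```python
import math

def action_probs(token_logprobs_per_action):
    """Sum each action's token log-probabilities, then softmax across actions.

    token_logprobs_per_action: entry i holds log P(w_j | p, w_<j) for every
    token of action a_i (toy numbers here, not real LLM scores).
    """
    scores = [sum(lps) for lps in token_logprobs_per_action]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy log-probabilities for three candidate actions.
probs = action_probs([[-0.1, -0.2], [-1.0, -0.5], [-2.0, -1.5]])
print(round(sum(probs), 6))  # 1.0 — a valid distribution over actions
```

The action is then sampled from this distribution, which is what makes the LLM usable as a stochastic policy.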

Fine-tuning with PPO

Since PPO is an actor-critic reinforcement learning algorithm, it requires a value function. Therefore, a value head is added to the first decoder block of the LLM. With this setup, the LLM is fine-tuned in the BabyAI-Text environment.
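For reference, the core of the PPO policy update is the clipped surrogate objective. The scalar function below is a simplified sketch of that objective for a single sample, not the paper's implementation; the log-probabilities would come from the LLM's action-selection step and the advantage from the value head.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Clipped PPO surrogate loss for one (state, action) sample.

    logp_new / logp_old: log-probability of the chosen action under the
    current and behaviour policies; advantage: estimate based on the value
    head. Scalar toy version of the batched objective.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # PPO maximizes the smaller surrogate; negate to obtain a loss to minimize.
    return -min(ratio * advantage, clipped * advantage)

# The policy moved probability toward an action with positive advantage,
# so the clipped term caps the size of the update.
loss = ppo_clip_loss(logp_new=-0.5, logp_old=-1.0, advantage=1.0)
print(loss)  # -1.2: the ratio e^0.5 ≈ 1.65 is clipped to 1 + eps = 1.2
```

Clipping prevents a single fine-tuning step from pushing the LLM's action distribution too far from the data-collecting policy.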


GLAM was applied to fine-tune a large pre-trained Flan-T5 780M [Chung et al., 2022] language model and compared with other baseline models.

The proposed method is denoted GFlan-T5. The baselines are NPAE-Flan-T5 (a version without pre-training), DRRN (a conventional reinforcement learning method), and Symbolic-PPO (a PPO agent trained on symbolic observations in the BabyAI environment, without the language information from BabyAI-Text).

At each step, the agent receives a prompt composed of the following parts:

A sentence indicating a possible action
Possible actions of the agent: <list of actions>

A sentence indicating the goal of the agent
Goal of the agent: <goal>

Text showing observations of the last 3 steps and actions of the last 2 steps
Obs. 0: <description from BabyAI-Text at step t-2 >
Action 0: <action chosen by the agent at step t-2 >
Obs. 1: <description from BabyAI-Text at step t-1 >
Action 1: <action chosen by the agent at step t-1 >
Obs. 2: <description from BabyAI-Text at step t >
Action 2: <the next action to be chosen by the agent>
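The layout above can be assembled as follows. `build_prompt` is a hypothetical helper written for this article, and the field wording follows the article rather than the paper's verbatim templates.

```python
def build_prompt(actions, goal, observations, past_actions):
    """Assemble the per-step prompt from the fields listed above.

    observations: the last three textual observations [t-2, t-1, t];
    past_actions: the actions chosen at steps t-2 and t-1.
    """
    lines = [
        f"Possible actions of the agent: {', '.join(actions)}",
        f"Goal of the agent: {goal}",
    ]
    for i, obs in enumerate(observations):
        lines.append(f"Obs. {i}: {obs}")
        if i < len(past_actions):
            lines.append(f"Action {i}: {past_actions[i]}")
    # The final "Action 2:" line is left for the LLM to complete.
    lines.append(f"Action {len(observations) - 1}:")
    return "\n".join(lines)

prompt = build_prompt(
    actions=["turn left", "turn right", "go forward", "pick up", "drop", "toggle"],
    goal="go to the red ball",
    observations=["you see a wall", "you see a red ball", "the red ball is 1 step forward"],
    past_actions=["turn right", "go forward"],
)
print(prompt.splitlines()[-1])  # Action 2:
```

Each candidate action is scored by appending it after the trailing `Action 2:` line and reading off its token log-probabilities.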


Q1: Sample efficiency

To determine how quickly the LLM agent can adapt to solve tasks, we trained it for 1.5 million steps. For each episode, a goal was sampled at random from the following patterns:

Go to <object>: task to go to a given object
Pick up <object>: task to pick up a given object
Pick up <object A> then go to <object B> or Go to <object B> after pick up <object A>: tasks requiring picking up an object and going to an object in sequence
Unlock <door>: task to unlock a door with a key

The success rates of the four types of agents for this task are shown in the following figure.

This shows that only the proposed method, GFlan-T5, adapts to the tasks quickly.

The comparison with NPAE-Flan-T5 shows that GFlan-T5 effectively exploits the knowledge acquired during LLM pre-training, and that fine-tuning lets it grasp object concepts.

The comparison with Symbolic-PPO shows that the linguistic information contributes substantially to learning these tasks.

In summary, the results show that, thanks to the linguistic prior knowledge gained in pre-training, fine-tuned LLMs perform better on reinforcement learning tasks.

Q2: Generalization to new objects

We investigate whether LLM agents fine-tuned in the BabyAI-Text environment can handle new objects that were not seen during fine-tuning.

The results correspond to (b) and (c) in the following table, where (b) renames existing objects and (c) introduces entirely new objects.

In both cases, GFlan-T5 shows high performance, indicating that it is able to acquire symbols that successfully represent the structure and directives of the environment.

Q3: Generalization to new tasks

We investigated the agents' ability to cope when (d) the order of the sub-tasks in the goal was changed, (e) the expressions for actions were replaced by synonyms, and (f) the language was switched to another language (French).

The results correspond to (d), (e), and (f) in the previous table. In all cases GFlan-T5 did not perform well, so generalization to new tasks appears to be difficult.


The paper presented here proposes a method called GLAM and shows that by fine-tuning pre-trained LLMs on RL tasks, it is possible to map the dynamics of the environment onto the symbols of language.

Limitations remain: the environment must be describable in language, and there are challenges regarding the action space and the size of the LLM, but these are expected to be addressed in future research.

