Can Wikipedia Assist Offline Reinforcement Learning? Introducing Pre-training In Language Tasks To Offline Reinforcement Learning!

Offline Reinforcement Learning 11/10/2023

3 main points
✔️ To address the difficulty of collecting large datasets for offline reinforcement learning, the authors propose pre-training on tasks from a different domain, such as language
✔️ They propose techniques for transferring what a model has learned in language pre-training to offline reinforcement learning
✔️ Experimentally, the method outperforms existing methods in terms of convergence speed and performance

Can Wikipedia Help Offline Reinforcement Learning?
written by Machel Reid, Yutaro Yamada, Shixiang Shane Gu
(Submitted on 28 Jan 2022 (v1), last revised 24 Jul 2022 (this version, v3))
Comments: Published on arxiv.
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

The images used in this article are from the paper, the introductory slides, or were created based on them.

Background

In recent offline reinforcement learning, frameworks such as Decision Transformer, which formulate reinforcement learning as a sequence modeling task and represent policies with autoregressive models, have been successful. However, such methods converge slowly when trained from scratch. This paper proposes pre-training sequence models in other domains, such as language and vision tasks, before applying them to offline reinforcement learning, and verifies the effectiveness of this approach experimentally.

Technique

Figure 1: Conceptual diagram of the method.

First, we describe the basic elements of reinforcement learning as sequence modeling. We assume that a sequence $t$ of future returns, states, and actions, called a trajectory, is given as data:

$$t = (\hat{R}_1, s_1, a_1, \dots, \hat{R}_N, s_N, a_N)$$
where $s_i, a_i$ are the state and action at time $i$, and $\hat{R}_i = \sum_{t=i}^{N}r_t$ is the future return, i.e. the sum of the rewards $r_t$ from time $i$ to the end of the trajectory. If $\hat{R}_i, s_i, a_i$ at each time step are treated as tokens, the trajectory can be modeled with the same framework used to train language models.
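The construction of such a trajectory can be sketched in a few lines of Python. This is only an illustration of the definition above, not the authors' code: the function names `returns_to_go` and `build_trajectory`, the `(kind, value)` token format, and the toy rewards are all assumptions made for this example.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Compute R_hat_i = r_i + r_{i+1} + ... + r_N for every step i."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step adds its own reward to the suffix sum.
    for i in range(len(rewards) - 1, -1, -1):
        running += rewards[i]
        rtg[i] = running
    return rtg


def build_trajectory(
    states: Sequence[Any], actions: Sequence[Any], rewards: Sequence[float]
) -> List[Tuple[str, Any]]:
    """Interleave (return, state, action) per step, giving 3N tokens."""
    tokens: List[Tuple[str, Any]] = []
    for r_hat, s, a in zip(returns_to_go(rewards), states, actions):
        tokens.extend([("return", r_hat), ("state", s), ("action", a)])
    return tokens


# Toy example with N = 3 steps (hypothetical values).
rewards = [1.0, 0.0, 2.0]
tokens = build_trajectory(["s1", "s2", "s3"], ["a1", "a2", "a3"], rewards)
print(tokens[:3])
```

A trajectory of $N$ steps yields $3N$ tokens, which is why the input representations later in the article are indexed up to $3N$.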
The paper aims to pre-train the Transformer on problems from a different domain than reinforcement learning, such as language and vision tasks [Figure 1]. The authors identify the discrepancy between the representations learned in language or vision tasks and those learned in reinforcement learning as the main obstacle, and propose two techniques to reduce this discrepancy.

Similarity loss between language representation and offline RL representation

Let $V$ be the vocabulary size of the pre-trained Transformer, and let $E_1, \dots, E_j, \dots, E_V, \quad \forall j, E_j \in \mathbb{R}^d$ be the corresponding token embedding vectors. Let $I_1, \dots, I_i, \dots, I_{3N}, \quad \forall i, I_i \in \mathbb{R}^d$ be the representation vectors obtained by embedding each future return, state, and action in the trajectory separately. The following loss is introduced so that the representation vectors $I_1, \dots, I_{3N}$ for offline reinforcement learning stay close to the language embeddings already learned.

$$\mathcal{L}_{\mathrm{cos}} = - \sum_{i=1}^{3N} \max_j \mathcal{C}(I_i, E_j)$$

where $\mathcal{C}$ is the cosine similarity. Minimizing this loss pulls each input representation toward its most similar language embedding. The expectation is that the representations learned in reinforcement learning will then not drift away from those acquired in the language task, which helps the Transformer retain the sequence modeling ability it achieved in pre-training.
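As a rough sketch of how this loss could be computed, the snippet below reads $\mathcal{C}$ as cosine similarity (an assumption that matches the $\max$ and the negative sign) and sums over every input vector it is given. The function names and the tiny two-dimensional vectors are hypothetical and chosen only for illustration; all vectors are assumed to be non-zero.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_alignment_loss(
    inputs: Sequence[Sequence[float]], vocab: Sequence[Sequence[float]]
) -> float:
    """Negative sum, over input vectors I_i, of max_j C(I_i, E_j)."""
    return -sum(
        max(cosine_similarity(x, e) for e in vocab) for x in inputs
    )


# Toy vocabulary embeddings E_j and input representations I_i.
vocab = [[1.0, 0.0], [0.0, 1.0]]
inputs = [[2.0, 0.0], [1.0, 1.0]]
loss = cosine_alignment_loss(inputs, vocab)
print(loss)
```

The first input is parallel to a vocabulary vector and contributes the maximum similarity of 1, while the second lies between the two vocabulary vectors and contributes less, so the loss rewards inputs that line up with some existing language embedding.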

Language model co-training

The authors also continue training on the language task during the offline reinforcement learning phase. By doing so, they expect the model to benefit more directly from the sequence modeling ability learned in the language task. The final objective function is as follows:
$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda_1 \mathcal{L}_{\mathrm{cos}} + \lambda_2 \mathcal{L}_{\mathrm{LM}}$$
where $\mathcal{L}_{\mathrm{MSE}}$ is the loss function for offline reinforcement learning using a Transformer like Decision Transformer, $\mathcal{L}_{\mathrm{LM}}$ is the loss for the language task, and $\lambda_1, \lambda_2$ are hyperparameters.
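The weighted sum can be illustrated with a minimal sketch. It assumes that the MSE term is taken between predicted and dataset actions, as Decision Transformer does for continuous control; the helper names, the loss values, and the values chosen for $\lambda_1, \lambda_2$ are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted and target action values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def total_loss(
    l_mse: float, l_cos: float, l_lm: float, lambda_1: float, lambda_2: float
) -> float:
    """L = L_MSE + lambda_1 * L_cos + lambda_2 * L_LM."""
    return l_mse + lambda_1 * l_cos + lambda_2 * l_lm


# Hypothetical per-batch values, for illustration only.
l_mse = mse([0.5, -0.5], [1.0, 0.0])
loss = total_loss(l_mse, l_cos=-1.5, l_lm=3.0, lambda_1=0.1, lambda_2=1.0)
print(loss)
```

Setting $\lambda_1 = \lambda_2 = 0$ in this sketch recovers the plain offline reinforcement learning loss, which makes it easy to see the two extra terms as optional regularizers on top of the base objective.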

Experimental setup

The experimental setup is described below. The experiments test the effectiveness of pre-training on language and image recognition tasks using offline reinforcement learning benchmarks. Below is a brief description of the models used for pre-training, the reinforcement learning baselines used for comparison, and the tasks used to test performance.

Pre-trained models

Language tasks: 1. GPT-2-small. 2. A Transformer with the same number of parameters as Decision Transformer, trained on the WikiText-103 dataset for a fair comparison. The authors call it ChibiT, indicating a small language model.
Image recognition tasks: 1. CLIP (Contrastive Language-Image Pre-training). CLIP consists of a text encoder and an image encoder, and is trained to predict the matching between captions and images. Each encoder consists of a Transformer. 2. ImageGPT (iGPT). ImageGPT has the same architecture as GPT and is trained on a pixel prediction task instead of a language task.

Reinforcement Learning Baseline

The baselines are Decision Transformer (DT), which uses a Transformer without pre-training, and CQL, TD3-BC, BRAC, and AWR, which are offline reinforcement learning methods that do not use a Transformer.

Task

The models are evaluated on Atari and OpenAI Gym MuJoCo tasks using D4RL, a benchmark dataset for offline reinforcement learning. D4RL provides data collected by policies of different quality for each task.

Atari tasks: Breakout, Qbert, Pong, and Seaquest are used for comparison with the baseline models.
OpenAI Gym tasks: HalfCheetah, Walker2d, and Hopper are used to train and evaluate models.

Results and Analysis

Figure 2: Performance comparison on the MuJoCo tasks.

As shown in Figure 2, ChibiT and GPT-2, which are pre-trained on the language task, perform better than DT without pre-training. Moreover, they match or exceed the performance of the offline reinforcement learning methods CQL and TD3-BC. These results show that pre-training on language tasks is effective in terms of performance. A more detailed analysis is given below.

Convergence speed

The authors compare the convergence speed of the model without pre-training and the models pre-trained on the language task. Here, a model is considered to have converged once the difference between its mean return and the maximum return is within 2 in normalized score. As can be seen in Figure 3, ChibiT and GPT-2, pre-trained on the language task, converge more than twice as fast as DT.
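One way to read this convergence criterion is sketched below: given the mean normalized score at each evaluation checkpoint, find the first checkpoint whose score is within 2 points of the best score reached during training. This is an interpretation of the definition in the text, not the authors' evaluation script, and the function name and the two score curves are made-up toy numbers rather than results from the paper.

```python
from typing import Sequence


def convergence_step(scores: Sequence[float], tolerance: float = 2.0) -> int:
    """Index of the first checkpoint within `tolerance` of the best score."""
    best = max(scores)
    for step, score in enumerate(scores):
        if best - score <= tolerance:
            return step
    # Unreachable for non-empty input: the best score itself qualifies.
    raise ValueError("scores must be non-empty")


# Toy learning curves (hypothetical normalized scores per checkpoint).
from_scratch = [10.0, 40.0, 60.0, 70.0, 71.0]
pre_trained = [30.0, 69.5, 70.0, 71.0, 71.0]

print(convergence_step(from_scratch), convergence_step(pre_trained))
```

Comparing the returned indices for two curves gives a simple measure of how much earlier one model converges than the other under this criterion.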

Vision vs. Language

Figure 4: Visualization of the attention mechanism.

Figure 2 shows that both CLIP and iGPT, which are pre-trained on image tasks, perform worse than the models pre-trained on the language task. In particular, the performance is significantly lower for iGPT, which is pre-trained on images only. The authors attribute this to the basic similarity between language modeling and trajectory modeling, and visualize the attention patterns to test this hypothesis [Figure 4]. Figure 4 shows that the attention pattern of iGPT is significantly different from that of DT, and is also less interpretable. Together, these additional experiments suggest that the language task is more beneficial as a pre-training task for the Transformer in reinforcement learning.

Summary

How was it? In this article, we introduced a paper showing that the performance and convergence speed of offline reinforcement learning with a Transformer can be improved by pre-training on a language task. It is very interesting to see the synergy between reinforcement learning and language tasks, which seem to have completely different characteristics. We will keep an eye on reinforcement learning with Transformers in the future.