# I Would Like To Run Decision Transformer In A Stochastic Environment As Well!

3 main points
✔️ Present the properties of information statistics necessary for RvS success in a stochastic environment
✔️ Propose an algorithm ESPER to estimate the above information statistics
✔️ Achieve performance beyond existing RvS methods, Decision Transformer, in stochastic environments such as 2048

written by Keiran PasterSheila McIlraithJimmy Ba
(Submitted on 31 May 2022 (v1), last revised 28 Nov 2022 (this version, v2))
Comments: Added experiments with Decision Transformers; Fixed error in Theorem 2.1; Updated related works; Added link for code

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

## Introduction

Recently, a framework called Reinforcement learning via Supervised learning (RvS), which solves reinforcement learning problems by solving prediction tasks (supervised learning) of behaviors conditioned by trajectories and rewards using generative models such as Transformer framework called Reinforcement learning via Supervised learning (RvS) has attracted much attention. In the existing RvS methods, the model is trained to learn the relationship between the reward of the whole trajectory and the behavior by predicting the behavior with the reward of the whole trajectory as well as the history of the state behavior as inputs during training, and then aims to achieve high performance by inputting high reward during runtime. In this paper, we show that when the environment is stochastic, agents with high performance cannot be trained by simply conditioning on the reward of the trajectory, and we present a solution to this problem, ESPER.

## Decision Transformer and RvS

In this chapter, we first introduce the Decision Transformer, a typical method of RvS, and then formulate RvS as a generalization of the Decision Transformer. The basic notations of reinforcement learning are summarized. Reinforcement learning is formulated in terms of a Markov decision process. A Markov decision process consists of a set of states $S$, a set of actions $A$, a transition probability $T$, a reward function $r$, and a discount rate $\gamma$. In reinforcement learning, the goal is to learn an action policy $\pi$ that maximizes the cumulative discounted return, $\sum_{t} \gamma^t r_t$ (hereinafter simply referred to as return). The history of states, actions, and rewards in an episode is called a trajectory and is denoted by $\tau = (s_0, a_0, r_0, \dots)$. The Decision Transformer learns a policy from given interaction data $D = \{\tau_i\}_{i=1}^N$, modeling it in the following way using the transformer.

$$\pi(a_t|s_0, a_0, \dots, s_t, \hat{R}) = p_D(a_t|s_0, a_0, \dots, s_t, \hat{R})$$

where $p_D$ is the distribution induced from the data. where $\hat{R}$ is the return of the trajectory. At each time $t$, the strategy is learned by solving the maximum likelihood estimation problem of the behavior, conditioned on the previous state, the sequence of actions, and the returns. It is easy to understand by contrasting it with the next word prediction by the transformer in the language task. The transformer learns language usage by solving the task of predicting the most likely next word when conditioned on the previous word. On the other hand, the Decition transformer learns the relationship between returns and actions by predicting the most plausible action from the data when conditioned with the history of the previous state actions and the returns of the entire trajectory.

This can be generalized as follows. First, consider the information statistic $I(\tau)$ determined from the orbit ( $\tau$). This is like a score of the orbit, and $I(\tau)$ is a return in the decision transformer. Then, we consider an index $D$ that measures the distance between the information statistics. In the Decision Transformer, this is the conditional likelihood. With this element introduced, RvS can be formulated as a problem of learning $\pi$ that leads to the trajectory $\tau$ that minimizes the distance $D$ between $z$ and $I(\tau)$, given a distribution of information statistics $z$ determined from the data. Since it may be difficult to understand in words, we show how the equation is formulated.

$$\min_\pi \mathbb{E}_{z\sim p(z), \tau \sim p_{\pi_z}[D(I(\tau), z)]}$$

In this paper, we show that in a stochastic environment, solving the above problem with $I(\tau)$ as the return "does not result in successful learning", and clarify the properties of $I(\tau)$ desirable in a stochastic environment.

## RvS in a stochastic environment

### Why does normal RvS fail?

Why does RvS fail for measures conditioned on returns in a stochastic environment? Intuitively, it is because it is not possible to determine whether the returns obtained are due to the result of the action or to stochastic transitions. Let us consider the above MDP example [Figure 1]. The above transition diagram shows the distribution of the data. Consider the reward $r=1$, which is obtained 50% of the time when $a_1$ is taken, and 100% of the time when $a_2$ is taken. We see that for $a_2$, this reward is achieved as a direct result of the action, and for $a_1$, this reward is achieved by a stochastic transition. Suppose that we learn a conditional distribution based on this data. Conditioning the learned distribution on $r=1$, we find that both $a_1 and a_2$ take the value of probability. The optimal behavior is $a_2$, but even if $r=1$ is specified to obtain it, $a_1$ is also taken with a certain probability. This can be attributed to the fact that we used returns, which are statistics that cannot distinguish the results of stochastic transitions from those of actions, for training.

### Environmental randomness and independent statistics

In this section, we theoretically analyze the conditions necessary for RvS to succeed in a stochastic environment. First, we define successful learning as follows

Definition 2.1 (Consistently Achievable). A goal $z$ is consistently achievable from state $s_0$ under policy $\pi(a|s, z)$ if $\mathbb{E}_{\tau \sim p_{\pi_z}(\tau|s_0)}[D(I(\tau), z)] = 0$.

In other words, successful learning means that the trajectory $\tau$ induced by conditioning the learned strategy on a certain statistic $z$ achieves the same statistic $I(\tau)$ as the expected value of $z$. In terms of returns, this means that the learned strategy will achieve the expected return. Based on this, the following theorem describes the properties of the statistics necessary for successful learning.

Theorem 2.1. Let $\pi_D$ be a data collecting policy used to gather data to train an RvS policy, and assume this policy is trained such that $\pi( a_t|s_0, a_0, \dots, s_t, z) = p_{\pi_D}(a_t|s_0, a_0, \dots, s_t, I(\tau)=z)$. Then, for any goal z such that $p_{\pi_D}(I(\tau)=z|s_0)>0$, goal $z$ is consistently achievable iff $p_{\pi_D}(s_t|s_0, a_0, \dots , a_{t-1}) = p_{\pi_D}(s_t|s_0, a_0, \dots, a_{t-1}, I(\tau))$.

The condition required by this theorem is that the statistic $I(\tau)$ is independent of the randomness of the environment. That is, the statistic does not contain information on the randomness of the environment. If we can construct such a statistic, we can use existing RvS methods such as Decision Transformer to learn measures in a stochastic environment.

## Technique

Here, we describe a method they call ESPER ( Environment StochasticIndependent Representation ), which learns a statistic $I(\tau)$ independent of the dynamics of the environment. ESPER first learns a representation $I(\tau)$ of the statistic by adversarial clustering. Next, clusters are created based on the statistic, and average returns are computed for each cluster. Finally, the returns are used to perform the usual RvS [Figure 2].

### Hostile clustering

Adversarial clustering encourages $I(\tau)$ to become independent of the dynamics by learning $I(\tau)$ in an adversarial manner with respect to the dynamics estimation. The following is the model to be trained.

• Cluster model: $I(\tau) \sim p_{\theta}(I(\tau)|\tau)$
• Action predictor: $a_t \sim p_\theta(a_t|s_t, I(\tau))$
• Return predictor: $\hat{R}=f_{\psi}(I(\tau))$
• Transition predictor: $s_{t+1} \sim p_{\phi}(s_{t+1}|s_0, a_0, \dots, s_t, a_t, I(\tau))$

The following is the objective function.

$$L(\theta) = \mathbb{E}_{I(\tau)\sim p_\theta(I(\tau)|\tau)}\left[- \underbrace{\beta_{\mathrm{act}}} \log p_{\theta}(a_t|s_t, I(\tau ))}_{predict the action} + \underbrace{\beta_{\mathrm{adv}}} \log p_{\phi}(s_{t+1}|s_0, a_0, \dots, s_t, a_t, I(\tau))}_{ask to be adversarial to predict dynamics }\right]$$

$$L(\phi) = \mathbb{E}_{I(\tau)\sim p_\theta(I(\tau)|\tau)}\left[\underbrace{- \log p_{\phi}(s_{t+1}|s_0, a_0, \dots, s_t, a_t, I(\tau))}_{ Predicts the dynamics}\right]$$

You can see that the model for estimating returns is required to be adversarial to the estimation of dynamics, while contributing to the prediction of behavior.

### Estimated average returns for clusters

Next, using the learned representation model of the statistic $p_\theta(I(\tau)|\tau)$, we estimate the reward of the trajectory by $f_{\psi}(\tau)$. This estimates the average return of the trajectory with the statistic. Once the returns are estimated, we can use the estimated $p_{\theta}(I(\tau)|\tau)$ and $f_{\psi}(\tau)$ to obtain the estimated average returns for each trajectory $\tau$ in the data when clustered with the statistic $I(\tau)$ independent of randomness of probability. We can obtain an estimate of the average return when clustered by $I(\tau)$, a statistic that does not depend on the randomness of the probability.

$$L(\psi) = \mathbb{E}_{I(\tau)\sim p_\theta(I(\tau)|\tau)}\left[||R - f_{\psi}(I(\tau))||_2^2\right]$$

### RvS

Finally, we perform the usual RvS with the average returns of the clusters described above as conditioning factors.

## Experiment

The experiments in this study focus on the following three environments that include stochastic dynamics.

### Benchmark

• Gambling: The behavior illustrated above is the result of three environments.
• Connect Four: A four-in-a-row game with stochastic opponents. The reward is 1 for winning, 0 otherwise.
• 2048: A puzzle game involving tiles whose values and positions are determined by probability. The object is to create 2048 tiles by repeatedly sliding up, down, left, and right. In this experiment, we simplify the environment by assuming that the reward is 1 if the player can make 128 tiles, and 0 otherwise.

### Baseline

• Return-Conditioned RvS (DT): A conventional, return-conditioned Decision Transformer.
• CQL: A Value Function-Based Offline Reinforcement Learning Method.

### Result

As can be seen in Figure 3, ESPER has an advantage not only over Return-Conditioned RvS (DT) but also over CQL. The experimental results confirm that the performance of RvS with normal returns is lower in stochastic environments and that ESPER significantly improves the performance, and together with the previous analysis, the effectiveness of ESPER is theoretically and experimentally demonstrated.

## Summary

How was it? We have presented a paper proposing a new RvS method, ESPER, which works in a stochastic environment. RvS is well suited for large-scale data and is currently being actively studied. ESPER is expected to be applied to real-world problems where transitions cannot be estimated deterministically due to noise and other reasons. Please continue to follow the research trend of RvS.

If you have any suggestions for improvement of the content of the article,
please contact the AI-SCHOLAR editorial team through the contact form.