# For Safe Reinforcement Learning In The Real World

3 main points
✔️ Safety-oriented methods for reinforcement learning
✔️ Pre-training in the source environment to avoid risky behaviors when learning in the target environment
✔️ Successfully curbed dangerous behavior in the target environment

Jesse Zhang, Brian Cheung, Chelsea Finn, Sergey Levine, Dinesh Jayaraman
(Submitted on 15 Aug 2020)

Comments: Accepted at ICML2020
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

## Introduction

(Deep) reinforcement learning has been used successfully in a variety of domains, including games such as chess and Go. Such reinforcement learning has also been applied to real-world tasks, such as automated driving.

However, reinforcement learning in the real world can sometimes be very risky.

In reinforcement learning, the agent and the environment interact with each other, and learning proceeds by actually trying out actions, both good and bad. In a game, a bad move by the agent causes no real harm.

But what if it was self-driving?

Needless to say, a mistake by an automated driving system in the real world means a traffic accident, with human and property damage. This makes reinforcement learning very difficult in situations where the worst case involves very large losses. In such real-world settings, we need a system that can learn without having to try risky behaviors. Cautious Adaptation in RL (CARL), presented in this article, addresses this problem by learning about risky behaviors in advance, in a safe source environment, and then avoiding them while learning in the target environment.

## Prior Knowledge

In the paper, CARL is built on a model-based reinforcement learning method called PETS. Therefore, we will discuss the main features of PETS first.

### Probabilistic dynamics model

In PETS, an ensemble of probabilistic dynamics models is learned from interaction with the environment. In simple terms, multiple models are trained to capture how the environment behaves. Each model in the ensemble is trained to predict the distribution of the next state $s'$ given the current state $s$ and action $a$.
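As a minimal sketch of this idea, the toy ensemble below uses simple linear-Gaussian models in place of the neural networks used in PETS; each member predicts a mean and standard deviation for the next state, and the members differ through random initialization. All class and variable names here are illustrative, not from the paper.

```python
import numpy as np

class ProbabilisticModel:
    """One ensemble member: predicts a Gaussian over the next state s'
    given (s, a). A toy linear model standing in for a neural network."""
    def __init__(self, state_dim, action_dim, rng):
        self.W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))
        self.W_a = rng.normal(scale=0.1, size=(state_dim, action_dim))
        self.log_std = np.full(state_dim, -2.0)  # learned noise level

    def predict(self, s, a):
        mean = s + self.W_s @ s + self.W_a @ a   # mean of p(s' | s, a)
        std = np.exp(self.log_std)               # std of p(s' | s, a)
        return mean, std

    def sample(self, s, a, rng):
        mean, std = self.predict(s, a)
        return mean + std * rng.normal(size=mean.shape)

rng = np.random.default_rng(0)
# An ensemble is just several independently initialized models.
ensemble = [ProbabilisticModel(3, 1, rng) for _ in range(5)]
s, a = np.zeros(3), np.array([1.0])
samples = [m.sample(s, a, rng) for m in ensemble]
```

Because each member is initialized differently, the spread of their predictions reflects model uncertainty, which is what PETS exploits downstream.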

### Action selection

For action selection, PETS uses sampling-based model-predictive control (MPC). An evolutionary search is used to find the action sequence with the highest predicted reward.
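The search can be sketched with the cross-entropy method, a common evolutionary-style search for this kind of MPC: sample candidate action sequences from a Gaussian, score them, and refit the Gaussian to the best ones. The function below is a hedged illustration; `score_fn` stands in for the model-based reward prediction, and the parameter names are mine.

```python
import numpy as np

def cem_plan(score_fn, horizon, action_dim,
             n_samples=100, n_elite=10, n_iters=5, rng=None):
    """Cross-entropy-method search over action sequences (a sketch of
    the evolutionary search used in sampling-based MPC)."""
    rng = rng or np.random.default_rng(0)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate sequences around the current distribution.
        candidates = mean + std * rng.normal(
            size=(n_samples, horizon, action_dim))
        scores = np.array([score_fn(A) for A in candidates])
        # Refit the distribution to the highest-scoring sequences.
        elites = candidates[np.argsort(scores)[-n_elite:]]
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6
    return mean  # best-sequence estimate; MPC executes only mean[0]

# Toy score: sequences whose actions are near 0.5 are best.
best = cem_plan(lambda A: -np.sum((A - 0.5) ** 2),
                horizon=4, action_dim=1)
```

In MPC only the first action of the returned sequence is executed; the search is rerun from the next state.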

### Reward calculation

At this stage, a method called particle propagation is used. The specific procedure is as follows.

Let the initial state be $s_0$ and the action sequence be $A=[a_1,a_2,...,a_H]$, where $H$ is the number of time steps (actions) until the episode ends, known as the horizon. Given the dynamics model $f$, we perform action $a_1$ and predict the distribution of the state $s_1$ at the next time step. Repeating this for $H$ steps predicts the final state $s_H$ reached by following $A$.

Instead of acting in the real environment, we imagine a virtual $H$-step trial using the dynamics model $f$. This process is called state propagation. Repeating it $N$ times yields $N$ predicted states $\{s^i_H\}_{i=1}^{N}$ after $H$ steps; each of these predictions is called a particle. With a predicted reward $r_i$ assigned to each particle $i \in [1,N]$, the score of the action sequence $A$ is $$R(A) = \frac{1}{N}\sum_{i=1}^{N} r_i.$$ We then select $A^* = \arg\max_A R(A)$ and execute only its first action $a_1$. From the resulting state $s_1$, the whole process is repeated to determine the next action.
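The particle-propagation score can be sketched as follows. This is a minimal illustration, assuming a `sample_next(s, a, rng)` function that samples $s'$ from the learned ensemble and a `reward_fn` standing in for the task reward; both names are hypothetical.

```python
import numpy as np

def score_action_sequence(A, sample_next, s0, reward_fn,
                          n_particles, rng):
    """Particle propagation: each particle is one virtual H-step rollout
    through the learned dynamics; R(A) is the mean particle reward."""
    rewards = np.empty(n_particles)
    for i in range(n_particles):
        s = np.asarray(s0, dtype=float)
        r = 0.0
        for a in A:                      # H virtual steps, no real env
            s = sample_next(s, a, rng)   # s_{t+1} ~ f(s_t, a_t)
            r += reward_fn(s)
        rewards[i] = r                   # r_i for particle i
    return rewards.mean()                # R(A) = (1/N) * sum_i r_i

# Toy dynamics: s' = s + a + noise; reward penalizes distance from 0.
rng = np.random.default_rng(0)
sample_next = lambda s, a, rng: s + a + 0.01 * rng.normal(size=s.shape)
R = score_action_sequence([np.array([0.1])] * 3, sample_next,
                          np.zeros(1), lambda s: -np.abs(s).sum(),
                          n_particles=20, rng=rng)
```

Averaging over particles smooths out the stochasticity of the dynamics model; MPC then maximizes this score over candidate sequences $A$.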