Multi-agent Reinforcement Learning Algorithm That Can Handle Increasing Or Decreasing Number Of Agents

Reinforcement Learning 07/10/2022

3 main points
✔️ Proposed a multi-agent reinforcement learning algorithm "MA-POCA" that can handle increasing and decreasing the number of agents in the environment.
✔️ Support variable-length input to Critic by using Attention
✔️ Significantly outperforms existing methods on tasks where agents are created and destroyed in an episode and on standard multi-agent coordination tasks

On the Use and Misuse of Absorbing States in Multi-agent Reinforcement Learning
written by Andrew Cohen, Ervin Teng, Vincent-Pierre Berges, Ruo-Ping Dong, Hunter Henry, Marwan Mattar, Alexander Zook, Sujoy Ganguly
(Submitted on 10 Nov 2021 (v1), last revised 7 Jun 2022 (this version, v2))
Comments: AAAI 2022
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

code：

The images used in this article are from the paper, the introductory slides, or were created based on them.

outline

In many real-world scenarios, agents need to cooperate to achieve a common goal. In these settings, single-agent reinforcement learning (RL) methods may exhibit suboptimal performance or worse as the number of agents increases. Multi-agent reinforcement learning (MARL) methods address these issues by using centralized training and decentralized execution. In this, the agents act using local observations, but globally available information is used during training.

So far, existing MARL methods have assumed that the number of agents in the environment is fixed for training. However, this is often unsuitable for many practical applications of MARL. For example, agents in a team-based video game may "spawn" (i.e., be created) or "die" (i.e., disappear before other agents) within a single episode. Also, in addition to games, any of the robots working in a team may run out of battery power and terminate before their teammates (i.e., other robots). In general, existing algorithms handle such situations by placing inactive agents in an absorbing state.
*Absorption state: A state where once you enter, you cannot leave (like the rightmost or leftmost state in the figure below).

Agents remain absorbing until the entire group of agents reaches the termination condition, regardless of their choice of action. While the absorbing state allows learning to occur with a fixed number of inputs to the Critic, it can also be viewed as a waste of information, which becomes more pronounced as the number of agents increases.
The key challenge posed by premature termination of agents is what we call Posthumous Credit Assignment. Agents who are excluded from the environment cannot experience the rewards given to the group after early termination, and thus cannot know whether their actions before termination were valuable to the group. To solve this problem, we propose a MARL algorithm that propagates value even if the agent terminates early. Specifically, for the existing MARL algorithm COunterfactual Multi-Agent Policy Gradients (COMA ), we propose a novel architecture Multi-Agent We proposed a novel architecture Multi-Agent POsthumous Credit Assignment (MA-POCA ), which uses Attention instead of a full-coupling layer with absorbing states, within the framework of centralized training and decentralized execution. The attention mechanism can be extended to an arbitrary number of agents by applying it only to the active agent information before input to the Critic.

background knowledge

decentralized-POMDP (Decentralized Partially Observed Markov Decision Process)

First, let's talk about distributed - partially observed Markov decision processes. This is a multi-agent extension of the partial observation Markov decision process familiar with single-agent reinforcement learning. The symbolic definitions used are as follows.

Number of agents N( ≥ 1)
State space of environment S
Joint observation space of all agents $O$ := $O_1$ × ... × $O_N$
- Observation space of agent i $O^i$.
Joint action space of all agents $A$:= $A^1$ × x ... × $A^N$
- The action space of agent i $A^i$.
State transition function P : S × A × S → [0, 1] .
Shared reward function r: S × A → R

Centralized Training, Decentralized Execution

We consider the framework of Independent Actor with Centralized Critic (IACC), in which a Critic learned based on joint information updates a set of independent Actors in actor-critic. As a comparative approach, we consider Independent Actor-Critic (IAC), where each agent learns independent Critic and Actor using only local information, and Joint ActorCritic, where a single JointActor and JointCritic (JAC), which learns a single JointActor and JointCritic. IAC is a localized method. However, IAC uses only local observations and does not work well for tasks that require large coordination. Also, JAC can be thought of as a large single-agent problem, but it is often impractical in real-world scenarios because the JointActor needs to access the observations of all agents at once to generate its actions. (*The figure below is an illustration of Centralized Training, Decentralized Execution)

Counterfactual Baselines

The counterfactual-virtual baseline is a COunterfactual Multi-Agent Policy Gradients (COMA) which was introduced in a paper proposing the MARL method called "COunterfactual Multi-Agent Policy Gradients (COMA)". The baseline is introduced so that the advantage function reflects "how much an individual agent contributed to the shared reward (group reward)". Specifically, we use a state action-value function that marginalizes the actions of individual agents (*As the name counterfactual implies, we are calculating the state action value of an agent "if it had taken a different action from the one it took") and

as the baseline, and the advantage function

the gradient of agent i's measure

is calculated. This allows us to calculate how much each agent has contributed to the group's shared reward.

Challenges when agents terminate early

Posthumous Credit Assignment

In a cooperative environment with shared rewards, agents act in a way that maximizes the group's future rewards. There are often cases (e.g., self-sacrificing events) where an individual agent's current behavior leads to the immediate termination of the agent itself, although it leads to a group reward at a later time step. In this case, from the perspective of the reinforcement learning agent, it is unable to observe the state of the environment when the group receives the reward, in addition to being excluded from the environment and thus unable to receive any rewards that the group may later obtain. Thus, the agent must learn to maximize the reward it cannot experience. We call this the Posthumous Credit Assignment problem.

Absorbing States

In Dec-POMDPs (distributed Markov decision processes), we treat $o^{abs}_i$ as the absorbing state when agent i reaches the termination state and is no longer active in the environment. Once agent i enters $o^{abs}_i$, it will remain there, regardless of its behavior, until the group reaches the termination state and all agents are reset to a new initial state, so that

Also, when agent i is in state $o^{abs}_i$, the transition function is independent of the agent's action.

*$a^{-i}$$ is the joint action of all agents except agent i.
Thus, it is easy to express the configuration of agents to be annihilated or created by introducing the absorbing state, but the following problems arise when using it.

The number of absorbing states is not constant, which complicates the training of function approximators based on neural networks
Computational resources must be applied even to agents that do not affect the environment

MA-POCA

To address the above issues, we proposed a novel multi-agent reinforcement learning method called MA-POCA. It is an improvement on COMA (Foerster et al. 2018), which uses self-attention for active agents before critique network input, thereby addressing the problem of posthumous credit assignment without using absorbing states. This approach addresses the problem of posthumous credit assignment.

MA-POCA Value Function (Update of Critic)

In this setting, the number of active agents depends on the time step t. Therefore, let $k_t$ denote the number of active agents at time step t such that 1 ≤ $k_t$ ≤ N (where N is the maximum number of agents that can survive at any given time), the output of Critic (the state value function) is

The following is the result. where $g_i(o^i_t)_{1\leq i \leq k_t}$ is the encode observations of all active agents and RSA is ResidualSelfAttention. Then, we define the objective function as

The training will be performed as follows. $k_{t+n}$ is the number of agents active at time t+n. The $k_{t+n}$ can be larger or smaller than $k_{t}$. This is because at time step t, any number of agents may be terminated early or new agents may be created.

MA-POCA Counterfactual Baseline (Actor Updates)

$1\leq i\leq k_t, i\neq j$ We learn the counterfactual baseline of an agent j by learning a value function conditional on the observation-action pairs of all agents i such that Here, we consider observations and observation-action pairs as separate entities and, as in the Critic's update, we use RSA blocks and observation and observation-action encoders to learn the baseline of agent j as

then the advantage function of agent j is defined to be

is the same as Note that $y^{(\lambda)}$ is the same as the one used to update Critic.

experiment

We empirically evaluate MA-POCA in four multi-agentenvironmentsand compare its performance to that of COMA, a multi-agent reinforcement learning method, and PPO, a single-agent reinforcement learning method. Toolkit and the remaining one is from Multi-Agent Particle Environments. A detailed description of the environment follows.

(a) Collaborative Push Block: Agents (blue, yellow, purple) push the white block to the green area. Larger blocks require more agents to push
(b) Simple Spread: Agents (purple) must move to cover the target (black) without bumping into each other
(c) Baton Pass: The blue agent grabs the green FOOD and presses the green button to spawn another agent and grab the next FOOD, and so on!
(d) Dungeon Escape: The blue agent must defeat the green dragon and sacrifice one of them to get the key out. Your teammates must pick up the key and get to the door while avoiding the pink dragon!

The experimental results are as follows: MA-POCA outperformed the other methods in all four environments, especially in the environments with agent creation and annihilation ((c) and (d)). Conversely, PPO failed to find optimal strategies in all four environments and converged to a local optimum. This may be due to partial observability brought about by decentralized learning and action. We also find that MA-POCA converges faster than COMA, which may be due to the inefficient input representation of Critic in COMA due to the absorbing state.

Conclusion.

In this paper, we present a novel multi-agent reinforcement learning method to solve the Posthumous Credit Assignment ) problem, a novel multi-agent reinforcement learning method for solving the problem. MA-POCA is proposed. In conventional MARL, this problem is handled by applying an absorbing state to agents that terminate early. MA-POCA, however, can train agents without using an absorbing state by using Attention. In our experiments, we demonstrated that MA-POCA outperforms COMA and PPO on the MARL task. In the future, we plan to investigate possible forms of other algorithms in the distributed POMDP framework for problems where the maximum number of agents N is unknown.