[Cross-Ensemble Representation Learning] Overcoming Diversity Challenges in Deep Reinforcement Learning
3 main points
✔️ CERL improves the performance of both individual ensemble members and the aggregated policy
✔️ Learning efficiency is improved through value-function learning across ensemble members as an auxiliary task
✔️ Evaluations on Atari games and MuJoCo tasks confirm the method's effectiveness
The Curse of Diversity in Ensemble-Based Exploration
written by Zhixuan Lin, Pierluca D'Oro, Evgenii Nikishin, Aaron Courville
(Submitted on 7 May 2024)
Comments: Published as a conference paper at ICLR 2024
Subjects: Machine Learning (cs.LG)
code:
The images used in this article are from the paper, the introductory slides, or were created based on them.
Summary
The study finds that exploration strategies built on diverse ensembles of data-sharing agents, while expected to improve exploration efficiency, can in practice degrade the performance of the individual agents. The degradation arises because each ensemble member trains on a shared replay buffer in which only a small fraction of the data is self-generated (roughly 1/N for an ensemble of N members), forcing it into heavily off-policy learning. The authors name this problem the "curse of diversity" and provide a detailed analysis of its effects and of possible countermeasures.
Related Research
The Related Research section of the paper discusses how ensemble-based exploration strategies have evolved in deep reinforcement learning (deep RL). In particular, it cites studies that improve exploration efficiency by letting multiple agents share data while learning diverse policies. By running several different policies simultaneously during training, these methods aim to enlarge the portion of the state-action space that is explored and to form a more robust aggregated policy.
However, this paper focuses on a potential problem with these exploration strategies. A major challenge for ensembles of diverse agents is reduced learning efficiency, because only a small fraction of the data each individual agent trains on is self-generated. The difficulty stems from off-policy learning, i.e., learning from data generated by other agents. The authors point out that previous studies have not properly evaluated this phenomenon, and they present experimental results showing how far the performance of ensemble members falls below that of comparable single agents.
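The following is a minimal, self-contained sketch (not code from the paper) of the data-sharing setup described above: every ensemble member writes its transitions into one shared replay buffer, so each member ends up training mostly on data generated by the others. All names here (SharedReplayBuffer, the round-robin collection loop, the placeholder transitions) are hypothetical choices made for illustration.

```python
import random
from collections import deque

NUM_MEMBERS = 10          # ensemble size N (illustrative choice)
BUFFER_CAPACITY = 100_000

class SharedReplayBuffer:
    """One replay buffer shared by all ensemble members; each transition is
    tagged with the id of the member that generated it."""
    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition, member_id):
        self.storage.append((transition, member_id))

    def sample(self, batch_size):
        return random.sample(list(self.storage), batch_size)

buffer = SharedReplayBuffer(BUFFER_CAPACITY)

# Round-robin data collection: each member interacts with the environment in
# turn and pushes its transitions into the single shared buffer.
for step in range(10_000):
    member_id = step % NUM_MEMBERS
    transition = ("state", "action", 0.0, "next_state")   # placeholder transition
    buffer.add(transition, member_id)

# Any one member therefore sees only about 1/N self-generated data when it
# samples a training batch, which is the root of the "curse of diversity".
batch = buffer.sample(256)
own_fraction = sum(1 for _, mid in batch if mid == 0) / len(batch)
print(f"fraction of member 0's own data in a sampled batch: {own_fraction:.2f}")  # ~0.10
```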
Proposed Method: Cross-Ensemble Representation Learning (CERL)
The proposed Cross-Ensemble Representation Learning (CERL) algorithm consists of the following steps. The algorithm aims to overcome the curse of diversity through an auxiliary task in which each ensemble member also learns the value functions of the other members. The overview diagram illustrates the overall process, and a minimal code sketch of the steps is given after the list.
1. Initialize the ensemble
Each agent has its own policy and value function. These can be initialized with fully independent parameters or with partially shared ones. In the overview diagram, the independent policies are depicted as a separate network for each agent.
2. Data collection
Each agent collects data independently from the environment. This data is stored in a central replay buffer and is accessible by all agents. The figure shows how each agent collects a different set of data and sends it to the shared replay buffer.
3. Setting up auxiliary tasks
In addition to its main task of learning its own value function, each agent performs an auxiliary task of predicting the value functions of the other agents. Capturing the other agents' behavior patterns and value estimates in this way yields more general representation learning. The figure depicts how the main head Q_i(s, a) and the auxiliary heads Q_j^i(s, a) work together across agents.
4. Learning process
Through mini-batch training, each agent optimizes the main task and the auxiliary task simultaneously: the auxiliary objective is added to the loss function so that the agent also learns to accurately predict the other agents' value functions. The figure visually shows how these learning processes are integrated and interact.
5. Policy updates and evaluation
The learned policies are periodically evaluated in the environment and their performance is tracked. This gives a clear picture of the algorithm's progress and of how well each agent is learning. The overview diagram depicts how each agent's performance is measured during the evaluation phase.
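The following is a minimal PyTorch-style sketch of steps 1, 3, and 4 for a DQN-like discrete-action setting. It is not the authors' implementation: the network sizes, the auxiliary loss weight AUX_WEIGHT, and the choice of regressing each auxiliary head onto a peer's detached online Q-values (rather than, say, the peer's TD targets) are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_MEMBERS, OBS_DIM, NUM_ACTIONS, GAMMA, AUX_WEIGHT = 4, 8, 6, 0.99, 1.0

class CERLMember(nn.Module):
    """Ensemble member i: a trunk, a main head Q_i, and one auxiliary head per
    other member j that predicts Q_j (steps 1 and 3)."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU())
        self.main_head = nn.Linear(128, NUM_ACTIONS)
        self.aux_heads = nn.ModuleList(
            [nn.Linear(128, NUM_ACTIONS) for _ in range(NUM_MEMBERS - 1)]
        )

    def forward(self, obs):
        z = self.trunk(obs)
        return self.main_head(z), [head(z) for head in self.aux_heads]

members = [CERLMember() for _ in range(NUM_MEMBERS)]
targets = [CERLMember() for _ in range(NUM_MEMBERS)]  # target nets (synced periodically in practice)
optims = [torch.optim.Adam(m.parameters(), lr=1e-4) for m in members]

def update_member(i, batch):
    """One CERL update for member i on a batch from the shared buffer (step 4)."""
    obs, act, rew, next_obs, done = batch
    q_main, q_aux = members[i](obs)

    # Main task: standard TD loss for Q_i.
    with torch.no_grad():
        next_q = targets[i](next_obs)[0].max(dim=1).values
        td_target = rew + GAMMA * (1.0 - done) * next_q
    q_taken = q_main.gather(1, act.unsqueeze(1)).squeeze(1)
    main_loss = F.mse_loss(q_taken, td_target)

    # Auxiliary task (step 3): regress each auxiliary head of member i onto the
    # (detached) value predictions of the corresponding peer member j != i.
    aux_loss = 0.0
    peers = [j for j in range(NUM_MEMBERS) if j != i]
    for head_out, j in zip(q_aux, peers):
        with torch.no_grad():
            peer_q = members[j](obs)[0]
        aux_loss = aux_loss + F.mse_loss(head_out, peer_q)

    loss = main_loss + AUX_WEIGHT * aux_loss
    optims[i].zero_grad()
    loss.backward()
    optims[i].step()
    return loss.item()

# Steps 2 and 5 (data collection / periodic evaluation) would wrap this update:
# each member acts in the environment with its own policy, pushes transitions
# into the shared buffer, and is evaluated at fixed intervals.
batch = (torch.randn(32, OBS_DIM), torch.randint(0, NUM_ACTIONS, (32,)),
         torch.randn(32), torch.randn(32, OBS_DIM), torch.zeros(32))
print(update_member(0, batch))
```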
Experiment
The paper evaluates Cross-Ensemble Representation Learning (CERL) experimentally on 55 Atari games and 4 MuJoCo tasks. CERL is compared against the plain ensemble methods Bootstrapped DQN and Ensemble SAC, against the single-agent baselines Double DQN and SAC, and against Bootstrapped DQN with network sharing. The experimental results are presented in the paper as Figure 7.
Atari games (top of Figure 7): Across the 55 Atari games, applying CERL to the ensemble baseline improves the performance of both the individual ensemble members and the ensemble-wide policy, with the largest gains for the aggregated policy.
MuJoCo tasks (bottom of Figure 7): The impact of replay buffer size was also examined. With a 0.2M replay buffer, CERL reduces the performance gap between Ensemble SAC and single-agent SAC from roughly 2,500 to roughly 500.
These experiments show that CERL mitigates the curse of diversity and improves performance for both individual agents and aggregated policies. The error bars in Figure 7 indicate 95% bootstrap confidence intervals, which strengthens confidence in the results.
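As a side note on how such error bars are typically produced, the following is a small sketch of a percentile-bootstrap 95% confidence interval for a mean score over runs. The scores are made up for illustration; this is not the paper's evaluation code.

```python
import numpy as np

rng = np.random.default_rng(0)
run_scores = np.array([812.0, 905.0, 778.0, 951.0, 864.0])  # hypothetical per-seed returns

# Resample the runs with replacement many times and take the 2.5th and 97.5th
# percentiles of the resampled means as the 95% confidence interval.
boot_means = [rng.choice(run_scores, size=run_scores.size, replace=True).mean()
              for _ in range(10_000)]
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {run_scores.mean():.1f}, 95% bootstrap CI = [{low:.1f}, {high:.1f}]")
```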
Conclusion
This study shows that Cross-Ensemble Representation Learning (CERL) is an effective method for mitigating the "curse of diversity" in deep reinforcement learning. Through representation learning among ensemble members, CERL improves the performance not only of the individual agents but also of the aggregated policy. In the future, the technique is expected to be applied to more reinforcement learning tasks and to contribute to the development of more efficient algorithms. Potential improvements from applying it to complex real-world environments and from combining it with other learning strategies also remain to be explored.