Models Reward Themselves And Train Themselves!
3 main points
✔️ Proposes an approach that uses LLM-as-a-Judge prompts to let LLMs provide their own rewards during training
✔️ Alternating Self-Instruction creation and Instruction following training enables iterative self-improvement of the model
✔️ Comparative experiments show that the proposed model outperforms many existing models on the AlpacaEval 2.0 leaderboard
Self-Rewarding Language Models
written by Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston
(Submitted on 18 Jan 2024 (v1), last revised 8 Feb 2024 (this version, v2))
Comments: Published on arXiv.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
The images used in this article are from the paper, the introductory slides, or were created based on them.
Introduction
In recent years, a great deal of research has been conducted to improve the performance of Large Language Models (LLMs) such as ChatGPT, and it has become clear that training LLMs on human preference data (data in which humans compare candidate model responses and indicate which they prefer) can significantly improve model performance.
On the other hand, a major problem with these approaches is that the size and quality of the data become a bottleneck, because the models must be trained on human-prepared data.
This article describes a paper proposing Self-Rewarding Language Models, in which the language model itself provides its own rewards during training via LLM-as-a-Judge prompts and learns iteratively, eliminating the bottleneck of data size and quality. In comparative experiments, the resulting models outperform many existing models.
LLM-as-a-Judge
First, we explain LLM-as-a-Judge, which is used in the method proposed in this paper.
LLM-as-a-Judge is an automated evaluation technique in which an LLM itself acts as the evaluator. It has recently attracted attention as an evaluation method for generative AI and is used in this paper in the prompt format shown in the figure below.
This prompt instructs the model to evaluate the quality of a given response using five criteria (relevance, coverage, usefulness, clarity, and expertise).
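To make the prompt format more concrete, here is a minimal Python sketch of how such a judge prompt could be assembled and its score parsed. The rubric wording is illustrative rather than the paper's exact prompt, and `model.generate` is a hypothetical stand-in for whatever inference call is actually used.

```python
import re

# Illustrative judge prompt in the spirit of the paper's additive scoring rubric
# (not the paper's exact wording).
JUDGE_PROMPT = """Review the user's question and the corresponding response.
Award one point for each criterion the response satisfies:
relevance, coverage, usefulness, clarity, and expert knowledge.

User: {instruction}
Response: {response}

Conclude with the score using the exact format: "Score: <total points>"."""


def score_response(model, instruction: str, response: str) -> int:
    """Ask the model to judge a response and parse out an integer score (0-5)."""
    judgement = model.generate(  # hypothetical generation helper
        JUDGE_PROMPT.format(instruction=instruction, response=response)
    )
    match = re.search(r"Score:\s*(\d)", judgement)
    return int(match.group(1)) if match else 0
```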
Self-Rewarding Language Models
An overview of the Self-Rewarding Language Models proposed in this paper is shown in the figure below.
As shown in the figure, Self-Rewarding Language Models consist of two steps: Self-Instruction creation and Instruction following training.
Self-Instruction creation
In this step, model Mt receives newly generated prompts (Generated new prompts) and generates high-quality candidate responses for them (Generate responses).
Model Mt also predicts its own rewards for these responses (Generate reward) via the aforementioned LLM-as-a-Judge prompt, and these rewards are used in the next step.
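A compact sketch of what this step might look like in code is shown below. It reuses the `score_response` sketch above, assumes the same hypothetical `model.generate` helper, and simplifies the few-shot prompting the paper uses to generate new instructions.

```python
import random


def self_instruction_creation(model, seed_prompts, num_new_prompts=10,
                              candidates_per_prompt=4):
    """Sketch of Self-Instruction creation: model Mt generates new prompts,
    candidate responses, and its own rewards for each candidate."""
    records = []
    for _ in range(num_new_prompts):
        # Generate a new prompt by few-shot prompting with sampled seed instructions.
        examples = "\n".join(random.sample(seed_prompts, k=3))
        new_prompt = model.generate(
            f"Here are some example tasks:\n{examples}\nWrite one new task:"
        )

        # Sample several candidate responses for the new prompt.
        candidates = [model.generate(new_prompt) for _ in range(candidates_per_prompt)]

        # The same model scores each candidate via the LLM-as-a-Judge prompt.
        rewards = [score_response(model, new_prompt, c) for c in candidates]

        records.append({"prompt": new_prompt,
                        "candidates": candidates,
                        "rewards": rewards})
    return records
```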
Instruction following training
In this step, a new dataset of preference pairs is created from the responses and rewards generated with the LLM-as-a-Judge prompt and used for training via DPO (Direct Preference Optimization), producing the next iteration's model Mt+1.
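As a rough illustration, preference pairs could be built by pairing each prompt's highest- and lowest-scoring candidates, and their sequence log-probabilities could then be fed into the standard DPO objective, as in the sketch below (the batching and log-probability computation of a real training run are omitted).

```python
import torch.nn.functional as F


def build_preference_pairs(records):
    """Pair the highest- and lowest-scoring candidate for each prompt (skip ties)."""
    pairs = []
    for r in records:
        best = max(range(len(r["rewards"])), key=lambda i: r["rewards"][i])
        worst = min(range(len(r["rewards"])), key=lambda i: r["rewards"][i])
        if r["rewards"][best] > r["rewards"][worst]:
            pairs.append((r["prompt"], r["candidates"][best], r["candidates"][worst]))
    return pairs


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective on the sequence log-probabilities of the chosen and
    rejected responses under the current policy and a frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```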
This step is repeated, starting from a seed model: in each iteration, candidate responses are generated by the current model for the newly created prompts, and rewards are assigned by that same model.
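Putting the two steps together, the overall loop might look like the following sketch, where `finetune_with_dpo` is a hypothetical placeholder for a full DPO training run and the seed model corresponds to M0 (the SFT baseline).

```python
def self_rewarding_training(seed_model, seed_prompts, num_iterations=3):
    """Iterate Self-Instruction creation and Instruction following training:
    M0 -> M1 -> M2 -> M3, each model trained on data produced by its predecessor."""
    model = seed_model  # M0: the seed (SFT) model
    for _ in range(num_iterations):
        records = self_instruction_creation(model, seed_prompts)  # prompts, responses, rewards
        pairs = build_preference_pairs(records)                   # preference pairs
        model = finetune_with_dpo(model, pairs)                   # hypothetical DPO training step -> M(t+1)
    return model
```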
The authors state in the paper that this process removes the bottleneck that limits conventional LLM training.
Experimental Setup
In this paper, Llama-2-70B was used as the base model, and experiments were conducted using two sets of seed data, IFT Seed Data and EFT Seed Data.
The IFT Seed Data consists of 3,200 conversational examples sampled from the Open Assistant dataset, keeping only the first turn of high-quality English conversations selected according to their human-annotated rank.
In addition, in this paper, a model fine-tuned from the base model using only this data is referred to as the SFT baseline and is used in the comparison experiments.
The EFT Seed Data is created by splitting the Open Assistant dataset into a training set and an evaluation set and converting it into the LLM-as-a-Judge prompt format, so that the model can also be trained as an evaluator.
In addition, to compare the performance of the proposed model along two axes, its ability to follow instructions and its ability as a reward model, this paper uses the AlpacaEval evaluation prompt and, following existing research, has GPT-4 act as the evaluator on 256 test prompts drawn from various sources.
In addition, this paper also reports the results of an evaluation with 805 prompts on the AlpacaEval 2.0 leaderboard.
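As a rough sketch of how such a pairwise evaluation could be computed, the function below estimates a head-to-head win rate; `gpt4_prefers` is a hypothetical helper that returns which of the two responses the external evaluator prefers.

```python
def head_to_head_win_rate(model_a, model_b, test_prompts, gpt4_prefers):
    """Estimate model_a's win rate against model_b over a set of test prompts,
    using an external evaluator (e.g. GPT-4) to pick the preferred response."""
    wins = 0
    for prompt in test_prompts:
        response_a = model_a.generate(prompt)
        response_b = model_b.generate(prompt)
        if gpt4_prefers(prompt, response_a, response_b) == "a":
            wins += 1
    return wins / len(test_prompts)
```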
Result
The results of the experiment with diverse prompts are shown in the figure below. (M1, M2, and M3 denote the models after 1, 2, and 3 training iterations, respectively.)
Experimental results show that Self-Rewarding M1 is comparable to the SFT Baseline (30.5% vs. 30.9%).
On the other hand, Self-Rewarding M2 significantly outperforms the SFT Baseline (49.2% vs. 14.5%), and Self-Rewarding M3 shows an even larger gap (62.5% vs. 9.8%).
In addition, in the M1 vs. M3, M1 vs. M2, and M2 vs. M3 comparisons, the model with more training iterations won in each case, demonstrating a significant improvement in model performance at each iteration.
Next, the table below shows the results of the experiment on the AlpacaEval 2.0 leaderboard. (Win Rate = win rate against GPT-4 Turbo)
From the table, we can see that the Win Rate improves with each training iteration: 9.94% for M1, 15.38% for M2, and 20.44% for M3.
In addition, the M3 model outperformed many existing models, including Claude 2, Gemini Pro, and GPT-4 0613, in terms of Win Rate.
Summary
How was it? In this article, we have described a paper proposing Self-Rewarding Language Models, in which the language model itself provides its own rewards during training via LLM-as-a-Judge prompts and learns iteratively, eliminating the bottleneck of data size and quality. Comparative experiments showed that the resulting models outperform many existing models.
While the experiments conducted in this paper demonstrate that iterative training with Self-Rewarding Language Models is effective, a caveat is that the experiments only ran for up to three iterations.
The authors' future research agenda includes understanding the scaling behavior (the tendency for performance to increase with the number of LLM parameters and the size of the dataset) as the number of iterations grows, and when using language models of varying capability in different settings.
As mentioned in the paper, while the performance gains from iteration in this method are likely to saturate in realistic scenarios, it opens the door to the possibility of continuously improving models without data constraints, and we very much look forward to further progress in this area.
For those who are interested, further details of the Self-Rewarding Language Models and the experimental results presented here can be found in the paper.