# [DPO] A Method For Directly Matching Large-scale Language Models To User Preferences Without Using Reinforcement Learning

**3 main points**

✔️ Matching LLM output to user preferences requires reflecting user feedback

✔️ Traditionally, reward models are trained based on user feedback, and large-scale language models are fine-tuned by reinforcement learning to maximize the reward (RLHF)

✔️ The proposed method (DPO) fine-tunes large-scale language models directly from user feedback data, achieving stable optimization and lightweight computation

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

written by Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn

(Submitted on 29 May 2023 (v1), last revised 13 Dec 2023 (this version, v2))

Comments: Published on arxiv.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)


The images used in this article are from the paper, the introductory slides, or were created based on them.

## Introduction

A term that became famous together with LLMs is RLHF, which stands for Reinforcement Learning from Human Feedback. As the name says, it is a technique for fine-tuning language models through reinforcement learning based on human feedback.

Why is such fine-tuning necessary? Machines produce outputs that follow human instructions faithfully, but for anything the instructions leave unstated, they may not return the answer that humans intend.

For example, if the instruction is to maximize the speed of a car, the machine would be able to increase the speed faithfully to the instruction because speed is a mathematically definable value and can be quantitatively measured.

However, is speed the only important thing in a car? What if the car is fast but shakes badly? From a human's point of view, a car that is fast but shakes so badly that it is hard to ride in would be unacceptable. Some people might say to the machine, "Being fast is not enough. You would understand that if you thought about it."

And what if the instruction is to maximize fun? It seems difficult to define and measure funniness mathematically. On the other hand, it is relatively easy to judge which of two comedy duos was funnier, whether or not the judgment is objective enough to convince many people. It is also human nature to see the actual product and say, "It's not what I expected." Thus, getting direct feedback on the output is useful for getting closer to the intended answer.

RLHF allows users to give direct feedback on preferences that were left out of the instructions specified in advance.

RLHF became famous together with LLMs because, in deploying them to the general public, several behaviors had to be suppressed: producing errors known as hallucinations, which people would have to point out and correct; giving answers with details that people find unnatural; picking up biases in the data and making unethical statements; and spreading dangerous or harmful information simply because the user asked for it.

The technique described in this article is a fine-tuning technique based on user feedback that is more stable and lightweight than RLHF. At the top conference NeurIPS 2023, it was selected as one of the two runner-ups (Outstanding Main Track Runner-Ups) to the two top papers (Outstanding Main Track Papers).

Let me explain how it works and its effects.

## Comparison of conventional RLHF and proposed DPO configurations

Figure 1 shows the difference in configuration between the conventional RLHF and the proposed DPO.

### Conventional RLHF

As shown in the figure, conventional RLHF first fits the parameters of a reward model to the preference data by maximum likelihood estimation. The large-scale language model policy then generates outputs (sample completions), and the reward model assigns rewards to them (label rewards).

This process is repeated, and the policy parameters of the large-scale language model are learned by reinforcement learning so that the reward is maximized. It might seem more natural to say simply that the LLM's model parameters are learned to maximize the reward, but the word "policy" is used here because it is the standard term in reinforcement learning. A policy is a probability distribution over actions for a given state. Here, the state is the input x and the action is the output y, so the policy is the distribution that gives the probability of y being generated for x.

Here, preference data is data that indicates which of two outputs is better for a given input to the LLM. For example, given an input x to the LLM, "write me a poem about the history of jazz," the LLM samples two outputs y1 and y2 for this input. The output that the user prefers is labeled y_w and the one that the user does not prefer is labeled y_l, and the triple (x, y_w, y_l) forms one preference data point.
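
As a minimal sketch of how one such data point could be represented, the following uses a small record type. The names `PreferencePair` and `make_pair` are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference data point (x, y_w, y_l)."""
    prompt: str    # x: input given to the LLM
    chosen: str    # y_w: output the user preferred
    rejected: str  # y_l: output the user did not prefer


def make_pair(prompt: str, y1: str, y2: str, user_prefers_first: bool) -> PreferencePair:
    """Turn two sampled outputs and the user's choice into (x, y_w, y_l)."""
    if user_prefers_first:
        return PreferencePair(prompt, chosen=y1, rejected=y2)
    return PreferencePair(prompt, chosen=y2, rejected=y1)


# Placeholder strings stand in for the two sampled poems.
pair = make_pair(
    "write me a poem about the history of jazz",
    "(placeholder for sampled output y1)",
    "(placeholder for sampled output y2)",
    user_prefers_first=False,
)
```

Note that only the relative judgment (which of the two is better) is stored; no absolute score is needed.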

This preference data is used to train a reward model that returns a reward r for a pair (x, y). The reward model r_Φ is based on the Bradley-Terry model, in which the probability that y_w is preferred over y_l is defined by the preference distribution p*(y_w ≻ y_l | x) = σ(r*(x,y_w) - r*(x,y_l)), where σ is the sigmoid function and r* is the true reward. The parameters Φ are estimated by maximum likelihood, so that log σ(r_Φ(x,y_w) - r_Φ(x,y_l)) is maximized over the preference data.
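
The Bradley-Terry preference probability and the corresponding negative log-likelihood can be sketched as follows. This is an illustrative toy, assuming the rewards for the preferred and dispreferred outputs are already available as plain numbers; in practice they would come from a neural reward model, and the function names here are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bt_preference_prob(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the output with reward r_w is preferred."""
    return math.exp(log_sigmoid(r_w - r_l))


def reward_model_loss(reward_pairs) -> float:
    """Average negative log-likelihood over pairs (r(x, y_w), r(x, y_l))."""
    total = sum(-log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs)
    return total / len(reward_pairs)


# Made-up rewards: a model that ranks the preferred output higher gets a lower loss.
loss_good = reward_model_loss([(2.0, 0.0)])
loss_bad = reward_model_loss([(0.0, 2.0)])
```

Only the difference of the two rewards enters the probability, which is why adding the same constant to both rewards changes nothing.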

Now that the goodness of the LLM output can be evaluated as a reward, the policy of the large-scale language model is trained to maximize it. However, the expected reward is not the only thing optimized: the reward is maximized while the KL divergence between the original policy and the updated policy is kept small, so that the two do not drift too far apart.

KL divergence is a measure of how far apart two distributions are. Since policies are probability distributions, a distance-like measure for probability distributions is used. This constraint is intended to prevent model collapse: if a large-scale language model overfits the user feedback, it may return the desired output for inputs covered by the feedback but incorrect output for other inputs. The constraint keeps the update a fine-tuning of the LLM policy and nothing more.
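
To make the idea of a distance between policies concrete, here is a small sketch that computes the KL divergence between discrete distributions. The three-outcome distributions are made-up numbers, not values from the paper; a real LLM policy is a distribution over an enormous space of token sequences.

```python
import math


def kl_divergence(p, q) -> float:
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


pi_ref = [0.5, 0.3, 0.2]       # original policy
pi_close = [0.45, 0.35, 0.2]   # lightly updated policy
pi_far = [0.98, 0.01, 0.01]    # policy that has collapsed onto one output

kl_close = kl_divergence(pi_close, pi_ref)  # small penalty
kl_far = kl_divergence(pi_far, pi_ref)      # much larger penalty
```

KL divergence is not symmetric, so it is not a distance in the strict mathematical sense, but it is zero when the two distributions are identical and grows as they move apart.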

The transformer, the base model of LLMs, is built like a conventional neural network so that the gradient of the loss function with respect to the model parameters can be computed, which is what makes it possible to train a very large number of parameters efficiently. However, the reward is assigned to outputs sampled from the LLM, and no differentiable expression relating the learned reward to the LLM's model parameters is known, so the parameters cannot be optimized directly with gradients as in ordinary neural network training.

PPO (Proximal Policy Optimization) is the standard reinforcement learning method for LLM policy fine-tuning. It is suited to optimization problems with high-dimensional states and actions, which makes it suitable for reinforcement learning of LLM policies.

### Proposed DPO

DPO stands for Direct Preference Optimization, and as the name suggests, the key point is that it optimizes for preferences directly. The paper argues that, whereas RLHF fine-tunes the LLM indirectly through a reward model and reinforcement learning, the LLM can be fine-tuned directly from preference data without reinforcement learning, as shown in the figure.

Why is direct learning possible with DPO?

## Why DPO allows direct LLM fine tuning

DPO allows direct LLM fine-tuning because it derives an equation that relates the reward model to the optimal policy (Point 1). Through this equation, the reward model and the optimal policy become interchangeable, so the loss function for the reward model can be rewritten, by a change of variables, as a loss function for the policy (Point 2). RLHF minimizes the loss function of the reward model to obtain the reward model, whereas DPO minimizes the loss function of the policy and learns the optimal policy directly.

### Point 1: Equation relating the reward function to the optimal policy (DPO expresses the true reward in terms of the optimal policy)

The optimization problem for reinforcement learning was Equation 1. Its first term is the output of the reward model, and its second term is a constraint that prevents the policy from changing too much.
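
The equation image is not reproduced in this text. Assuming that Equation 1 refers to the KL-constrained reward maximization objective of the DPO paper, it can be written as below. The symbol β (the weight of the KL term) and the dataset symbol D are taken from the paper and are assumptions with respect to this article; r_Φ follows the article's notation for the learned reward model.

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(y \mid x)}
\bigl[ r_\Phi(x, y) \bigr]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}
\bigl[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \bigr]
```

A larger β keeps the updated policy π_θ closer to the original policy π_ref, and a smaller β lets the reward dominate.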

In fact, according to existing literature, the optimal solution of Equation 1, that is, the optimal policy, is given by Equation 2 below.

Equation 2 gives the relationship among the optimal policy, the original policy π_ref, and the true reward function r.
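
The equation image is not reproduced in this text. Assuming that Equation 2 is the closed-form solution given in the DPO paper, it reads as follows, where β is the weight of the KL term (a symbol taken from the paper, not from this article) and Z(x) is the normalization constant.

```latex
\pi_r(y \mid x)
= \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x)
  \exp\!\left( \frac{1}{\beta} \, r(x, y) \right),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)
       \exp\!\left( \frac{1}{\beta} \, r(x, y) \right)
```

In words, the optimal policy reweights the original policy by the exponentiated reward: outputs with higher reward become more likely, and outputs to which π_ref assigns almost no probability stay unlikely.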

So an equation relating the reward model to the optimal policy already exists in prior research, but there is one problem: estimating Z(x). Even if the true reward model is approximated by the maximum likelihood estimate r_Φ, the remaining Z(x) is troublesome. Z(x) is a normalization constant: without it, the right-hand side would not satisfy the rules of a probability distribution (values between 0 and 1 that sum to 1), so the expression is divided by Z(x) to make it a probability.

It is a difficult value to compute because it requires summing, over every possible output y, a quantity proportional to the probability. One way to estimate it is the Monte Carlo method, which relies on the principle that averaging over a large number of samples drawn from a distribution approaches the true value, but it takes a great deal of computation to improve the accuracy.
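
The following toy sketch contrasts the exact sum with a Monte Carlo estimate. It assumes Z(x) has the form Σ_y π_ref(y|x) exp(r(x,y)/β) from the DPO paper, and it uses a made-up output space of only four candidates so that the exact value can be computed; for a real LLM the number of possible outputs is far too large to enumerate.

```python
import math
import random

# Toy setting (all values are made up): 4 candidate outputs for one prompt x.
pi_ref = [0.4, 0.3, 0.2, 0.1]     # pi_ref(y | x)
rewards = [1.0, 0.0, 2.0, -1.0]   # r(x, y)
beta = 1.0


def exact_z(pi_ref, rewards, beta) -> float:
    """Exact partition function: sum over every possible output."""
    return sum(p * math.exp(r / beta) for p, r in zip(pi_ref, rewards))


def monte_carlo_z(pi_ref, rewards, beta, num_samples, seed=0) -> float:
    """Estimate Z(x) by sampling y from pi_ref and averaging exp(r / beta)."""
    rng = random.Random(seed)
    samples = rng.choices(range(len(pi_ref)), weights=pi_ref, k=num_samples)
    return sum(math.exp(rewards[y] / beta) for y in samples) / num_samples


z_exact = exact_z(pi_ref, rewards, beta)
z_small = monte_carlo_z(pi_ref, rewards, beta, num_samples=100)
z_large = monte_carlo_z(pi_ref, rewards, beta, num_samples=100_000)
```

The estimate with more samples is typically closer to the exact value, but each sample corresponds to generating and scoring one full output, which is where the computational cost comes from.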

Therefore, Equation 2 is rearranged to solve for the reward, giving Equation 3. The true reward is then expressed as a sum of two terms: one based on the log ratio of the optimal policy to the original policy, and one based on the log of the partition function Z(x).
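
The equation image is not reproduced in this text. Assuming Equation 2 has the closed form from the DPO paper (restated in the comment), the rearrangement into Equation 3 can be worked out as follows. The symbol β is taken from the paper.

```latex
% Assumed form of Equation 2:
%   \pi_r(y \mid x) = \frac{1}{Z(x)} \pi_{\mathrm{ref}}(y \mid x)
%                     \exp\!\left( \tfrac{1}{\beta} r(x, y) \right)
% Take the logarithm of both sides, then solve for r(x, y):
\begin{aligned}
\log \pi_r(y \mid x)
  &= \log \pi_{\mathrm{ref}}(y \mid x) + \frac{1}{\beta} \, r(x, y) - \log Z(x) \\
r(x, y)
  &= \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
     + \beta \log Z(x)
\end{aligned}
```

The second term depends only on the input x, not on the output y, which is what makes the next step possible.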

### Point 2: Change from a loss function of the reward to a loss function of the policy (learn the policy as if learning the RLHF reward model)

Recall that the preference distribution p* in the Bradley-Terry model assumed in RLHF was defined as σ(r*(x,y_w) - r*(x,y_l)). Substituting Equation 3 for r*, the partition function Z(x), which was so troublesome to compute, conveniently cancels out in the subtraction because it depends only on x, as in Equation 4 below. It no longer needs to be calculated at all.
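
The equation image is not reproduced in this text. Assuming Equation 3 has the form r*(x,y) = β log(π*(y|x)/π_ref(y|x)) + β log Z(x) as in the DPO paper, the cancellation can be written out step by step. The symbols β and π* (the optimal policy) follow the paper.

```latex
\begin{aligned}
p^*(y_w \succ y_l \mid x)
 &= \sigma\bigl( r^*(x, y_w) - r^*(x, y_l) \bigr) \\
 &= \sigma\Bigl(
      \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      + \beta \log Z(x)
      - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      - \beta \log Z(x)
    \Bigr) \\
 &= \sigma\Bigl(
      \beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \Bigr)
\end{aligned}
```

The two β log Z(x) terms appear with opposite signs and the same x, so they cancel exactly.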

At the same time, since this has the same form as learning the reward model in RLHF, the policy model parameters that maximize the likelihood of the preference data under p* can be estimated in the same way, as in Equation 5.
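
The equation image is not reproduced in this text. Assuming Equation 5 is the DPO loss from the paper, it is the negative log-likelihood of the preference data with the optimal policy replaced by the trainable policy π_θ. The symbols β and D (the preference dataset) follow the paper.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= - \, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
\Bigl[
  \log \sigma \Bigl(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \Bigr)
\Bigr]
```

Every quantity in this loss is a log-probability that the language model can compute directly, so ordinary gradient descent applies.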

Thus, the preference distribution, which in RLHF was parameterized by the reward model, has been successfully re-parameterized by the policy. In RLHF, the reward model is obtained by minimizing this loss function and the policy is then trained by reinforcement learning; in DPO, the policy parameters are obtained directly by minimizing this loss function with respect to them. This eliminates the need for reinforcement learning and makes the computation lightweight.
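
As a rough sketch of what minimizing this loss involves, the following computes the DPO loss from log-probabilities of whole responses, assuming the loss has the form -log σ(β[log π_θ(y_w|x)/π_ref(y_w|x) - log π_θ(y_l|x)/π_ref(y_l|x)]) given in the paper. The dictionary keys, the value β = 0.1, and the numbers are hypothetical; a real implementation would obtain the log-probabilities from the policy and reference models and backpropagate through the policy terms.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(batch, beta: float = 0.1) -> float:
    """Average DPO loss over a batch of preference pairs.

    Each example holds sequence log-probabilities:
    log pi_theta(y_w|x), log pi_theta(y_l|x), log pi_ref(y_w|x), log pi_ref(y_l|x).
    """
    total = 0.0
    for ex in batch:
        chosen_logratio = ex["policy_chosen"] - ex["ref_chosen"]
        rejected_logratio = ex["policy_rejected"] - ex["ref_rejected"]
        total -= log_sigmoid(beta * (chosen_logratio - rejected_logratio))
    return total / len(batch)


# Hypothetical log-probabilities for one preference pair.
unchanged = {"policy_chosen": -12.0, "policy_rejected": -10.0,
             "ref_chosen": -12.0, "ref_rejected": -10.0}
improved = {"policy_chosen": -9.0, "policy_rejected": -13.0,
            "ref_chosen": -12.0, "ref_rejected": -10.0}

loss_unchanged = dpo_loss([unchanged])  # policy identical to reference: log(2)
loss_improved = dpo_loss([improved])    # policy favors the chosen response: lower
```

No reward model, no sampling from the policy, and no estimate of Z(x) appear anywhere in the computation, which is the source of the lightweight training described above.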

The paper also argues that while existing research on LLM optimization by reinforcement learning uses human output as a baseline and normalizes rewards to stabilize the optimization, DPO does not require such normalization.

## Assessment Results

What are the actual evaluation results of conventional RLHF and DPO?

The results of the evaluation of the sentiment generation and summarization capabilities of the conventional RLHF and DPO are shown in Figure 2.

Since many methods are compared, we focus on conventional PPO and the proposed DPO, whose mechanisms were compared above, when explaining the figures.

The left panel of Figure 2 shows the evaluation results of the sentiment generation task (IMDb Sentiment Generation). The evaluation uses IMDb, a database of movie and TV show reviews. Here x is a prefix of a movie review from the database, and the large-scale language model must generate a continuation y with positive sentiment. Because this is a controlled experiment, a pre-trained sentiment classifier is used to evaluate whether the sentiment is positive.

In the left panel of Figure 2, the horizontal axis is the distance between the original model and the updated model measured by KL divergence, and the vertical axis is the reward. At every KL divergence, the ochre-colored DPO (Ours) achieves a higher reward than the pink-colored conventional PPO.

Both the DPO and RLHF optimization problems are KL-divergence-constrained, so this plot examines how well each method balances the trade-off between reward and constraint. Ideally, loosening the constraint (allowing a larger KL divergence) increases the freedom of the parameters being optimized, so higher rewards should be obtained. (Once the optimum lies within that freedom, the growth in reward should level off.)

DPO shows a sharp increase in reward as the constraint is loosened, which suggests that it maximizes the reward well within the given degrees of freedom. It even achieves a higher reward than RLHF given the true reward model. In that setting the reward-model learning part of RLHF is in an ideal state, so the only remaining factor affecting performance is how well the reinforcement learning optimizes. That DPO still does better suggests that it optimizes more efficiently than reinforcement learning.

The right panel of Figure 2 shows TL;DR Summarization, an evaluation using a dataset of Reddit TL;DR (Too Long; Didn't Read) summaries with human preferences, where x is a Reddit forum post and y is its summary.

In the right panel of Figure 2, the horizontal axis is the sampling temperature and the vertical axis is the win rate. Sampling temperature is one of the generation parameters of an LLM: the larger it is, the more diverse the responses obtained across samples, and the smaller it is, the more similar the responses are even after repeated sampling. The win rate is the fraction of comparisons in which GPT-4, acting as the evaluator, judges the method's response to be the better one.
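
The effect of sampling temperature can be illustrated with a small sketch. It assumes the common convention that the model's logits are divided by the temperature before the softmax; the function name and the three logits are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature: float):
    """Token probabilities after dividing logits by the temperature (must be > 0)."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.25)  # low temperature: top token dominates
flat = softmax_with_temperature(logits, 2.0)    # high temperature: closer to uniform
```

A temperature of exactly 0 cannot be plugged into this formula; it is usually treated as always picking the highest-scoring token, which is why repeated sampling then gives the same response.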

TL;DR Summarization shows that the proposed DPO (ochre) has a higher win rate than the conventional PPO (pink) regardless of sampling temperature. PPO peaks at sampling temperature 0 and declines steadily, while DPO maintains a high win rate at sampling temperatures below 1.

## Conclusion

In this article, we explained Direct Preference Optimization (DPO), a technique that, unlike conventional RLHF, enables fine-tuning from preference data without reinforcement learning. With large-scale language models, much of the attention has gone to learning from large amounts of data. This work showed, however, that the trial-and-error process of reinforcement learning by machines can be reduced by deriving a relational equation through human reasoning. Rather than learning everything from data, it is likely more efficient for humans to supply the relationships that can be derived logically. Put another way, it seems worthwhile to ask whether the parameters currently being estimated can be converted into the parameters we truly want to obtain.
