Two Rabbits, One Rabbit: The Trade-off Between Adjusting Controllable Models And Improving Performance

Computation And Language

3 main points
✔️ We propose a method that sets priorities over objectives in artificial intelligence (AI) alignment and tunes the model according to those priorities.
✔️ Experiments evaluated the controllability of SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), CPSFT (Controllable Preference Supervised Fine-Tuning), and CPO (Controllable Preference Optimization). The results showed that CPSFT and CPO were more controllable than the other methods.
✔️ Future research is needed to validate the practicality and effectiveness of CPO for real-world applications and industrial deployment; more complex alignment targets and new control methods could further improve its performance.

Controllable Preference Optimization: Toward Controllable Multi-Objective Alignment
written by Yiju Guo, Ganqu Cui, Lifan Yuan, Ning Ding, Jiexin Wang, Huimin Chen, Bowen Sun, Ruobing Xie, Jie Zhou, Yankai Lin, Zhiyuan Liu, Maosong Sun
(Submitted on 29 Feb 2024)
Comments: Published on arXiv.

Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)

code:  

The images used in this article are from the paper, from the introductory slides, or were created based on them.

Summary

Artificial intelligence (AI) alignment focuses on matching model responses to human preferences and values. However, human preferences are complex, and improving one objective may come at the expense of another; this is called the "alignment tax." Existing alignment methods optimize in a single direction and are inflexible across objectives. This paper therefore proposes a method that sets priorities over different objectives and tunes the model according to those priorities. Experiments show that this method produces responses that match preferences for "helpfulness," "honesty," and "harmlessness" (3H). In addition, using diverse data and objectives yields better results than conventional methods, mitigates the alignment tax, and improves alignment to multiple objectives.

Introduction

Large-scale language models (LLMs) are very useful as AI assistants for humans, and it is important that they operate in accordance with human preferences and values. Previous research has proposed the "3H" alignment goal: LLMs should be helpful, honest, and harmless. However, these goals are complex and sometimes conflict. For example, a helpful LLM should not refuse user questions, while a harmless one should reject dangerous requests, which creates a dilemma. Previous studies have proposed approaches to this problem but have not fully resolved it.

This study proposes a new algorithm, controllable preference optimization (CPO), to achieve multiple goals simultaneously. The algorithm controls the behavior of the LLM based on explicit preference conditions and can balance multiple goals.

(a) In multi-objective optimization, attempts to optimize multiple goals often result in conflicts among them.

(b) Suppose that in controllable generation, H1 is related to helpfulness and H2 is related to honesty. If only H1 is provided, the direction of optimization is restricted to a plane; if both H1 and H2 are provided, the direction of optimization is restricted to a straight line.
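To make the conditioning idea in (b) concrete, the minimal sketch below shows one way explicit preference levels could be encoded as control tokens prepended to the prompt. The token format, the 1-5 scale, and the function name are illustrative assumptions, not the paper's exact specification.

```python
# Minimal sketch (not from the paper): encode the desired level of each
# alignment objective as control tokens prepended to the user prompt.
# The token format and the 1-5 scale are illustrative assumptions.

def build_conditioned_prompt(user_prompt: str, helpfulness: int, honesty: int) -> str:
    """Prefix the prompt with preference tokens stating the target levels."""
    control = f"<Helpfulness:{helpfulness}> <Honesty:{honesty}>"
    return f"{control} {user_prompt}"

# Asking for maximally honest behavior, e.g. admitting uncertainty:
print(build_conditioned_prompt("Who won the 2050 World Cup?", helpfulness=3, honesty=5))
```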

Related Research

LLMs hold a great deal of knowledge, but they do not understand human intent on their own and must be aligned before being deployed in real systems. Previous research has focused on improving helpfulness and harmlessness, with less attention to alignment for honesty. Recent work has used supervised fine-tuning to train LLMs to handle questions beyond their knowledge boundaries by declining to answer or expressing uncertainty. Alignment also involves a problem known as the alignment tax: improving one aspect may force the LLM to compromise on another. Safety alignment against jailbreak prompts has been considered to address this, but excessive safety training can make the model refuse to respond at all. It is therefore important to mitigate the trade-offs in multi-objective optimization.

Research on controllable alignment at inference time is also ongoing. Various methods have been proposed to customize generation toward specific objectives and to align with different goals. The approach in this paper focuses on reducing conflicts between multiple alignment goals.

Proposed Method

The proposed Controllable Preference Optimization (CPO) algorithm allows multiple goals to be considered and aligned simultaneously when training AI models to reflect human values and preferences.

The figure above shows the overall framework of controllable preference optimization.

First, the CPO algorithm determines the direction in which to steer the model's behavior via preference tokens. This allows the model to be controlled to behave appropriately for a particular goal or condition.

One of the main ideas of the CPO algorithm is to transform a multi-objective optimization problem into a conditional multi-objective optimization problem, so that multiple goals and conditions can be optimized simultaneously. Specifically, objective functions representing human values and preferences are defined, and the model is trained to maximize them at the same time. This allows the model to be tuned to match multiple values.

The CPO algorithm consists of two stages: controllable preference supervised fine-tuning (CPSFT) and controllable direct preference optimization (CDPO). CPSFT fine-tunes the model to take preference conditions into account, while CDPO directly optimizes preferences under those conditions and aligns multiple goals simultaneously.

Combined, these methods allow models to respond appropriately to human values and preferences and adapt to complex situations; the CPO algorithm is a promising method for improving the performance and flexibility of AI systems.
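As a concrete but hedged illustration of the two stages, the sketch below assumes a Hugging Face style causal language model for CPSFT and a DPO-style pairwise loss for CDPO. The function names, data formatting, and β value are assumptions made for exposition, not the authors' implementation.

```python
import torch.nn.functional as F

def cpsft_loss(model, tokenizer, preference_tokens, prompt, response):
    """Controllable preference supervised fine-tuning (sketch): standard
    next-token cross-entropy on a sequence whose prompt is prefixed with
    preference tokens matching the response's rated quality."""
    text = f"{preference_tokens} {prompt} {response}"
    ids = tokenizer(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss  # causal-LM loss over the sequence

def cdpo_loss(policy_logp_chosen, policy_logp_rejected,
              ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Controllable DPO (sketch): a DPO-style objective in which 'chosen'
    and 'rejected' are defined relative to the stated preference condition
    rather than by a single global ranking."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()
```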

Experiments

The proposed controllable preference optimization (CPO) algorithm was evaluated for its performance.

Evaluation of the "3H" metrics (helpfulness, honesty, and harmlessness)

For the dataset and base-model setup, datasets such as UltraFeedback and UltraSafety were used to train safe and controllable models; in the CPSFT phase, the models were trained to strengthen multi-turn interaction.

The experiment evaluated the controllability of SFT (Supervised Fine-Tuning), DPO (Direct Preference Optimization), CPSFT (Controllable Preference Supervised Fine-Tuning), and CPO (Controllable Preference Optimization); the results showed that CPSFT and CPO achieved better controllability than the other methods.

Multi-objective alignment evaluation of CPO

The same alignment data were used to evaluate the effect of CPO and compare it to baselines such as Zephyr-7B-beta, Mistral-7B-Instruct-v0.2, WizardLM-7B, and LLaMA2-7B-Chat.

Results showed that CPO performed better than DPO, in particular achieving higher safety scores while maintaining helpfulness and honesty. This experiment showed that the CPO algorithm can effectively control the aspects of helpfulness, honesty, and harmlessness, achieving multiple goals simultaneously.

Pareto optimality evaluation

CPSFT and CPO were compared to two baselines to assess performance along the dimensions of helpfulness, honesty, and harmlessness. Using the trained models' responses, the trade-offs were explored to find the highest attainable score on each aspect. Results showed that CPO performed better than the other methods. The performance trade-offs for helpfulness (H1), honesty (H2), and harmlessness (H3) are shown below:

(a-c): specialized models trained on a subset of the highest 3H ratings

(d): SFT model trained on a mixture of the highest-rated subsets

(e-f): CPSFT and CPO models trained on the dataset

Sensitivity analysis

The impact of two important hyperparameters on the helpfulness and honesty objectives was investigated, revealing trade-offs between the relative importance of the objectives and between controllability and performance maximization.

The effects of various values of λ and ω on the controllability and performance of the model were investigated. As λ increases, controllability is enhanced, with helpfulness first improving and then decreasing. At ω = 0.4, a satisfactory balance between helpfulness and honesty is achieved.
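One plausible reading of these two hyperparameters, offered only as an illustration (the paper's exact formulation may differ): ω blends the helpfulness and honesty objectives when scoring responses, while λ weights the controllability term in the overall training loss.

```python
# Hypothetical illustration of how λ and ω might enter training; both roles
# are assumptions made for exposition, not the paper's exact definitions.

def blended_score(helpfulness: float, honesty: float, omega: float = 0.4) -> float:
    """Blend two objective scores; ω = 0.4 is the balance point reported above."""
    return omega * helpfulness + (1.0 - omega) * honesty

def overall_loss(preference_loss: float, controllability_loss: float, lam: float) -> float:
    """Larger λ emphasizes controllability, at some cost to raw performance."""
    return preference_loss + lam * controllability_loss
```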

Case study

A case study demonstrated the controllability of CPO: the model's behavior was shown in different scenarios, generating responses tailored to the user's values. These experimental results demonstrate that the CPO algorithm is effectively controllable in terms of helpfulness, honesty, and harmlessness, and can improve the model's performance in a variety of scenarios.

Conclusion

This paper introduces a new method that addresses the performance trade-offs in aligning large-scale language models (LLMs). The method, called Controllable Preference Optimization (CPO), combines supervised fine-tuning and preference learning; evaluations confirmed that CPO exhibits excellent flexibility and performance in terms of helpfulness, honesty, and harmlessness.

Controllable Preference Optimization (CPO) is an important advance in LLM alignment, and further improvements and extensions are expected. For example, its performance could be improved further by introducing more complex alignment targets and new control techniques. Further research is also needed to validate its practicality and effectiveness for real-world applications and industrial deployment.

 