[DPO] A Method For Directly Matching Large-scale Language Models To User Preferences Without Using Reinforcement Learning
[DPO] A Method For Directly Matching Large-scale Language Models To User Preferences Without Using R ...
RLHF
See more▼