this post was submitted on 25 Nov 2023
LocalLLaMA


https://x.com/kylemarieb/status/1728281581306233036

New DeepMind paper just dropped.

Background: Direct Preference Optimization (DPO) is the simpler, more robust, higher-performing successor to RLHF; it is used in Zephyr, Intel’s new model, and others.

Identity-PO (IPO) simplifies DPO, removing its reliance on Elo scores (and the mathematical assumptions that come with them). The authors claim this solves overfitting, which is huge if true.
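To make the difference concrete, here is a minimal per-example sketch of the two losses as I understand them from the papers, not code from either one. The function names and the toy values of `h` are made up for illustration; `h` is the log-ratio margin between the preferred and rejected completions, measured against the reference model.

```python
import math

def log_ratio_margin(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
    """h = [log pi(y_w) - log pi_ref(y_w)] - [log pi(y_l) - log pi_ref(y_l)].

    All four arguments are summed log-probabilities of a completion.
    """
    return (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)

def dpo_loss(h, beta=0.1):
    # DPO: -log(sigmoid(beta * h)), written as log(1 + exp(-beta * h)).
    # It keeps shrinking as h grows, so the margin is pushed up without bound.
    return math.log1p(math.exp(-beta * h))

def ipo_loss(h, tau=0.1):
    # IPO: squared distance between h and the fixed target 1 / (2 * tau).
    # Overshooting the target is penalized, which is the regularizing effect.
    return (h - 1.0 / (2.0 * tau)) ** 2

if __name__ == "__main__":
    # Toy margins only, not results from any model.
    for h in (0.0, 5.0, 50.0):
        print(f"h={h:5.1f}  dpo={dpo_loss(h):.4f}  ipo={ipo_loss(h):.4f}")
```

The thing to notice is the shape: the DPO term always rewards a larger margin, while the IPO term has a minimum at `1 / (2 * tau)` and grows again past it.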

The trend towards simpler solutions and sounder mathematical grounding in alignment is fun to watch. These inscrutable matrices are looking awfully controllable, and the failure modes of the old methods were things like wedding party collapse.

[–] georgejrjrjr@alien.top 1 points 1 year ago

https://www.reddit.com/r/LocalLLaMA/comments/183d0t6/comment/kap6r1c/?utm_source=share&utm_medium=web2x&context=3

Since it's already been integrated into Hugging Face's trainer (per the linked comment above), you should be able to follow the Hugging Face alignment handbook, with one (or two) small modifications:
* Optionally: instead of using preference data from UltraFeedback or whoever, you can use Intel's trick and take the rejected samples from a weaker model, perhaps the model you're finetuning, or Llama 2 13b as Intel did. This just means that you're labeling (perhaps some subset of) your original training set examples as 'preferred' and the weaker model's completions of the same prompts as 'rejected'.
* Instead of using the DPO option in Hugging Face's training library (TRL), use the IPO option. That's it.
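
A rough sketch of the optional first step, assuming your SFT data is a list of dicts with `prompt` and `response` keys. `build_preference_pairs` and `toy_weak_model` are hypothetical names, and the weak model is just a stand-in for whatever generation call you actually use. The `prompt` / `chosen` / `rejected` layout is the one TRL's DPO trainer expects, as far as I know.

```python
def build_preference_pairs(sft_examples, weak_model_generate):
    """Label the original responses 'chosen' and a weaker model's
    completions of the same prompts 'rejected'.

    sft_examples: list of {"prompt": str, "response": str} (assumed layout)
    weak_model_generate: callable prompt -> completion (hypothetical stand-in)
    """
    pairs = []
    for example in sft_examples:
        rejected = weak_model_generate(example["prompt"])
        # Skip pairs where the weak model reproduced the reference answer:
        # there is no preference signal in an identical pair.
        if rejected.strip() == example["response"].strip():
            continue
        pairs.append(
            {
                "prompt": example["prompt"],
                "chosen": example["response"],
                "rejected": rejected,
            }
        )
    return pairs

def toy_weak_model(prompt):
    # Placeholder for a real generation call, e.g. to the model being finetuned.
    return "I am not sure."

if __name__ == "__main__":
    data = [
        {"prompt": "What is 2 + 2?", "response": "2 + 2 = 4."},
        {"prompt": "Are you sure?", "response": "I am not sure."},
    ]
    for pair in build_preference_pairs(data, toy_weak_model):
        print(pair)
```

For the second step, in the TRL versions I'm aware of the switch is `loss_type="ipo"` (an argument to `DPOTrainer` in older releases, a field on `DPOConfig` in newer ones), so check the docs for whichever version you have installed.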