This post was submitted on 25 Nov 2023

LocalLLaMA


Community for discussing Llama, the family of large language models created by Meta AI.


https://x.com/kylemarieb/status/1728281581306233036

New DeepMind paper just dropped.

Background: Direct Preference Optimization (DPO) is the simpler, more robust, higher-performing successor to RLHF, used in Zephyr, Intel’s new model, and others.

Identity-PO (IPO) simplifies DPO, removing its reliance on Elo-style scores (and the mathematical assumptions that come with them). The authors claim this solves DPO's overfitting problem, which is huge if true.
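For anyone who wants the gist without opening the paper, here is my sketch of the two objectives from memory (so double-check against the papers themselves). β and τ are the respective regularization strengths, y_w / y_l are the preferred / rejected completions, and π_ref is the frozen reference model:

```latex
% DPO: logistic (Bradley-Terry-style) loss on the log-likelihood-ratio margin
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]

% IPO: squared loss regressing the same margin to a fixed target 1/(2*tau)
\mathcal{L}_{\mathrm{IPO}}(\theta) =
  \mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\left(
      \log \frac{\pi_\theta(y_w \mid x)\,\pi_{\mathrm{ref}}(y_l \mid x)}
                {\pi_\theta(y_l \mid x)\,\pi_{\mathrm{ref}}(y_w \mid x)}
    - \frac{1}{2\tau}\right)^{2}\right]
```

The intuition: the sigmoid keeps rewarding the policy for pushing the margin toward infinity (which is where DPO's overfitting on near-deterministic preferences comes from), while the squared loss pins the margin to a finite target.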

The trend towards simpler solutions and sounder mathematical grounding in alignment is fun to watch. These inscrutable matrices are looking awfully controllable, and the failure modes of the old methods were things like wedding party collapse.

top 6 comments
[–] Wonderful_Ad_5134@alien.top 1 points 1 year ago (1 children)

So that means that we can get even better finetunes in the future? Noice!

[–] georgejrjrjr@alien.top 1 points 1 year ago

It’s better than that, imo, when you look at it in context.

Particularly in light of Intel’s finding the other day that DPO works well (probably better) without human preference data.

“Alignment” methods are getting simpler, easier, and more effective.

RLHF was a huge pain: there were a ton of hyperparameters to tweak, and human preference data is expensive to collect.

Constitutional AI (RLAIF) dealt with some of the cost and difficulty by using AI preference data, but it still left the need to collect preference data, and all the hyperparameter tweaking, intact.

DPO eliminated the superfluous reward model, simplifying things greatly, and making overfitting less pernicious.

Intel got rid of the need for human preference labels altogether.

IPO claims to fix overfitting altogether, while simplifying further.

I figure within a month Axolotl will grow a flag that means “and also IPO this,” with no additional cognitive overhead or hyperparameter tuning required, and, yes, the water line for model quality is going to go up.

[–] big_ol_tender@alien.top 1 points 1 year ago (1 children)

It’s already available in Hugging Face’s DPO trainer, too.
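For anyone searching later: it should just be a keyword switch on the existing trainer. A minimal sketch, assuming TRL's `DPOTrainer` still takes `loss_type` and `beta` directly and the usual `prompt`/`chosen`/`rejected` columns (newer versions may move these arguments into a config object, so check the current docs):

```python
# Minimal sketch (not a vetted recipe): same DPOTrainer, just switched to the IPO loss.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "gpt2"  # stand-in; use the model you're actually finetuning
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Toy preference data with the columns the trainer expects.
train_dataset = Dataset.from_dict({
    "prompt": ["What is 2 + 2?"],
    "chosen": ["2 + 2 = 4."],
    "rejected": ["2 + 2 is probably 5."],
})

trainer = DPOTrainer(
    model,
    ref_model,
    args=TrainingArguments(output_dir="ipo-out", per_device_train_batch_size=1,
                           num_train_epochs=1, report_to="none",
                           remove_unused_columns=False),  # keep the raw text columns
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    beta=0.1,          # plays the role of tau in the IPO loss
    loss_type="ipo",   # "sigmoid" is plain DPO
)
trainer.train()
```

Everything else (dataset columns, reference model, training args) is the same as a normal DPO run.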

[–] georgejrjrjr@alien.top 1 points 1 year ago

Ty, that’s helpful to know.

[–] AutomataManifold@alien.top 1 points 1 year ago (1 children)

So how do I use this on my own dataset?

[–] georgejrjrjr@alien.top 1 points 1 year ago

https://www.reddit.com/r/LocalLLaMA/comments/183d0t6/comment/kap6r1c/?utm_source=share&utm_medium=web2x&context=3

Since it's already been integrated into the Hugging Face trainer (per the linked comment above), you should be able to follow the Hugging Face alignment handbook, with one (or two) small modifications:
* Optionally: instead of using preference data from UltraFeedback or wherever, you can use Intel's trick and generate the rejections from a weaker model (perhaps the model you're finetuning, or Llama 2 13B as Intel did). This just means labeling (perhaps some subset of) your original training-set responses as 'chosen' and the weaker model's completions of the same prompts as 'rejected'; see the sketch after this list.
* Instead of using the default DPO loss in Hugging Face's TRL trainer, select the IPO option. That's it.
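Very roughly, the end-to-end pipeline could look something like the sketch below. Treat it as an illustration under my assumptions, not a recipe: the models and column names are placeholders, and I'm assuming TRL's `DPOTrainer` exposes `loss_type="ipo"` and `beta` the way I remember, so check the current docs.

```python
# Sketch: Intel-style preference pairs (no human labels) + IPO training via TRL.
# Chosen = your original responses; rejected = a weaker model's completions.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          TrainingArguments, pipeline)
from trl import DPOTrainer

# 1. Ordinary SFT-style data: prompt -> known-good response.
sft_data = Dataset.from_dict({
    "prompt": ["Explain what a hash map is in one sentence."],
    "response": ["A hash map stores key-value pairs, using a hash function for fast lookups."],
})

# 2. Build preference pairs: the weak model's completion becomes the 'rejected' side.
#    (Intel used Llama 2 13B as the weak model; gpt2 here just keeps the sketch small.)
weak = pipeline("text-generation", model="gpt2")

def to_pair(ex):
    out = weak(ex["prompt"], max_new_tokens=64)[0]["generated_text"]
    return {
        "prompt": ex["prompt"],
        "chosen": ex["response"],
        "rejected": out[len(ex["prompt"]):],  # strip the echoed prompt
    }

pref_data = sft_data.map(to_pair, remove_columns=["response"])

# 3. Train with the IPO loss instead of the DPO loss.
model_name = "gpt2"  # stand-in for the model you're actually finetuning
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

trainer = DPOTrainer(
    model,
    ref_model,
    args=TrainingArguments(output_dir="ipo-run", per_device_train_batch_size=1,
                           num_train_epochs=1, report_to="none",
                           remove_unused_columns=False),
    train_dataset=pref_data,
    tokenizer=tokenizer,
    beta=0.1,
    loss_type="ipo",
)
trainer.train()
```

The appeal of the Intel-style step 2 is that it needs no labeling budget at all; the quality of the pairs just depends on your original responses actually being better than the weak model's completions.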