this post was submitted on 25 Nov 2023
LocalLLaMA
https://www.reddit.com/r/LocalLLaMA/comments/183d0t6/comment/kap6r1c/?utm_source=share&utm_medium=web2x&context=3
Since it's already been integrated into Hugging Face's trainer (per the linked comment above), you should be able to follow the Hugging Face Alignment Handbook, with one (or two) small modifications:
* Optionally: instead of using preference data from UltraChat or whomever, you can use Intel's trick and take the 'rejected' samples from a weaker model, perhaps the model you're finetuning, or Llama 2 13b as Intel did. This just means that you're labeling (perhaps some subset of) your original training set examples as 'preferred' and the weaker model's completions of the same prompts as 'rejected'.
* Instead of using the DPO option in Hugging Face's training library (TRL), use the IPO option. That's it.
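
To make the two modifications concrete, here's a rough sketch in plain Python (no TRL needed to run it). It shows how the preference pairs would be assembled and how the IPO objective differs from the DPO one. The helper names (`build_preference_pairs`, `weak_generate`, `dpo_loss`, `ipo_loss`) and the `"response"` key are made up for illustration. The `prompt` / `chosen` / `rejected` column names are what TRL's DPO trainer expects as far as I know, and in TRL the switch itself should just be `loss_type="ipo"` instead of the default `"sigmoid"`, but check that against the version you have installed.

```python
import math
from typing import Callable, Dict, List


def build_preference_pairs(
    examples: List[Dict[str, str]],
    weak_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Label the original completion 'chosen' and a weaker model's completion 'rejected'.

    `examples` is assumed to hold {"prompt": ..., "response": ...} records from
    your original training set; `weak_generate` is a hypothetical stand-in for
    whatever call produces the weaker model's completion.
    """
    return [
        {
            "prompt": ex["prompt"],
            "chosen": ex["response"],
            "rejected": weak_generate(ex["prompt"]),
        }
        for ex in examples
    ]


def log_ratio_margin(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
) -> float:
    """How much more the policy prefers 'chosen' over 'rejected' than the reference does."""
    return (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )


def dpo_loss(margin: float, beta: float) -> float:
    """DPO: -log(sigmoid(beta * margin)), written as a numerically stable softplus."""
    x = beta * margin
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def ipo_loss(margin: float, beta: float) -> float:
    """IPO: squared distance between the margin and the target 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2


if __name__ == "__main__":
    data = [{"prompt": "2+2=", "response": "4"}]
    pairs = build_preference_pairs(data, weak_generate=lambda p: "5")
    print(pairs)

    margin = log_ratio_margin(-1.0, -3.0, -2.0, -2.5)
    print(dpo_loss(margin, beta=0.1), ipo_loss(margin, beta=0.1))
```

The practical difference: the DPO loss keeps shrinking as the margin grows, while the IPO loss is minimized at a finite margin of `1 / (2 * beta)` and penalizes going past it.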