this post was submitted on 28 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


I've only seen merging of same-upstream-pretrained-model-at-same-size.

At the very least, you should be able to merge any two models with the same tokenizer via element-wise addition of the log probs just before sampling. This would also unlock creative new samplers: e.g., instead of adding logprobs, one model's logprobs could constrain the other's in interesting ways.
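A minimal sketch of the logprob-addition idea, assuming two causal LMs that share a tokenizer (the model names here are just placeholders for any tokenizer-compatible pair):

```python
# Sketch: combine two models that share a tokenizer by adding their log probs
# per decoding step. Model names below are placeholders, not a recommendation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model_a = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model_b = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

ids = tok("The capital of France is", return_tensors="pt").input_ids

for _ in range(20):
    with torch.no_grad():
        logits_a = model_a(ids).logits[:, -1, :]
        logits_b = model_b(ids).logits[:, -1, :]
    # Adding log probs element-wise is equivalent to multiplying the two
    # distributions and renormalizing. Swapping this line for min(), masking,
    # or a weighted sum gives the "creative samplers" mentioned above.
    log_probs = torch.log_softmax(logits_a, dim=-1) + torch.log_softmax(logits_b, dim=-1)
    next_id = torch.distributions.Categorical(logits=log_probs).sample()
    ids = torch.cat([ids, next_id.unsqueeze(0)], dim=-1)

print(tok.decode(ids[0], skip_special_tokens=True))
```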

But two models with the same architecture and the same dataset will be heavily biased in the same direction, even if you take two different finetunes, so this approach seems like it has a low ceiling of potential.

Also, if you're just doing a linear interpolation of same-dimensioned weights, why not just collapse them all into a normal-sized model? I.e., 70B + 70B should still == 70B.
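For instance, a linear interpolation of two same-shaped checkpoints collapses into a single ordinary checkpoint of the same size (paths and the interpolation weight here are hypothetical):

```python
# Sketch: bake a linear interpolation of two identically shaped models into
# one normal-sized model. Paths are placeholders.
import torch
from transformers import AutoModelForCausalLM

alpha = 0.5  # interpolation weight (assumed)
model_a = AutoModelForCausalLM.from_pretrained("path/to/finetune-a")
model_b = AutoModelForCausalLM.from_pretrained("path/to/finetune-b")

sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
# Element-wise interpolation; requires every tensor shape to match exactly.
merged_sd = {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

model_a.load_state_dict(merged_sd)          # reuse A's architecture/config
model_a.save_pretrained("path/to/merged")   # still a single normal-sized model
```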

That said, you would get much more interesting models if you allowed merges of different architectures, trained from different initializations and on different datasets. I would think the research on "token healing" would allow you to merge any two models, even if they have different tokenizers.

This seems like a cool way forward.

FullOf_Bad_Ideas@alien.top 1 points 9 months ago

> I've only seen merging of same-upstream-pretrained-model-at-same-size.

Not anymore.

Here's a merge of Llama 2 13B and Llama 1 33B: https://huggingface.co/chargoddard/llama2-22b

30299578815310@alien.top 1 points 9 months ago

How does this work? I'm really confused at a conceptual level about how you merge models with different numbers of differently sized layers.

BayesMind@alien.top 1 points 9 months ago

Reading the readme, it sounds like they're transplanting attention heads that were either already the same dimension across both models, or they may have added a linear projection layer to make the dimensions match. Then they say they trained on 10M tokens to "settle in the transplant", which doesn't sound like enough to me, and they concur the model isn't useful until further training.
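If I had to guess what that projection looks like, it might be something like the sketch below. This is purely illustrative of the dimension-matching idea as I read the README, not code from that repo:

```python
# Hypothetical sketch: wrapping a donor layer (e.g. from a 33B model) so it
# can sit inside a host model with a different hidden size.
import torch
import torch.nn as nn

class ProjectedDonorLayer(nn.Module):
    def __init__(self, donor_layer: nn.Module, host_dim: int, donor_dim: int):
        super().__init__()
        self.down = nn.Linear(host_dim, donor_dim, bias=False)  # host width -> donor width
        self.donor = donor_layer                                 # transplanted block
        self.up = nn.Linear(donor_dim, host_dim, bias=False)     # donor width -> host width

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.up(self.donor(self.down(hidden_states)))

# The projection matrices start out random, which would explain why some
# training (the 10M-token "settle in the transplant" run) is needed at all.
```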