This post was submitted on 28 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


I've only seen merging of same-upstream-pretrained-model-at-same-size.

At the very least, you should be able to merge any two models that share a tokenizer via element-wise addition of the log probs just before sampling. This would also unlock creative new samplers, e.g. instead of adding log probs, maybe one model's log probs could constrain the other's in interesting ways.
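For concreteness, here's a minimal sketch of that kind of log-prob ensembling with two Hugging Face causal LMs. The model names are placeholders; the only assumption is that both checkpoints use the same tokenizer.

```python
# Toy log-prob ensemble: sum the two models' log probs element-wise before sampling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("model-a")               # placeholder names
model_a = AutoModelForCausalLM.from_pretrained("model-a")
model_b = AutoModelForCausalLM.from_pretrained("model-b")    # must share the tokenizer

@torch.no_grad()
def sample_next_token(input_ids):
    logp_a = torch.log_softmax(model_a(input_ids).logits[:, -1, :], dim=-1)
    logp_b = torch.log_softmax(model_b(input_ids).logits[:, -1, :], dim=-1)
    combined = logp_a + logp_b                       # element-wise sum of log probs
    probs = torch.softmax(combined, dim=-1)          # renormalize into a distribution
    return torch.multinomial(probs, num_samples=1)   # sample the next token

ids = tok("The merged ensemble says", return_tensors="pt").input_ids
print(tok.decode(sample_next_token(ids)[0]))
```

A "constraining" sampler would just swap the sum for something like a min or a masked product over the two distributions.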

But two models with the same architecture and the same dataset will be heavily biased in the same direction, even if you take two different finetunes, so this approach seems like it will have a low ceiling.

Also, if you're just doing a linear interpolation of same-dimensioned weights, why not collapse them all into a normal-sized model? I.e. 70B + 70B should still == 70B.
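The collapse would just be an element-wise lerp over two identically shaped state dicts, roughly like this sketch (the checkpoint paths and `alpha` are placeholders):

```python
# Collapse two same-shaped models into one by element-wise linear interpolation.
import torch

alpha = 0.5  # blend ratio, placeholder
sd_a = torch.load("finetune_a.pt", map_location="cpu")   # placeholder checkpoints
sd_b = torch.load("finetune_b.pt", map_location="cpu")

merged = {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}
torch.save(merged, "merged.pt")  # same parameter count as either input: 70B + 70B -> 70B
```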

That said, you would get much more interesting models if you allowed merges of different architectures, trained from different initializations and on different datasets. I would think the research on "token healing" could allow you to merge any two models, even ones with different tokenizers.

This seems like a cool way forward.

top 13 comments
[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago (1 children)

I've only seen merging of same-upstream-pretrained-model-at-same-size.

Not anymore.

Here's a merge of llama 2 13B and llama 1 33B https://huggingface.co/chargoddard/llama2-22b

[–] 30299578815310@alien.top 1 points 11 months ago (2 children)

How does this work? I'm really confused at a conceptual level about how you merge models with different numbers of differently sized layers.

[–] BayesMind@alien.top 1 points 11 months ago

Reading the readme, it sounds like they're running attention heads that were either already same-dimensioned across both models, or they may have included a linear projection layer to make the widths match. Then they say they trained on 10M tokens to "settle in the transplant", which doesn't sound like enough to me, and they agree the model isn't useful without further training.
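If the projection guess is right, it would look something like this toy sketch. The hidden sizes are the real llama-2-13b and llama-33b widths, but the adapter itself is my speculation about what the 10M-token "settling" would be training:

```python
# Speculative sketch: linear adapters bridging two models' hidden sizes.
import torch
import torch.nn as nn

donor_dim, host_dim = 5120, 6656  # llama-2-13b vs llama-33b hidden sizes
project_in = nn.Linear(donor_dim, host_dim, bias=False)   # widen before a transplanted block
project_out = nn.Linear(host_dim, donor_dim, bias=False)  # narrow back afterwards

h = torch.randn(1, 8, donor_dim)     # dummy hidden states from the 13B side
h_back = project_out(project_in(h))  # shape is preserved end to end
```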

[–] a_beautiful_rhind@alien.top 1 points 11 months ago (1 children)

Wonder how L1 65b would do with L2 70b.

[–] BayesMind@alien.top 1 points 11 months ago

Not with the kind of merging I've seen. But I remember a paper back in the day that suggested you could find high-dimensional axes within different models, and if you rotated the weights to align them, you could merge different models to your advantage and maintain knowledge from both seed models. That included models trained from different initializations.

I think the only reason this franken-merging works is that people are mostly merging finetunes of the same base, so those high-dimensional vectors are already aligned well enough for the merges to hold together.
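A quick way to eyeball that claim: compare corresponding weight tensors from two finetunes of the same base, which tend to sit very close to each other. The checkpoint paths here are placeholders.

```python
# Sanity check: cosine similarity between matching tensors of two finetunes.
import torch
import torch.nn.functional as F

sd_a = torch.load("finetune_a.pt", map_location="cpu")   # placeholder checkpoints
sd_b = torch.load("finetune_b.pt", map_location="cpu")

for name in list(sd_a)[:5]:                              # spot-check a few tensors
    sim = F.cosine_similarity(sd_a[name].flatten().float(),
                              sd_b[name].flatten().float(), dim=0)
    print(f"{name}: cosine similarity {sim.item():.4f}")
```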

[–] llama_in_sunglasses@alien.top 1 points 11 months ago (2 children)

At the very least, you should be able to merge any two models that share a tokenizer via element-wise addition of the log probs just before sampling. This would also unlock creative new samplers, e.g. instead of adding log probs, maybe one model's log probs could constrain the other's in interesting ways.

What, run two models at once? This doesn't seem cost-effective for what you'd get.

Most of the popular merges are weight mixes, where portions of different models are averaged in increasingly complex ways. Goliath is a layer splice: sections of Xwin and Euryale were chopped up and interleaved. This is the kind of merge I'm interested in, but getting useful models out of the process is way more art than science.
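In sketch form, a splice just builds a taller stack by pulling ranges of transformer blocks from each donor. The ranges below are made up for illustration, not Goliath's actual recipe.

```python
# Toy layer splice: interleave block ranges from two same-architecture donors.
def splice(layers_a, layers_b, ranges):
    """ranges: list of (source, start, end) picking blocks from donor 'a' or 'b'."""
    spliced = []
    for source, start, end in ranges:
        donor = layers_a if source == "a" else layers_b
        spliced.extend(donor[start:end])
    return spliced

# Hypothetical recipe: alternate overlapping chunks of the two donors.
recipe = [("a", 0, 16), ("b", 8, 24), ("a", 16, 32), ("b", 24, 40)]
```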

[–] FPham@alien.top 1 points 11 months ago (1 children)

Call it voodoo, not an art.

[–] llama_in_sunglasses@alien.top 1 points 11 months ago

I mean, voodoo and all forms of magic with a k are basically art in my opinion.

[–] BayesMind@alien.top 1 points 11 months ago (1 children)

This doesn't seem cost-effective for what you'd get.

I agree, which is why I'm bearish on model merges unless you're mixing model families (e.g. Mistral + Llama).

These franken-merges just interleave finetunes of the same base model, in a way where it'd make more sense to me to collapse all the params into a same-sized model via element-wise interpolation. So merging weights makes sense, but running the extra params like these X-120B models do, I don't see any payoff beyond what collapsing the weights would give.

[–] llama_in_sunglasses@alien.top 1 points 11 months ago

If I prompt a frankenmerge with the usual instruct dreck I use, it fails to answer numerous questions in a useful manner. However, it's a different story using them in chat mode or probably anything creative: the outputs can be coherent but feel way less AI-like.

[–] mcmoose1900@alien.top 1 points 11 months ago (1 children)

Git Re-Basin claims to do this.

But it's untested on large models. There is a branch for it in mergekit, as well as a Stable Diffusion implementation (which works fantastically as a regular merger).
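Roughly, the per-layer matching step looks like this toy sketch; random weights stand in for real layers, and actual Git Re-Basin propagates the permutations consistently through the whole network rather than matching one layer in isolation.

```python
# Toy Re-Basin-style step: permute model B's hidden units to match model A, then average.
import torch
from scipy.optimize import linear_sum_assignment

w_a = torch.randn(512, 768)   # one layer's weights from model A (out_dim x in_dim)
w_b = torch.randn(512, 768)   # same layer from model B, units possibly permuted

cost = -(w_a @ w_b.T)                           # negative similarity between output units
row, col = linear_sum_assignment(cost.numpy())  # best one-to-one unit matching
w_b_aligned = w_b[torch.as_tensor(col)]         # reorder B's units to line up with A's
merged = 0.5 * (w_a + w_b_aligned)              # now element-wise averaging is meaningful
```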

[–] BayesMind@alien.top 1 points 11 months ago

Re-Basin! I was trying to recall this, thank you. Can it mix model families, do you know? I thought it was only for identical architectures.