BayesMind


What a deluge lately!

DeepSeek 67B is an amazing coder. Their ~30B model was awesome. Now this!

Qwen 72B. Qwen has had awesome models, so I expect a lot from this one.

Qwen 1.8B for those on a diet.

Qwen Audio for arbitrary audio -> "reasoned" text, so not just transcription.

XVERSE 65B. I haven't played with this series; how is it?

AquilaChat2-70B. I'm not familiar with this one either.

Those are all heavy hitter foundation LLMs (and all from China).

One more noteworthy LLM is RWKV. The author keeps releasing increasingly large versions as they finish training. It's an RNN (no transformer attention) that competes with transformers per parameter count, but runs with constant memory and compute that scales linearly in context length, so long context windows stay cheap. It's also far lighter to train.
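To make the memory claim concrete, here's a toy sketch (nothing like the real RWKV math; the dimensions and update rule are made up) contrasting a fixed-size recurrent state with a transformer-style KV cache that grows per token:

```python
# Toy illustration (not the real RWKV formulation) of the memory argument:
# an RNN carries a fixed-size state, while a transformer-style KV cache grows
# with every token processed.
import numpy as np

d = 8                               # hidden size, illustrative
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)) * 0.1

state = np.zeros(d)                 # RNN state: size never changes
kv_cache = []                       # transformer cache: grows per token

for t in range(1000):               # "process" 1000 tokens
    x = rng.standard_normal(d)      # stand-in for a token embedding
    state = np.tanh(W @ state + x)  # recurrent update: memory stays O(d)
    kv_cache.append(x)              # cache append: memory grows O(t * d)

print(state.shape, len(kv_cache))   # (8,) vs 1000 entries
```

The recurrent state stays the same size no matter how many tokens you feed it, which is the whole appeal for long contexts.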

Then for Images, Stability has been on a roll:

Stable Video Diffusion, i.e. Stable Diffusion for VIDEO. First open-source video model I've seen.

SD-XL Turbo: Stable Diffusion XL, but fast enough to spit out an image per keystroke as you type.
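The speed comes from single-step sampling. If I recall the model card correctly, usage via diffusers looks roughly like this (treat it as a sketch, and it assumes a CUDA GPU):

```python
# Rough sketch of SDXL Turbo via diffusers, following the model card:
# a single denoising step and no classifier-free guidance.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16"
)
pipe.to("cuda")

image = pipe(
    prompt="a cinematic photo of a fox in the snow",
    num_inference_steps=1,      # the single step is what makes it keystroke-fast
    guidance_scale=0.0,
).images[0]
image.save("fox.png")
```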

Stability also has an ultrafast upscaler that should come out any day now (Enhance!).

Russia's Kandinsky is staying fresh with updates.

Fuyu is a noteworthy img->text model because of its simple architecture: it tokenizes image patches directly and feeds them to the decoder (no separate CNN/ViT vision encoder), which lets it handle arbitrarily sized images.
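The core idea, very roughly (my own toy sketch, not Fuyu's code; patch size and dimensions are just illustrative): chop the image into patches and linearly project each patch straight into the token embedding space.

```python
# Toy sketch of "patches as tokens": split an arbitrarily sized image into
# patches and linearly project each one into the embedding space the text
# tokens live in. Patch size and dimensions are illustrative.
import torch
import torch.nn as nn

patch, d_model = 30, 4096
proj = nn.Linear(3 * patch * patch, d_model)      # one linear layer, no CNN/ViT encoder

def image_to_tokens(img: torch.Tensor) -> torch.Tensor:
    # img: (3, H, W), with H and W any multiple of `patch`
    c, h, w = img.shape
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)  # (3, H/p, W/p, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    return proj(patches)                          # (num_patches, d_model) "image tokens"

tokens = image_to_tokens(torch.rand(3, 300, 480))
print(tokens.shape)                               # torch.Size([160, 4096])
```

Because there's no encoder with a fixed input resolution, any image that patchifies cleanly just becomes more or fewer tokens.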

For audio:

Whisper v3 recently landed for awesome transcription.

Facebook's MusicGen for music.

Some Text-To-Speech that I'm forgetting now.

For making use of all this:

UniteAI is an OSS project I've been involved with to plug local models (LLMs, STT, RAG, etc.) into your text editor of choice. Updates forthcoming.

llama.cpp is a leader in running arbitrary LLMs, especially heavily quantized ones that can run on CPU+RAM instead of GPU+VRAM.
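For anyone who hasn't tried it, here's roughly what that looks like through the llama-cpp-python bindings (one way to drive llama.cpp from Python; the model path is a placeholder for whatever quantized GGUF file you have, and the parameters are just illustrative):

```python
# Minimal sketch via the llama-cpp-python bindings. The model path is a
# placeholder for any quantized GGUF file on disk.
from llama_cpp import Llama

llm = Llama(model_path="./models/some-model.Q4_K_M.gguf", n_ctx=4096)
out = llm("Q: Name three open LLMs.\nA:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```

A 4-bit 7B model fits in a few GB of plain RAM, which is the appeal.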

ComfyUI is tying all the image+video gen into a web UI.

Lastly, I'm struggling to find an example, but I've seen some GitHub projects tie together the latent spaces of multiple models with different modalities to create multimodal models. They do this by lightly training a projection layer between the latent spaces. So we could be close to amazing multimodal models.
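The trick, as I understand it, is in the spirit of this toy sketch (dimensions, loss, and data are all placeholders; in practice the target is usually the LLM's own language-modeling loss with the projected latents prepended, not MSE):

```python
# Toy version of the projection-layer trick: freeze two pretrained models with
# different latent sizes and train only a small linear map between them
# (roughly the LLaVA-style adapter recipe). The "latents" below are random
# tensors standing in for frozen-model outputs.
import torch
import torch.nn as nn

d_vision, d_text = 1024, 4096              # e.g. vision-encoder dim -> LLM embedding dim
projection = nn.Linear(d_vision, d_text)   # the only trainable parameters
optimizer = torch.optim.AdamW(projection.parameters(), lr=1e-4)

for step in range(100):
    vision_latents = torch.randn(8, 16, d_vision)  # stand-in for frozen encoder output
    target_latents = torch.randn(8, 16, d_text)    # stand-in for the LLM's input embeddings
    loss = nn.functional.mse_loss(projection(vision_latents), target_latents)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because only the projection trains, the compute cost is tiny compared to pretraining either model.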

I know I'm missing tons, but these are the highlights on my radar. How about you?

[–] BayesMind@alien.top 1 points 11 months ago

rebasin! I was trying to recall this, thank you. Can it mix model families, do you know? I thought it was just for identical architectures.

[–] BayesMind@alien.top 1 points 11 months ago

Not for the kind of merging I've seen. But I remember a paper back in the day suggesting you could find high-dimensional axes within different models, and if you rotated the weights into alignment, you could merge different models to your advantage and retain knowledge from both seed models, including models trained from different initializations.

I think the only reason this franken-merging works is that people are mostly merging finetunes of the same base, so those high-dimensional axes are already aligned well enough for the merges to work.
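For what that alignment looks like mechanically, here's a rough permutation-matching sketch in the spirit of Git Re-Basin (two toy 2-layer MLPs; biases ignored and all shapes illustrative):

```python
# Toy permutation alignment: match hidden units of two 2-layer MLPs so their
# weights line up, then average. Real models need this done per layer.
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_and_merge(W1_a, W2_a, W1_b, W2_b):
    # Cost of assigning hidden unit j of model B to unit i of model A,
    # based on similarity of their incoming weight rows.
    cost = -W1_a @ W1_b.T
    _, perm = linear_sum_assignment(cost)
    W1_b_aligned = W1_b[perm]        # permute rows of layer 1
    W2_b_aligned = W2_b[:, perm]     # and the matching columns of layer 2
    return (W1_a + W1_b_aligned) / 2, (W2_a + W2_b_aligned) / 2

rng = np.random.default_rng(0)
W1_a, W2_a = rng.standard_normal((64, 32)), rng.standard_normal((10, 64))
W1_b, W2_b = rng.standard_normal((64, 32)), rng.standard_normal((10, 64))
W1_m, W2_m = align_and_merge(W1_a, W2_a, W1_b, W2_b)
print(W1_m.shape, W2_m.shape)        # (64, 32) (10, 64)
```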

[–] BayesMind@alien.top 1 points 11 months ago

Reading the README, it sounds like they're running attention heads that were either already the same dimension across both models, or they added a linear projection layer to make them line up. Then they say they trained on 10M tokens to "settle in the transplant", which doesn't sound like enough to me, and they concur the model isn't useful without further training.

[–] BayesMind@alien.top 1 points 11 months ago (1 children)

> This doesn't seem cost-effective for what you'd get.

I agree, which is why I'm bearish on model merges unless you're mixing model families (e.g. Mistral + Llama).

These franken-merges just interleave finetunes of the same base model; it would make more sense to me to collapse all the params into a same-sized model via element-wise interpolation. So merging weights makes sense, but for running the params in parallel like these X-120B stacks, I don't see any payoff beyond what you'd get by collapsing the weights.

 

> I've only seen merging of same-upstream-pretrained-model-at-same-size.

At the very least, you should be able to merge any two models with the same tokenizer via element-wise addition of the log probs just before sampling. This would also unlock creative new samplers; e.g. instead of adding logprobs, maybe one model's logprobs could constrain the other's in interesting ways.
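Concretely, something like this sketch (GPT-2 and DistilGPT-2 are just convenient stand-ins that happen to share a vocab; in practice you'd pick two models you actually care about). Adding logprobs is a product of experts: the softmax over the sum is proportional to the product of the two distributions.

```python
# Sketch of merging at the sampler: two models with a shared tokenizer have
# their last-token distributions combined element-wise before sampling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model_a = AutoModelForCausalLM.from_pretrained("gpt2")
model_b = AutoModelForCausalLM.from_pretrained("distilgpt2")  # same vocab as gpt2

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    logits_a = model_a(ids).logits[:, -1, :]
    logits_b = model_b(ids).logits[:, -1, :]

# log p_a + log p_b, then renormalize and sample
combined = torch.log_softmax(logits_a, dim=-1) + torch.log_softmax(logits_b, dim=-1)
next_id = torch.multinomial(torch.softmax(combined, dim=-1), num_samples=1)
print(tok.decode(next_id[0]))
```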

But two models with the same architecture and the same dataset will be heavily biased in the same direction, even if you take two different finetunes, so this approach seems like it has a low ceiling.

Also, if you're just doing a linear interpolation of same-dimensioned weights, why not collapse them all into a normal-sized model? I.e. 70B + 70B should still equal 70B.
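The collapse is literally just element-wise interpolation of the state dicts, along these lines (paths and alpha are placeholders; it only works when the two checkpoints share an architecture, so the key names and shapes match):

```python
# Collapse two same-architecture checkpoints into one normal-sized model by
# element-wise interpolation of their weights.
import torch

alpha = 0.5
sd_a = torch.load("finetune_a.pt", map_location="cpu")
sd_b = torch.load("finetune_b.pt", map_location="cpu")

merged = {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}
torch.save(merged, "merged.pt")
```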

That said, you would get much more interesting models if you could merge different architectures, trained from different initializations and on different datasets. I would think the research on "token healing" would let you merge any two models, even if they have different tokenizers.

This seems like a cool way forward.

[–] BayesMind@alien.top 1 points 11 months ago

We need a different flair for New Models vs New Merge/Finetune

[–] BayesMind@alien.top 1 points 11 months ago

If you want to benchmark the largest open source model, Google recently released a 1.6T model: https://huggingface.co/google/switch-c-2048

 

I maintain the uniteai project and have implemented a custom backend for serving transformers-compatible LLMs. (That file is actually a great ultra-lightweight server if transformers satisfies your needs: one clean file.)
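For flavor, that style of one-file server boils down to something like this (a sketch, not the actual uniteai code; the route, model, and parameter names are placeholders):

```python
# Sketch of a one-file transformers-backed completion server: FastAPI in
# front, a transformers pipeline behind, a single route.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
generator = pipeline("text-generation", model="gpt2")  # swap in any transformers-compatible model

class GenRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@app.post("/generate")  # route name is made up
def generate(req: GenRequest):
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"text": out[0]["generated_text"]}

# run with: uvicorn server:app --port 8000
```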

I'd like to add GGML etc., and I haven't reached for cTransformers. Instead of building another bespoke server, it'd be nice if a standard were starting to emerge.

For instance, many models have custom instruct templates; it'd be nice if a backend handled all of that for me.
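transformers' chat templates already cover part of this, for what it's worth, since the template ships with the tokenizer (the model here is just an example):

```python
# One way a backend can absorb per-model instruct formats: apply the chat
# template bundled with the model's tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
messages = [
    {"role": "user", "content": "Summarize RWKV in one sentence."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # wrapped in the [INST] ... [/INST] format this model expects
```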

I've used llama.cpp, but I'm not aware of it handling instruct templates. Is it worth building on top of? Is it too llama-focused? Production-worthy? (It bills itself as "mainly for educational purposes".)

I've considered oobabooga, but I'd just like a best-in-class server without all the frontend fixings and dependencies.

Is OpenAI's API signature something people are trying to build against as a standard?
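What draws me to that option: if a local server mimics the /v1 routes, the official client just points at it unchanged (the base_url and model name below are placeholders for whatever you run locally):

```python
# Sketch of the OpenAI signature as a de-facto standard: the official client
# talking to a local OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```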

Any recommendations?