Illustrious_Sand6784


Didn't see any posts about these models so I made one myself.

This first set of models was trained on 288B high-quality tokens; it will be interesting to see whether the 51B and 102B models hold up. Commercial use is allowed without needing authorization.

https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/README-EN.md

(Chinese) https://github.com/IEIT-Yuan/Yuan-2.0

Paper: https://arxiv.org/abs/2311.15786

Huggingface download links

https://huggingface.co/pandada8/Unofficial-Yuan-2.0-2B

https://huggingface.co/pandada8/Unofficial-Yuan-2.0-51B

https://huggingface.co/pandada8/Unofficial-Yuan-2.0-102B
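If anyone wants to grab one of these mirrors without cloning through git-lfs, here's a minimal sketch using huggingface_hub's snapshot_download. The repo id comes from the links above; the local path is just a placeholder.

```python
# Minimal sketch: pull one of the unofficial Yuan 2.0 mirrors via huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="pandada8/Unofficial-Yuan-2.0-2B",  # smallest of the three mirrors above
    local_dir="./yuan-2.0-2b",                  # example destination, change as needed
)
```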

Here's the second set of models I found. The 7B and 65B were trained on 2.6T tokens, and the 13B on 3.2T. The 65B model supports up to 16K context, while the two smaller ones support up to 8K.

https://huggingface.co/xverse/XVERSE-65B

https://huggingface.co/xverse/XVERSE-13B

https://huggingface.co/xverse/XVERSE-7B

These models cover over 40 human languages plus several programming languages. Commercial use is allowed, but you have to submit an application form.
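For anyone who wants to try these, here's a rough sketch of loading the 7B through the standard transformers API. I haven't run these particular checkpoints, so treat the trust_remote_code and dtype settings as assumptions and defer to the model cards.

```python
# Rough sketch: load the smallest XVERSE checkpoint with transformers and generate.
# trust_remote_code=True is an assumption (custom modeling code in the repo).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "xverse/XVERSE-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumes a GPU with bf16 support
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer("XVERSE-7B is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```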

[–] Illustrious_Sand6784@alien.top 1 points 11 months ago (1 children)

No, they haven't. The 220B model has always shown the message above, while the 600B model shows a message similar to the one you described.

 

https://huggingface.co/deepnight-research

I'm not affiliated with this group at all; I was just randomly looking for any new big merges and found these.

100B model: https://huggingface.co/deepnight-research/saily_100B

220B model: https://huggingface.co/deepnight-research/Saily_220B

600B model: https://huggingface.co/deepnight-research/ai1

They make some big claims about their models' capabilities, but the two best ones aren't available to download. Maybe we can help convince them to release them publicly?

[–] Illustrious_Sand6784@alien.top 1 points 11 months ago

Goliath-120B (specifically the 4.85 BPW quant) is the only model I use now; I don't think I can go back to using a 70B model after trying this.

[–] Illustrious_Sand6784@alien.top 1 points 11 months ago

I'm quite impressed with Goliath so far, so thank you and everyone who helped you make it.

The plan moving forward, in my opinion, is to finetune this model (preferably a full finetune) so that the stitched layers get to know each other better. Hopefully I can find the compute to do that soon :D

My suggestion is that it would be best to try a 4-bit QLoRA fine-tune first and see how it performs before spending the money and compute required for a full fine-tune of such a massive model, only to have it possibly turn out mediocre.
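To make that suggestion concrete, here's a rough sketch of what a 4-bit QLoRA setup looks like with bitsandbytes + peft. The repo id, target modules, and hyperparameters are placeholders, not a recipe I've validated on Goliath.

```python
# Rough QLoRA sketch: 4-bit NF4 base via bitsandbytes, LoRA adapters via peft.
# Model id, target modules, and hyperparameters are placeholders.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "alpindale/goliath-120b"  # assumed repo id, check the actual model card

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical Llama-style names
)
model = get_peft_model(model, lora_config)

# Only the adapter weights train; the 4-bit base stays frozen.
# (Dataset and Trainer setup omitted from this sketch.)
model.print_trainable_parameters()
```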

On a related note, I've been working on LLM-Shearing lately, which would essentially enable us to shear a transformer down to much smaller sizes while preserving accuracy. Goliath-120B came to be as an experiment in moving in the opposite direction of shearing. I'm now wondering if we can shear a finetuned Goliath-120B back down to ~70B and end up with a much better 70B model than the existing ones. This would of course be prohibitively expensive, as we'd need to do continued pretraining after the shearing/pruning process. A more likely approach, I believe, is shearing Mistral-7B down to ~1.3B and performing continued pretraining on about 100B tokens.
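This isn't the LLM-Shearing method itself (which learns pruning masks over widths, heads, and layers and then does continued pretraining), but just to illustrate the general idea of carving a smaller model out of a bigger one, here's a toy depth-pruning sketch on a Llama-family model. The model id and the keep-every-other-block heuristic are purely illustrative.

```python
# Toy structured-pruning sketch: drop whole decoder layers from a Llama-family model.
# NOT the LLM-Shearing algorithm; only an illustration of shrinking a transformer
# before the continued pretraining (the ~100B tokens mentioned above) that makes
# such a pruned model useful again.
import torch
from torch import nn
from transformers import AutoModelForCausalLM

model_id = "mistralai/Mistral-7B-v0.1"  # example base model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

layers = model.model.layers                           # the stack of decoder blocks
keep = [i for i in range(len(layers)) if i % 2 == 0]  # naive heuristic: keep every other block
model.model.layers = nn.ModuleList(layers[i] for i in keep)
model.config.num_hidden_layers = len(keep)

# The pruned model is badly degraded at this point; continued pretraining is what
# recovers quality in the shearing setup.
print(f"kept {len(keep)} of {len(layers)} layers")
```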

I do think small models have a lot of good uses already and still have plenty of potential, especially ones that were created or fine-tuned for a specific task. I'm also sure we'll get a general purpose 1-10B parameter model that's GPT-4 level within 5 years, but I really don't see any 7B parameter model outperforming a good Llama-2-70B fine-tune before the second half of 2024 unless there's some big breakthrough. So I'd really encourage you to do some more research in this direction, as there's plenty of others working on improving small models, but barely anyone doing research on improving large models.

I know it takes a lot of money and compute to fine-tune large models, that the time and cost grow with model size, and that most people can't run large models locally. Those are the main reasons large models don't get as much attention as smaller ones. But come on: there's a new small base model every week now, while I was stuck with LLaMA-65B for about half a year and then with Llama-2-70B for months. The only better model, which was released very recently (and might not actually be much better, still waiting for the benchmarks...), isn't really even a base model, since it's a merge of two fine-tuned Llama-2-70B models. Mistral-70B may not even be available to download and won't be under a free license, and Yi-100B will be completely proprietary and unavailable to download, which leaves no upcoming model besides Llama-3-70B that's likely to outperform Llama-2-70B.