This post was submitted on 28 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


There has been a lot of movement around and below the 13B parameter bracket in the last few months, but it's wild to think the best 70B models are still Llama 2-based. Why is that?

We have 13B models like the 8-bit bartowski/Orca-2-13b-exl2 approaching or even surpassing the best 70B models now.

top 35 comments
[–] jeffwadsworth@alien.top 1 points 9 months ago

The 13b’s don’t surpass the 70b Airoboros model. Not even close.

[–] BeginningMacaroon374@alien.top 1 points 9 months ago

You need at least 4 A100s for inference.

[–] arekku255@alien.top 1 points 9 months ago

There's no point in releasing a model that hardly anyone can run.

13B and 7B can be run by the majority of users, 70B not so much...

[–] zBlackVision11@alien.top 1 points 9 months ago (3 children)

Qwen 72B is coming in 2 days 👍 Will be a real beast.

[–] FaustBargain@alien.top 1 points 9 months ago (1 children)

Qwen 72b

I can't seem to find anything about Qwen 72B except two tweets from a month ago that said it was coming out. Who makes it? What's it trained on? Any details?

[–] Thireus@alien.top 1 points 9 months ago

Curiously, none of the previous comment's upvoters has provided an answer to your question.

[–] a_beautiful_rhind@alien.top 1 points 9 months ago (1 children)

I heard that if it comes out, it might finally be worth exllama supporting it. I heard the 14B was fairly strong.

[–] zBlackVision11@alien.top 1 points 9 months ago

Yes, I also hope it gets exllamav2 support. Here is an issue regarding it: (Qwen model not supported) · Issue #160 · turboderp/exllamav2 (github.com)

[–] ninjasaid13@alien.top 1 points 9 months ago

2 days? Bro, if they said November and haven't released it by now, it's not two days.

[–] Antique_Elk9380@alien.top 1 points 9 months ago

Diminishing returns and cost of compute.

If people saw better returns from larger models, there would be more.

[–] thereisonlythedance@alien.top 1 points 9 months ago (2 children)

I've been training a lot lately, mostly on RunPod, a mix of fine-tuning Mistral 7B and training LoRAs and QLoRAs on 34Bs and 70Bs. My main takeaway is that the LoRA outcomes are just... not so great. Whereas I'm very happy with the Mistral fine-tunes.

I mean, it's fantastic we can tinker with a 70B at all, but it doesn't matter how good your dataset is, you just can't have the same impact as you can with a full finetune. I think this is why model merging/frankensteining has become popular, it's an expression of the limitations of LoRA training.

Personally, I have high hopes for a larger Mistral model (in the 13-20B range) that we can still do a full fine-tune on. Right now, between my own specific tunes of Mistral and some of the recent external tunes like Starling, I feel like I'm close to having the tools I want/need. But Mistral is still 7B; no matter how well it's tuned, it will still get a little muddled at times, particularly with longer-term dependencies.
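For anyone curious what the QLoRA runs described above typically look like in code, here is a minimal sketch using transformers, peft, and bitsandbytes. The model name, rank, and target modules are illustrative assumptions on my part, not the commenter's actual settings, and a real run would still need a dataset and a training loop on top of this:

```python
# Minimal QLoRA setup sketch (illustrative hyperparameters, not the commenter's).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-2-70b-hf"  # any causal LM; 70B still needs multiple GPUs even in 4-bit

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize the frozen base weights to 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
model = prepare_model_for_kbit_training(model)

lora = LoraConfig(
    r=64,                                    # rank controls how many parameters the adapter trains
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # typically well under 1% of the base model
```

From there you would plug the adapted model into a standard Trainer (or trl's SFTTrainer) with your dataset; only the small adapter matrices get updated, which is exactly why the outcome can't match a full fine-tune.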

[–] Armym@alien.top 1 points 9 months ago (1 children)

Do you think that finetuning models with more parameters requires more data to actually do something?

[–] thereisonlythedance@alien.top 1 points 9 months ago

With a full finetune I don't think so -- the LIMA paper showed that 1,000 high-quality samples are enough with a 65B model. With QLoRA and LoRA, I don't know. The number of parameters you're affecting is set by the rank you choose. It's important to get the balance between the rank, dataset size, and learning rate right. Style and structure are easy to impart, but other things not so much. I often wonder how clean the merge process actually is. I'm still learning.
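To put a rough number on "the parameters you're affecting is set by the rank": a LoRA adapter on a weight matrix of shape (d_out, d_in) adds r * (d_in + d_out) trainable parameters. A back-of-the-envelope sketch, with layer dimensions approximated from Llama 2 70B and only the 8192x8192 attention projections adapted:

```python
# LoRA adds two low-rank matrices A (r x d_in) and B (d_out x r) per adapted weight,
# so trainable params per matrix = r * (d_in + d_out). Dimensions below are approximate.
def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

hidden, layers, rank = 8192, 80, 64
per_layer = 2 * lora_params(hidden, hidden, rank)   # q_proj and o_proj only, for simplicity
total = layers * per_layer
print(f"~{total / 1e6:.0f}M trainable params at rank {rank}, "
      f"vs ~70,000M frozen parameters in the base model")
```

That works out to roughly 170M trainable parameters, a fraction of a percent of the base model, which is one way to see why a LoRA can't have the same impact as a full fine-tune.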

[–] Vilzuh@alien.top 1 points 9 months ago

I have been trying to learn about fine-tuning and LoRA training for the past couple of weeks, but I'm having trouble finding easy enough resources to learn from. Could you give me some pointers to what I can read to get started with fine-tuning Llama 2 or Mistral?

I have tried training quantized models locally with oobabooga and llama.cpp and I also have access to runpod. Really appreciate any info!

[–] a_beautiful_rhind@alien.top 1 points 9 months ago (1 children)

What do you mean? Someone just posted 100B, 200B, and 600B models, and several 120B models have been released in the past couple of weeks.

[–] Slimxshadyx@alien.top 1 points 9 months ago

Those models can’t be accessed, they say it’s “too dangerous to be released”

[–] __JockY__@alien.top 1 points 9 months ago (3 children)

It took 3,311,616 GPU-hours to pretrain the Llama 2 family (roughly 1.7M of those for the 70B base model alone). At $1/hour for an A100 GPU you'd spend just over $3.3M, and on a single GPU it would take approximately 380 years.

Scale that across 10,000 GPUs and you're looking at about 2 weeks of wall-clock time, for the same multi-million-dollar bill.

Fine-tuning is much, much faster and cheaper.
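The arithmetic above is easy to sanity-check. A quick back-of-the-envelope version, assuming $1 per GPU-hour and perfect scaling across GPUs:

```python
gpu_hours = 3_311_616                         # reported A100-hours for Llama 2 pretraining
cost_usd = gpu_hours * 1.00                   # ~$3.3M at $1/hour
years_on_one_gpu = gpu_hours / (24 * 365)     # ~378 years on a single GPU
days_on_10k_gpus = gpu_hours / 10_000 / 24    # ~14 days across 10,000 GPUs

print(f"${cost_usd / 1e6:.1f}M, {years_on_one_gpu:.0f} years on 1 GPU, "
      f"{days_on_10k_gpus:.0f} days on 10,000 GPUs")
```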

[–] Exotic-Estimate8355@alien.top 1 points 9 months ago (2 children)

$1/hour for an A100? Where? I can barely get one in GCE and it's almost $4/hr.

[–] toothpastespiders@alien.top 1 points 9 months ago

I'd like to know too if there's one for exactly $1. Even half a buck or so difference builds up over time.

But runpod's close at least, at $1.69/hour.

[–] __JockY__@alien.top 1 points 9 months ago

Yes, but you don't have Meta's purchasing power to rent 10,000 GPUs for a month. Economies of scale, my friend!

[–] __JockY__@alien.top 1 points 9 months ago

I'll reply to myself!

It's not just about GPU expense. You need a small team of ML data scientists. You need access to (or a way to scrape/generate) a mind-bogglingly broad dataset. You need to clean, normalize, and prepare the dataset. All of this takes a huge amount of expertise, time and money. I wouldn't be at all surprised if the auxiliary costs surpassed the GPU rental cost.

So the main answer to your question "Why is no one releasing 70b models?" is: it's really, really, really expensive. Other parts of the answer are: lack of expertise, difficulty of generating a good dataset, and probably a hundred things I haven't thought of.

But mainly it just comes down to cost. I bet you wouldn't see any change from $5,000,000 if you wanted to make your own new 70b base model.

[–] ninjasaid13@alien.top 1 points 9 months ago (2 children)

How much would that be in H100s or H200s?

[–] MerePotato@alien.top 1 points 9 months ago

About tree fiddy

[–] __JockY__@alien.top 1 points 9 months ago
[–] WaterPecker@alien.top 1 points 9 months ago

Who pays for all this training on all these models we see knocking about? And I don't mean the ones released by the big companies. Like, who has the resources to train a 70B model? Like one of the guys below said, 1.7 million GPU-hours, for example; that's pretty friggin expensive, no?

[–] ChiefBigFeather@alien.top 1 points 9 months ago

13B models magically being better than 70B models is a myth. Most of the 7B or 13B model headlines are just clickbait, the models being good at benchmarks because they were trained on benchmark data.

Try Airo 70B 3.1.2; it is much, much better (for general purposes) than 99% of models out there. Yi-based models are strong if you want the larger context.

[–] ambient_temp_xeno@alien.top 1 points 9 months ago

Orca still memeing strong.

[–] Markon101@alien.top 1 points 9 months ago

Google just released a 1.8T model that's partially trained. Would need a ton of H100s though just to run it, forget training it lol.

[–] SativaSawdust@alien.top 1 points 9 months ago

Look at the market share of video cards with more than 100GB of VRAM.

[–] candre23@alien.top 1 points 9 months ago (1 children)

It's adorable that you think any 13b model is anywhere close to a 70b llama2 model.

[–] alcalde@alien.top 1 points 9 months ago

Oooh! Model fight! I'll try it out and post results later.

[–] extopico@alien.top 1 points 9 months ago

The problem with 70B is that it is incrementally better than smaller models, but is still nowhere near competitive with GPT-4, so it is stuck in no man’s land.

Once we finally get an open source model or architecture that can spar even with GPT-4, let alone 5, there will be much more interest in large models.

Regarding Falcon Chat 180B, it's no better in my tests and for my use cases than a fine-tuned Llama 2 70B, which is a shame. It makes me think that there is something fundamentally wrong with Falcon, besides the laughably small context window.

[–] obvithrowaway34434@alien.top 1 points 9 months ago

Mistral has already shown that it's mostly about the data rather than the parameter count. So why waste loads of money and time on training something that no average consumer can run locally?

[–] alcalde@alien.top 1 points 9 months ago

70B models can't even run on a CPU-only machine with 32GB of RAM.
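The rough memory math behind that claim, assuming a GGUF-style 4-bit quantization at about 4.5 bits per weight once scales are included (the exact figure varies by quant):

```python
params = 70e9                                   # 70B weights
bits_per_weight = 4.5                           # ~Q4 quantization including scales (approximate)
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB for the weights alone, before KV cache and OS overhead")
# ~39 GB, which already overflows a 32 GB machine even at 4-bit.
```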

[–] DataLearnerAI@alien.top 1 points 9 months ago

Alibaba open-sourced a 72B model called Qwen-72B: Qwen/Qwen-72B · Hugging Face

It supports Chinese and English. The performance on MMLU is remarkable.
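
For anyone who wants to try it, a minimal loading sketch with transformers is below. Hardware is the catch: the full-precision weights need well over 100 GB of GPU memory, so quantized variants are the realistic option for most people, and trust_remote_code is required for the original Qwen releases.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-72B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",        # bf16 on capable GPUs
    device_map="auto",         # shard across available GPUs (requires accelerate)
    trust_remote_code=True,
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```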