this post was submitted on 30 Oct 2023
LocalLLaMA
Community to discuss Llama, the family of large language models created by Meta AI.
No fucking way. GPT-3 has 175B params. In no way, shape, or form could they have discovered the "secret sauce" to make an ultra-smart 20B model. The TruthfulQA paper suggests that bigger models are more likely to score worse, and ChatGPT's TruthfulQA score is impressively bad. I think the papers responsible for the impressive open-source models are at most 12-20 months old. The Turbo version is probably quantized, that's all.
I think it's plausible. GPT-3.5 isn't ultra smart. It's very good most of the time, but it has clear limitations.
Seeing what Mistral achieved with 7B, I'm sure we can get something similar to GPT-3.5 at 20B given state-of-the-art training and data. I'm sure OpenAI is also using some tricks that haven't been released to the public.
Scaling laws suggest that you can reduce parameter count by increasing the number of training tokens. There is a limit, however, and it seems to sit at around 32% of the original model size: https://www.harmdevries.com/post/model-size-vs-compute-overhead/
So that would put the resulting model at around 56B (32% of 175B). Not sure how they got it down further, maybe through quantization.
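Quick sanity check on that figure, assuming Turbo started from GPT-3's published 175B and taking the ~32% floor from the linked post at face value:

```python
# Back-of-the-envelope check: smallest "sensible" model size per the ~32%
# compute-overhead limit cited above, starting from GPT-3's 175B parameters.
ORIGINAL_PARAMS = 175e9   # GPT-3 parameter count (assumed starting point)
MIN_FRACTION = 0.32       # rough floor from the linked harmdevries.com post

print(f"~{ORIGINAL_PARAMS * MIN_FRACTION / 1e9:.0f}B")  # -> ~56B
```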
The scaling laws have quite a bit more wiggle room if you're willing to accept less bang for your buck at training time. They mention that it isn't a hard threshold but more like a region where you can expect diminishing returns, which is true. What the original Chinchilla paper didn't emphasize is that those diminishing returns aren't really "diminishing": yes, you have to put in more training compute to reach a given level of quality, but training compute usually pales in comparison to inference compute. The former is a large cost you pay once and then you're done; the latter is a continuous cost you pay for as long as you host your LLM. Given enough time, inference compute will always pull ahead of training compute.
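To put rough numbers on that trade-off, here's a minimal sketch using the common approximations of ~6·N·D FLOPs for training and ~2·N FLOPs per generated token for inference; the lifetime token count and the "hypothetical Turbo" configuration are assumptions for illustration, not anything OpenAI has published:

```python
# Rough training-vs-inference compute comparison for two hypothetical models.
# Uses the common approximations: training ~ 6 * params * train_tokens FLOPs,
# inference ~ 2 * params FLOPs per generated token. The "tokens served" figure
# is a made-up illustration.

def training_flops(params: float, train_tokens: float) -> float:
    return 6 * params * train_tokens

def inference_flops(params: float, served_tokens: float) -> float:
    return 2 * params * served_tokens

SERVED_TOKENS = 1e15  # lifetime tokens served by the deployed model (assumed)

models = {
    "70B params, 2T tokens (Llama-2-70B-like)": (70e9, 2e12),
    "20B params, 13T tokens (hypothetical Turbo)": (20e9, 13e12),
}

for name, (params, train_tokens) in models.items():
    train = training_flops(params, train_tokens)
    serve = inference_flops(params, SERVED_TOKENS)
    print(f"{name}: train {train:.2e} FLOPs, serve {serve:.2e} FLOPs, "
          f"total {train + serve:.2e} FLOPs")

# The overtrained 20B model costs more to train (~1.6e24 vs ~8.4e23 FLOPs),
# but serving it is so much cheaper that its total compute comes out lower
# once enough tokens have been generated.
```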
If you take a look at the scaling equations they used (the exact constants may vary between model architectures and datasets, but they still give a reasonably good approximation), then for a model with N parameters trained on D tokens, the loss is given by (see eq. 10 in https://arxiv.org/abs/2203.15556):
L(N, D) = 1.69 + 406.4 / N^0.34 + 410.7 / D^0.28
If we take Llama 2 70B's values (70B parameters, 2T training tokens) and plug them in, we end up with:
L(70*10^9, 2*10^12) = 1.69 + 406.4 / (70*10^9)^0.34 + 410.7 / (2*10^12)^0.28 = 1.9211
By comparison, if we take Turbo's values and plug them in (here I'll use 13T training tokens, since that's the popular estimate for GPT-4's training set size, so I'll assume they used it for Turbo as well), we end up with:
L(20*10^9, 13*10^12) = 1.69 + 406.4 / (20*10^9)^0.34 + 410.7 / (13*10^12)^0.28 = 1.905
So in this case, Turbo actually does end up coming out ahead of Llama 2 by virtue of the larger training corpus. It also means that if future models significantly increase the pretraining dataset size (whether that's Llama 3, Llama 4, Mistral, or some other model), there's a very real chance that smaller models will reach this level of quality in the future.
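For anyone who wants to reproduce those numbers, here's the same calculation as a tiny script (constants taken directly from eq. 10 of the Chinchilla paper, as quoted above):

```python
# Chinchilla loss estimate (eq. 10 of arXiv:2203.15556) with the fitted
# constants quoted above: L(N, D) = 1.69 + 406.4 / N^0.34 + 410.7 / D^0.28

def chinchilla_loss(params: float, tokens: float) -> float:
    return 1.69 + 406.4 / params**0.34 + 410.7 / tokens**0.28

print(chinchilla_loss(70e9, 2e12))   # Llama 2 70B on 2T tokens  -> ~1.921
print(chinchilla_loss(20e9, 13e12))  # hypothetical 20B on 13T   -> ~1.905
```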
Have you read the Orca paper?
The main question is why it's priced so far below Davinci, which is 175B.
There's still a lot of room for models to be trained on more data. Take a look at the Llama papers: at the time training was stopped, the loss was still going down. Mistral is on par with Llama 2 13B to Llama 1 30B, and it's a measly 7B model. If GPT-4 truly has a dataset of 13T tokens, the scaling-law equations from the Chinchilla paper show that a 20B model trained on 13T tokens would reach a lower loss than a 70B model trained on 2T tokens. Llama 1 already showed that a 7B model could outperform previous open-source models (GPT-J-6B, Fairseq-13B, GPT-NeoX-20B, OPT-66B) just by virtue of training on more data, and that's the reason the Llamas are so good to begin with.
Model size is important, sure, but there's a lot more than parameter count that goes into training a good model.