LocalLLaMA

Community for discussing Llama, the family of large language models created by Meta AI.


New Microsoft codediffusion paper suggests GPT-3.5 Turbo is only 20B, good news for open source models? (alien.top)

submitted 2 years ago by obvithrowaway34434@alien.top to c/localllama@poweruser.forum

Wondering what everyone thinks in case this is true. It seems they're already beating all open source models including Llama-2 70B. Is this all due to data quality? Will Mistral be able to beat it next year?

Edit: Link to the paper -> https://arxiv.org/abs/2310.17680

https://preview.redd.it/kdk6fwr7vbxb1.png?width=605&format=png&auto=webp&s=21ac9936581d1376815d53e07e5b0adb739c3b06

[–] xadiant@alien.top 1 points 2 years ago (5 children)

No fucking way. GPT-3 has 175B params. In no way, shape or form could they have discovered the "secret sauce" to make an ultra-smart 20B model. The TruthfulQA paper suggests that bigger models are more likely to score worse, and ChatGPT's TQA score is impressively bad. I think the papers responsible for impressive open-source models are at most 12-20 months old. The Turbo version is probably quantized, that's all.

[–] FairSum@alien.top 1 points 2 years ago

The main question is: why price it so far below Davinci, which is 175B?

There's still a lot of room for models to be trained on more data. Take a look at the Llama papers: at the time training was stopped, the loss was still going down. Mistral is on par with Llama 2 13B to Llama 1 30B, and it's a measly 7B model. If GPT-4 truly has a dataset of 13T tokens, the scaling law equations from the Chinchilla paper illustrate that a 20B model trained on 13T tokens would reach lower loss levels than a 70B model trained on 2T tokens. Llama 1 already illustrated that a 7B model could outperform previous open source models (GPT-J-6B, Fairseq-13B, GPT-NeoX-20B, OPT-66B) just by virtue of training on more data, and that's the reason the Llamas are so good to begin with.

Model size is important, sure, but there are a lot of important things besides model size when it comes to training a good model.
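The Chinchilla comparison above can be checked with a few lines of Python. This is a rough sketch, assuming the parametric loss fit reported in the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta, with the constants from that fit. The function name `chinchilla_loss` is made up for illustration, the 13T token figure is the rumored number from the comment, and nothing here says the fit actually transfers to GPT-3.5 or Llama.

```python
# Parametric loss fit from the Chinchilla paper (Hoffmann et al., 2022):
#   L(N, D) = E + A / N**alpha + B / D**beta
# N = parameter count, D = training tokens.
# Constants are the paper's fitted values; applying them to other
# model families is an assumption, not something the paper guarantees.
E = 1.69
A = 406.4
B = 410.7
ALPHA = 0.34
BETA = 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


# Hypothetical 20B model on the rumored 13T tokens vs. a 70B model on 2T tokens.
small_long = chinchilla_loss(20e9, 13e12)
big_short = chinchilla_loss(70e9, 2e12)

print(f"20B params, 13T tokens: {small_long:.3f}")
print(f"70B params,  2T tokens: {big_short:.3f}")
print("20B/13T predicted lower:", small_long < big_short)
```

Under this fit the 20B/13T setup comes out at roughly 1.91 against roughly 1.92 for 70B/2T, so the gap is small but it does go in the direction claimed. The extra data more than pays for the smaller parameter count.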
