this post was submitted on 12 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

founded 10 months ago

Look at this: apart from Llama 1, all the other "base" models will likely answer "language" after "As an AI". That means Meta, Mistral AI, and 01-ai (the company behind Yi) likely trained their "base" models on GPT instruct datasets to inflate benchmark scores and make it look like the "base" models had a lot of potential. We got duped hard on that one.

https://preview.redd.it/vqtjkw1vdyzb1.png?width=653&format=png&auto=webp&s=91652053bcbc8a7b50bced9bbf8638fa417387bb

top 9 comments
[–] mcmoose1900@alien.top 1 points 10 months ago (1 children)

The problem is trusting these common benchmarks in the first place... And VCs making investing decisions based on them.

It's insane. It's like a years-old, published SAT test is the only factor for getting a job or an investment, and no one bothers to check whether you're blatantly cheating instead of cleverly cheating.

[–] Wonderful_Ad_5134@alien.top 1 points 10 months ago

I know, right? Getting that much investment for something you can so easily cheat on makes me sick.

[–] a_beautiful_rhind@alien.top 1 points 10 months ago

GPT slop gonna GPT slop.

I hate that phrase so much too. If only they'd used anything else. Some think they're being clever and change it to "as an AI".

[–] FPham@alien.top 1 points 10 months ago

Shouldn't the proof be in the pudding?

If Mistral 7B is better than most other 7b models, then they did something right, no?

I understand that the base model can then inherit some biases - but it's on them that they didn't clean those "As an AI..." answer strings from their dataset. So despite this, it performs better.

[–] trailer_dog@alien.top 1 points 10 months ago

So it turns out you just need to train on GPT output for better benchmarks lol. Not to mention there's a chance GPT models are contaminated with benchmark test data too. "Distillation" went a little too far. Easy VC money though; I would do the same.

[–] Wonderful_Ad_5134@alien.top 1 points 10 months ago

Llama 2 was pre-trained on older data (from before the ChatGPT AI poisoning was significant)

https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md

"Data Freshness The pretraining data has a cutoff of September 2022, but some tuning data is more recent, up to July 2023."

"Model Dates Llama 2 was trained between January 2023 and July 2023."

StableLM 3B was trained on more recent data (cutoff of March 2023), yet it doesn't have this amount of ChatGPT poisoning in it

https://huggingface.co/stabilityai/stablelm-base-alpha-3b-v2

https://preview.redd.it/gl46fo50n10c1.png?width=518&format=png&auto=webp&s=c7cae52b292dcba45dee735a4ca7efac5630a4ae

[–] phree_radical@alien.top 1 points 10 months ago

Interestingly, Mistral Instruct:

As an AI

### top_k:

0.686088: 13892 "assistant"
0.049313: 28725 ","
0.039010:  3842 "language"
0.037810:  2229 "model"
0.031591: 28733 "-"
0.018000:  3332 "research"
0.016518:  1587 "system"
0.009266: 21631 "Assistant"
0.006967:  7583 "expert"
0.005598:  3921 "tool"
0.004394:  8073 "agent"
0.004242:   369 "that"
0.002696:   304 "and"
0.002644:   297 "in"
0.001415:  5716 "student"
0.001410:  5514 "technology"
0.001197:  7786 "coach"
0.001073:  1918 "team"
0.001073: 24480 "scientist"
0.001052:  2818 "based"
0.001036:  2007 "program"
0.000925: 12435 "bot"
0.000819:  5181 "platform"
0.000819: 28723 "."
0.000816: 21782 "developer"
0.000813:  6031 "assist"
0.000806:  3327 "personal"
0.000803:  9464 "algorithm"
0.000776:  2488 "project"
0.000746:   354 "for"
0.000743:  8626 "teacher"
0.000666:  7511 "eth"
0.000645:  6953 "writer"
0.000640: 24989 "practition"
0.000623:  3441 "voice"
0.000621:  5024 "professional"
0.000611: 22275 "analyst"
0.000588: 15589 "Language"
0.000583:  8252 "virtual"
0.000531:  7153 "digital"
0.000525:   298 "to"
0.000523: 11108 "technique"
0.000523: 10706 "chat"
0.000521: 19899 "specialist"
0.000517:  8311 "tut"
0.000501:  1338 "person"
0.000493:  6878 "experiment"
0.000474:   325 "("
0.000460: 18112 "engineer"
0.000458:  4993 "application"
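
A table like this is just the softmax of the model's next-token logits after the prompt "As an AI". A minimal pure-Python sketch of that step (the vocabulary and logit values below are made-up toy data, not Mistral's real tokenizer or outputs; the commented transformers usage is an assumption about how you'd wire in a real checkpoint):

```python
import math

def top_next_tokens(logits, decode, k=5):
    """Softmax the raw next-token logits and return the k most probable
    tokens as (token_string, probability) pairs, highest first."""
    m = max(logits)                                 # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(decode(i), exps[i] / z) for i in ranked[:k]]

# With a real model, `logits` would come from something like
# (assuming the Hugging Face transformers library and a local checkpoint):
#   out = model(**tokenizer("As an AI", return_tensors="pt"))
#   logits = out.logits[0, -1].tolist()
#   decode = lambda i: tokenizer.decode([i])
vocab = {0: "assistant", 1: "language", 2: "model", 3: "cat"}
toy_logits = [3.0, 0.4, 0.2, -2.0]
for token, p in top_next_tokens(toy_logits, vocab.get, k=3):
    print(f"{p:.6f} {token!r}")
```

Because the probabilities are normalized over the whole vocabulary, a single dominant continuation (like "assistant" at ~0.69 above in Mistral Instruct's case) is exactly the kind of fingerprint an instruct-data leak would leave in a base model.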

[–] arekku255@alien.top 1 points 10 months ago

"As an AI language model" is pretty much a meme at this point.

A base model catching on to it is disappointing but not completely unexpected.

In fact, there have been several sources in the past highlighting the upcoming issue of ChatGPT output creeping into future datasets, and here we are with proof: what we were warned about 6+ months ago has now happened.

[–] metaprotium@alien.top 1 points 10 months ago

It's almost a shame ChatGPT blew up the way it did. "AI" became a buzzword and every company found a way to shove it into their business model. Now the future of NLP is cloudy because it's become an ouroboros of data. I think dataset selection and cleaning will become a more important area of research. I'd be surprised if "shoving terabytes of raw webscraper data" continues being feasible in the future.