this post was submitted on 14 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

top 10 comments
[–] DreamGenX@alien.top 1 points 10 months ago

It's inevitable that people will game the system when it's so easy and the payoff can be huge. Not so long ago, people could still get huge VC checks just for showing off GitHub stars or benchmark numbers.

[–] Maykey@alien.top 1 points 10 months ago
[–] ambient_temp_xeno@alien.top 1 points 10 months ago

To be fair, it's pretty clear that OpenAI updates its models with every kind of test people throw at them as well.

[–] its_just_andy@alien.top 1 points 10 months ago (2 children)

If you're interested in running your own models for any reason, you really should build your own evaluation dataset for the scenarios you care about.

At this point, all the public benchmarks are such a mess. Do you really care whether the model you select has the highest MMLU score? Or do you only care that it's the best-performing model for the scenarios you actually need?
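
A do-it-yourself eval loop doesn't have to be fancy. Here's a minimal sketch (not anyone's actual harness): it assumes a local OpenAI-compatible chat endpoint such as the one llama.cpp's server exposes on localhost:8080, and a hypothetical my_eval.jsonl file with one {"prompt": ..., "expected": ...} object per line.

```python
# Minimal custom-eval sketch: score a local model on your own prompt/answer pairs.
import json
import urllib.request

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # adjust to wherever your local server listens


def ask(prompt: str) -> str:
    """Send one prompt to the local OpenAI-compatible endpoint and return the reply text."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,  # keep output as deterministic as possible for scoring
        "max_tokens": 256,
    }
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


def main() -> None:
    # Each line of the (hypothetical) my_eval.jsonl: {"prompt": "...", "expected": "..."}
    with open("my_eval.jsonl", encoding="utf-8") as f:
        cases = [json.loads(line) for line in f if line.strip()]

    correct = 0
    for case in cases:
        answer = ask(case["prompt"])
        # Crude substring match; swap in whatever scoring fits your scenarios.
        if case["expected"].lower() in answer.lower():
            correct += 1

    print(f"{correct}/{len(cases)} passed ({correct / len(cases):.0%})")


if __name__ == "__main__":
    main()
```

Substring matching is obviously crude; the point is that the prompts and the scoring rule should both reflect your own use cases rather than a public benchmark's.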

[–] Exios-@alien.top 1 points 10 months ago

This seems to me like the most logical conclusion. I'm currently developing a set of moral/ethical dilemma scenarios to probe different perspectives and response strategies. For my personal use case of discussing topics, breaking them down into manageable pieces, and then exploring the nuances, it is very effective. That seems far too broad a "use case" to define one set of benchmarks for, unless the benchmark is incredibly comprehensive and refined over and over as trends develop.

[–] shibe5@alien.top 1 points 10 months ago

With the abundance of models, most developers and users have to select a small subset of available models for their own evaluation, and that selection has to be based on performance data that is already published. At that stage, picking the models with, for example, the highest MMLU score is one way to go about it.
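
As a rough illustration of that first-pass filtering (a sketch only, assuming a hypothetical leaderboard.csv export with model, params_b, and mmlu columns; the cutoffs are arbitrary):

```python
# Shortlist models from published benchmark scores before doing any hands-on evaluation.
import csv

MIN_MMLU = 65.0      # arbitrary minimum published MMLU score, for illustration
MAX_PARAMS_B = 14    # largest size (billions of parameters) you can actually run locally

with open("leaderboard.csv", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

shortlist = [
    row["model"]
    for row in rows
    if float(row["mmlu"]) >= MIN_MMLU and float(row["params_b"]) <= MAX_PARAMS_B
]
print("Candidates for hands-on evaluation:", shortlist)
```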

[–] SlowSmarts@alien.top 1 points 10 months ago

Huh... I figured this had already been happening for a while with closed-dataset LLMs. In my experience, the leaderboard has not directly indicated a model's ability to do real-world work. Some of the lower-ranking models seem to do better with what I put them through than the top-ranking ones. Just my personal opinion and observation.

[–] amroamroamro@alien.top 1 points 10 months ago

When a measure becomes a target, it ceases to be a good measure

[–] LienniTa@alien.top 1 points 10 months ago

Yeah, people praise 7B and 13B models here and there, but... they just hallucinate! Then Goliath 120B, no matter how terrible the idea behind it was, is just really good in normal conversations. I'm trying to love the much-praised OpenHermes 2.5 and other Mistral finetunes, but they are just better next-token predictors, unlike larger models, which are actually able to reason.

[–] LosingID_583@alien.top 1 points 10 months ago

Benchmark test questions can't be made public. It's too easy to cheat.