LocalLLaMA

Community for discussing Llama, the family of large language models created by Meta AI.

Training on the rephrased test set is all you need: 13B models can reach GPT-4 performance in benchmarks with no contamination detectable by traditional methods (lmsys.org)

submitted 10 months ago by Covid-Plannedemic_@alien.top to c/localllama@poweruser.forum

[–] its_just_andy@alien.top 1 points 10 months ago (2 children)

if you're interested in running your own models for any reason, you really should build your own evaluation dataset for the scenarios you care about.

at this point, the public benchmarks are all such a mess. Do you really care whether the model you select has the highest MMLU? Or do you only care that it's the best-performing model for the scenarios you actually need?
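A rough sketch of what a personal evaluation set could look like, in Python. Everything here is an assumption for illustration: the two test cases, the substring-match scoring, and the `dummy_model` stand-in are all hypothetical, and you would swap in whatever call actually runs your model.

```python
from typing import Callable, List, Tuple

# Hypothetical eval set: (prompt, expected substring) pairs.
# Replace these with cases taken from the scenarios you actually care about.
EVAL_SET: List[Tuple[str, str]] = [
    ("What is 2 + 2? Answer with a number.", "4"),
    ("Name the capital of France.", "Paris"),
]

def evaluate(generate: Callable[[str], str],
             cases: List[Tuple[str, str]] = EVAL_SET) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    hits = sum(
        expected.lower() in generate(prompt).lower()
        for prompt, expected in cases
    )
    return hits / len(cases)

def dummy_model(prompt: str) -> str:
    # Stand-in for a real model call (assumption, not a real API).
    return "Paris" if "France" in prompt else "I am not sure."

score = evaluate(dummy_model)
```

Substring matching is crude. For anything open-ended you would probably want a stricter check per case, but even a small set like this says more about your use case than a leaderboard number.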

[–] shibe5@alien.top 1 points 10 months ago

With so many models available, most developers and users have to pick a small subset to evaluate themselves, and that choice has to be based on existing data about the models' performance. At that stage, selecting the models with, for example, the highest MMLU scores is one way to go about it.
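As a sketch of that two-stage idea in Python: use a public score only to narrow the field, then run your own evaluation on the shortlist. The model names and numbers below are made-up placeholders, not real benchmark results, and `shortlist` is a hypothetical helper.

```python
from typing import Dict, List

def shortlist(public_scores: Dict[str, float], k: int) -> List[str]:
    """Return the k model names with the highest public benchmark score."""
    return sorted(public_scores, key=public_scores.get, reverse=True)[:k]

# Placeholder numbers for illustration only (not real MMLU results).
public_scores = {
    "model-a": 0.55,
    "model-b": 0.70,
    "model-c": 0.62,
    "model-d": 0.48,
}

candidates = shortlist(public_scores, k=2)
# Only these candidates then go through your own, scenario-specific evaluation.
```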
