this post was submitted on 20 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.

As a beginner, I appreciate that there are metrics for all these LLMs out there, so I don't waste time downloading and trying out failures. However, I've noticed that the leaderboard doesn't exactly reflect reality for me. YES, I DO UNDERSTAND THAT IT DEPENDS ON MY NEEDS.

I mean really basic stuff: whether the LLM acts as a coherent agent, follows instructions, and grasps context in any given situation. That is often lacking in the LLMs I have tried so far, including the board's leader for 30B models, 01-ai/Yi-34B. I suspect something similar is going on here as with GPU benchmarks back in the day: dirty tricks and over-optimization for the tests.

I am interested in how more experienced people here evaluate an LLM's fitness. Do you have a battery of questions and instructions you try out first?

Honato2@alien.top, 10 months ago

The leaderboards are pretty much useless. Trickery and training to the leaderboard kind of ruin the whole point of it.

First, I have the model do some weird RP shit, namely impersonating "Macho Man" Randy Savage and cutting a promo on a random subject. If it does well, it gets 1 point; if it fails, -3 points.

Next, I try a conversation with the same scoring system. If it stays coherent, it passes; bonus points if it keeps character the entire time.

Lastly, some simple coding tasks: 3 points if the code works out of the box, -5 if it needs endless bug fixing.

Beyond that, points get scattered in or taken away arbitrarily, on a whim.
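
If you wanted to semi-automate that kind of rubric, here's a minimal sketch of the idea in Python. It assumes a local OpenAI-compatible chat endpoint (e.g., a llama.cpp server or text-generation-webui); the URL, model name, prompts, and point values are illustrative placeholders, and the pass/fail call stays manual and subjective, as above.

```python
# Minimal personal-eval sketch. Assumes a local OpenAI-compatible
# chat endpoint; adjust API_URL and the model name for your setup.
import requests

API_URL = "http://localhost:8080/v1/chat/completions"  # assumed local server

# (name, prompt, points on pass, points on fail) -- mirrors the rubric above
TESTS = [
    ("rp_promo",
     "You are 'Macho Man' Randy Savage. Cut a promo about doing your taxes.",
     1, -3),
    ("conversation",
     "Let's just chat. I'll start: what's your take on home coffee roasting?",
     1, -3),
    ("coding",
     "Write a Python function that reverses the order of words in a sentence.",
     3, -5),
]

def ask(prompt: str) -> str:
    """Send one prompt to the local model and return its reply text."""
    resp = requests.post(API_URL, json={
        "model": "local-model",  # whatever your server exposes
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.8,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def main() -> None:
    total = 0
    for name, prompt, pass_pts, fail_pts in TESTS:
        print(f"\n=== {name} ===\n{ask(prompt)}\n")
        # The judgment is still a human eyeball, per the rubric.
        verdict = input("Pass? [y/N] ").strip().lower()
        total += pass_pts if verdict == "y" else fail_pts
    print(f"\nFinal score: {total}")

if __name__ == "__main__":
    main()
```

The script only saves you the copy-pasting; the scoring itself stays as arbitrary as intended.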

Impersonation and cutting promos is pretty effective, with the bonus perk that who the fuck would ever train a model to pass that test? It's a benchmark random enough to be possible but not something anyone would train for. It's also usually pretty entertaining.