LocalLLaMA

Open LLM Leaderboard vs Reality: How do you evaluate "good"? (alien.top)

submitted 2 years ago by BlueMetaMind@alien.top to c/localllama@poweruser.forum

As a beginner, I appreciate that there are metrics for all these LLMs out there so I don't waste time downloading and trying failures. However, I noticed that the Leaderboard doesn't exactly reflect reality for me. YES, I DO UNDERSTAND THAT IT DEPENDS ON MY NEEDS.

I mean really basic stuff: whether the LLM acts as a coherent agent, follows instructions, and grasps context in any given situation. That is often lacking in the LLMs I have tried so far, for example 01-ai/Yi-34B, the board's leader among 30B models. I guess something similar is going on to what used to happen with GPU benchmarks: dirty tricks and over-optimization for the tests.

I am interested in how more experienced people here evaluate an LLM's fitness. Do you have a battery of questions and instructions you try out first?

[–] dbinokc@alien.top 1 points 2 years ago

My primary interest in an LLM is coding, specifically Java. I have a series of questions I test with, generally involving generating code from JSON, creating simple examples in Spring, and database connectivity. Even though it is probably a bit dated, I have found openbuddy coder to work the best so far among open source LLMs, even beating the newer open source models for my needs. I think it even holds up respectably against ChatGPT4 for my coding tasks.
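
For anyone who wants to turn a personal question list like this into something repeatable, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Case` structure, the three sample prompts, the pass/fail checks, and `fake_model`, which is only a stand-in for whatever call actually sends a prompt to your local model. It is not the test set described above, just one possible shape for one.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    """One test prompt plus a function that judges the reply (hypothetical structure)."""
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the reply is acceptable


def is_json_with_keys(reply: str, keys: List[str]) -> bool:
    """True if the reply parses as a JSON object containing every expected key."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in keys)


def run_battery(generate: Callable[[str], str], cases: List[Case]) -> Dict[str, bool]:
    """Send each prompt through `generate` and record pass/fail per case."""
    results = {}
    for case in cases:
        try:
            results[case.name] = bool(case.check(generate(case.prompt)))
        except Exception:
            # A crash in the model call or the check counts as a failure.
            results[case.name] = False
    return results


# Sample cases (assumed, not from the thread): instruction following,
# structured output, and a small Java generation task.
CASES = [
    Case("follows_format", "Reply with only the word OK.",
         lambda r: r.strip() == "OK"),
    Case("json_output", 'Return a JSON object with keys "name" and "age".',
         lambda r: is_json_with_keys(r, ["name", "age"])),
    Case("java_class", "Write a Java class named User with a getName() method.",
         lambda r: "class User" in r and "getName()" in r),
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; replace with your own backend."""
    if "JSON" in prompt:
        return '{"name": "Ada", "age": 36}'
    if "Java" in prompt:
        return "public class User { private String name; public String getName() { return name; } }"
    return "OK"


if __name__ == "__main__":
    for name, passed in run_battery(fake_model, CASES).items():
        print(f"{name}: {'pass' if passed else 'FAIL'}")
```

String checks like these are crude (they would not catch Java that fails to compile), so they probably work best as a quick first filter before reading the outputs by hand. Running the same cases against each new model at least makes the comparison consistent.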