this post was submitted on 20 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.
 

As a beginner, I appreciate that there are metrics for all these LLMs out there so I don't waste time downloading and trying failures. However, I noticed that the Leaderboard doesn't exactly reflect reality for me. YES, I DO UNDERSTAND THAT IT DEPENDS ON MY NEEDS.

I mean really basic stuff: how the LLM acts as a coherent agent, follows instructions, and grasps context in any given situation. That is often lacking in the LLMs I have tried so far, like the board's leader for ~30B models, 01-ai/Yi-34B, for example. I guess something similar is going on as with GPU benchmarks back in the day: dirty tricks and over-optimization for the tests.

I am interested in how more experienced people here evaluate an LLM's fitness. Do you have a battery of questions and instructions you try out first?

top 10 comments
[–] LoSboccacc@alien.top 1 points 10 months ago

I have a Python script that runs a fixed dialogue with a mix of turn-by-turn instructions, comprehension tasks like recall or summarisation, and a few reasoning tasks. I package everything in Vicuna format (USER: / ASSISTANT:), then send it to GPT-4 and ask: "This is a chat between a user and an assistant; evaluate each assistant response individually for coherence and consistency, write a score out of 10, and list the problems you find." Then I take the minimum score over 10 samples.
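
Not the commenter's actual script, but a minimal sketch of the same idea, assuming the openai Python client (>= 1.0) and a plain-text Vicuna-style transcript; the judge prompt wording, model name, and score parsing are my own guesses:

```python
# Sketch of the "GPT-4 as judge" approach described above.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_INSTRUCTION = (
    "This is a chat between a user and an assistant. Evaluate each assistant "
    "response individually for coherence and consistency, give each a score "
    "out of 10, and list the problems you find."
)

def judge_transcript(transcript: str, samples: int = 10) -> int:
    """Score a USER:/ASSISTANT: transcript several times with GPT-4
    and return the minimum score seen across all samples."""
    min_score = 10
    for _ in range(samples):
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": JUDGE_INSTRUCTION},
                {"role": "user", "content": transcript},
            ],
        )
        text = resp.choices[0].message.content
        # Pull out every "N/10" the judge wrote and keep the worst one.
        scores = [int(s) for s in re.findall(r"(\d+)\s*/\s*10", text)]
        if scores:
            min_score = min(min_score, min(scores))
    return min_score
```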

[–] VertexMachine@alien.top 1 points 10 months ago (1 children)

You either use standardized benchmarks like that leaderboard (which are useful but limited), or you build an application-specific benchmark. The latter is most often very, very time- and work-consuming to do right. Evaluating NLP systems in general is a very hard problem.

Some people use more powerful models to evaluate weaker ones. E.g., use GPT-4 to evaluate the output of Llama. Depending on your task it might work well. I recently ran an early version of an experiment with around 20 models for text summarization, where GPT-4 and I both evaluated the summaries (on a predefined scale, with predefined evaluation criteria). I haven't calculated a proper measure of inter-annotator agreement yet, but looking at the evals side by side, it's really high.
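
The inter-annotator agreement mentioned here could be computed along these lines, assuming human and GPT-4 ratings have been collected on the same discrete scale (the scale and example scores below are purely illustrative):

```python
# Sketch: agreement between human and GPT-4 ratings of the same summaries.
# The example scores are made up; a real run would load the actual ratings.
from sklearn.metrics import cohen_kappa_score

human_scores = [5, 4, 4, 2, 5, 3, 4, 1]   # e.g. 1-5 quality ratings per summary
gpt4_scores  = [5, 4, 3, 2, 5, 3, 4, 2]

# Weighted kappa treats "off by one" as less serious than "off by three",
# which suits an ordinal quality scale.
kappa = cohen_kappa_score(human_scores, gpt4_scores, weights="quadratic")
print(f"Quadratic-weighted Cohen's kappa: {kappa:.2f}")
```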

Or if you are just playing around, you can simply write or search for a post on Reddit (or the various LLM-related Discords) asking for the best model for your task :D

[–] BlueMetaMind@alien.top 1 points 10 months ago

Or if you are just playing around, you can simply write or search for a post on Reddit (or the various LLM-related Discords) asking for the best model for your task :D

I made this post as an attempt to collect best practices and ideas.

use GPT-4 to evaluate the output of Llama.

That's probably always a good option, but I try to avoid using OpenAI altogether.

[–] ttkciar@alien.top 1 points 10 months ago (1 children)

When I look at the leaderboard, I mostly pay attention to TruthfulQA, as it seems most predictive of models which are good for my use-case. YMMV of course.

Once I've downloaded a model, I'll fiddle around with different llama.cpp parameters and prompt templates, figuring out what works best for it, and then send it through my test framework, which has it infer five times on each of several prompts.
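
A minimal sketch of that kind of repeat-inference harness, assuming llama-cpp-python as the backend; the model path, prompt template, questions, and sampling parameters are placeholders:

```python
# Sketch: run each test prompt five times through a local GGUF model and
# save the raw completions for manual review.
from llama_cpp import Llama

llm = Llama(model_path="models/some-model.Q4_K_M.gguf", n_ctx=4096)  # placeholder path

PROMPT_TEMPLATE = "USER: {question}\nASSISTANT:"   # adjust per model
QUESTIONS = [
    "Summarise the plot of Hamlet in three sentences.",
    "List three practical uses of a binary search tree.",
]

results = {}
for q in QUESTIONS:
    prompt = PROMPT_TEMPLATE.format(question=q)
    results[q] = [
        llm(prompt, max_tokens=256, temperature=0.7, top_p=0.95)["choices"][0]["text"]
        for _ in range(5)   # five inferences per prompt, as described above
    ]

for q, replies in results.items():
    print(f"=== {q} ===")
    for i, reply in enumerate(replies, 1):
        print(f"--- run {i} ---\n{reply.strip()}\n")
```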

Evaluation of the test results is fairly subjective, but there are some obvious problems which recur, like not inferring an answer at all, or inferring a new user prompt for itself to answer.

I just finished a compare-and-contrast of Marx-3B vs Marx-3B-v3 using that test framework, which you can see (along with raw test results) here: https://old.reddit.com/r/LocalLLaMA/comments/17xsliz/marx_3b_v3_and_akins_3b_gguf_quantizations/ka2fd19/

I've been meaning to add some simple assessment logic to my test framework, which tries to guess at the quality of inferred replies, but haven't made it a priority.
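
The kind of simple assessment logic mentioned here might look something like the following; the specific checks are just examples based on the recurring failure modes described above:

```python
# Sketch: cheap heuristic checks for obvious failure modes -- empty answers,
# the model inventing a new user turn, and heavy repetition.
def assess_reply(reply: str) -> list[str]:
    problems = []
    text = reply.strip()
    if not text:
        problems.append("no answer inferred")
    if "USER:" in text or "\nUser:" in text:
        problems.append("model wrote itself a new user prompt")
    # Crude repetition check: any non-empty line repeated more than three times.
    lines = [l for l in text.splitlines() if l.strip()]
    if lines and max(lines.count(l) for l in set(lines)) > 3:
        problems.append("heavy repetition")
    return problems
```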

[–] BlueMetaMind@alien.top 1 points 10 months ago

What are the top 3 best open source LLMs in your opinion?

[–] WolframRavenwolf@alien.top 1 points 10 months ago (1 children)

I test and compare models in-depth, still hard at work on my 70B-120B evaluation. Take a look at one of my recent posts, where I explain my testing methodology in detail.

[–] No-Belt7582@alien.top 1 points 9 months ago

You are famous everywhere for those comparisons.

[–] dbinokc@alien.top 1 points 10 months ago

My primary interest in an LLM is coding, specifically Java. I have a series of questions I test with, generally involving generating code from JSON, creating simple Spring examples, and database connectivity. Even though it is probably a bit dated, I have found OpenBuddy Coder to work the best so far among open-source LLMs, even beating out the newer open-source models for my needs. I think it even does a respectable job versus ChatGPT-4 for my coding tasks.

[–] Honato2@alien.top 1 points 10 months ago

The leaderboards are pretty much useless; trickery and training specifically for the leaderboard kind of ruins the whole point of them.

First I have the model do some weird RP shit, namely impersonating the Macho Man Randy Savage and cutting a promo on a random subject. If it does well, it gets 1 point. If it fails, -3 points.

Next is trying for a conversation with the same scoring system. If it stays coherent, it passes, with bonus points if it keeps character the entire time.

Lastly, some simple coding tasks: if the code works out of the box, 3 points; if it needs endless bug fixing, -5.

With points scattered in or taken away arbitrarily based on a whim.

Impersonation and cutting promos is pretty effective, with the bonus perk of: who the fuck would ever train a model to pass that test? It's a benchmark random enough to be possible but not something anyone trains for. Also, it's usually pretty entertaining.
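
If one wanted to keep track of that (admittedly whim-driven) rubric, a scorecard could be as simple as the sketch below; the point values mirror the ones above, everything else is made up:

```python
# Sketch: tally the ad-hoc rubric described above for one model.
RUBRIC = {
    "macho_man_promo": (1, -3),          # (points on success, points on failure)
    "coherent_conversation": (1, -3),
    "code_works_out_of_the_box": (3, -5),
}

def score_model(results: dict[str, bool], whim_adjustment: int = 0) -> int:
    """results maps each test name to True (passed) or False (failed)."""
    total = sum(RUBRIC[test][0] if passed else RUBRIC[test][1]
                for test, passed in results.items())
    return total + whim_adjustment

print(score_model(
    {"macho_man_promo": True,
     "coherent_conversation": True,
     "code_works_out_of_the_box": False},
    whim_adjustment=-1,  # points scattered in or taken away on a whim
))
```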

[–] BalorNG@alien.top 1 points 10 months ago

Technically, you can somewhat automate the testing process by creating a script that makes the model answer a series of questions that are relevant to YOU and are unique (so they cannot be gamed by training for benchmarks), and then evaluate the answers yourself.

Make sure you experiment with different sampling methods and run several tests, due to the inherent randomness of the output.
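
A rough sketch of such a script, with a hypothetical generate() hook standing in for whatever backend you use (llama.cpp, an HTTP API, etc.), and with made-up questions and sampling settings:

```python
# Sketch: run a personal question set under several sampling configurations,
# with a few repeats each, and dump everything for manual evaluation.
import json

MY_QUESTIONS = [
    "Explain the difference between a mutex and a semaphore.",
    "Write a limerick about garbage collection.",
]

SAMPLING_CONFIGS = [
    {"temperature": 0.2, "top_p": 0.9},
    {"temperature": 0.8, "top_p": 0.95},
]

REPEATS = 3  # several runs per setting, since the output is stochastic

def generate(prompt: str, **sampling) -> str:
    """Hypothetical hook: call your local model here (llama.cpp, an API, ...)."""
    raise NotImplementedError

runs = []
for question in MY_QUESTIONS:
    for config in SAMPLING_CONFIGS:
        for i in range(REPEATS):
            runs.append({
                "question": question,
                "sampling": config,
                "run": i,
                "answer": generate(question, **config),
            })

with open("eval_runs.json", "w") as f:
    json.dump(runs, f, indent=2)  # review and score the answers yourself
```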