this post was submitted on 28 Nov 2023
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
I am working on just such a tool... but it's not ready yet. I'm building a CLI tool that lets you just run `$ ai evals run humanevalsplus openhermes-2.5` and you're good to go. It uses llama.cpp.
I would be interested in using such a tool (especially if it's possible to pass custom options to llama.cpp and to ask for custom models to be loaded).
Would it be possible to do something like this?
I provide a list of models: OpenHermes-2.5-Mistral-7B, Toppy-7B, OpenHermes-2.5-AshhLimaRP-Mistral-7B, Noromaid-v0.1.1-20B, Noromaid-v1.1-13B
The tool downloads every model from HF in every quantization, runs the tests, and produces a table with the results (including the failed ones)
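The proposed workflow could be sketched roughly like this. Everything here is hypothetical: `run_eval` is a stand-in for whatever the CLI would actually call into llama.cpp, and the quantization list is an assumed subset rather than "every quantization":

```python
# Hypothetical sketch of the proposed workflow: iterate over the model list and
# a set of common GGUF quantizations, run each eval, and collect a results table.
from itertools import product

MODELS = [
    "OpenHermes-2.5-Mistral-7B",
    "Toppy-7B",
    "OpenHermes-2.5-AshhLimaRP-Mistral-7B",
    "Noromaid-v0.1.1-20B",
    "Noromaid-v1.1-13B",
]
QUANTS = ["Q4_K_M", "Q5_K_M", "Q8_0"]  # assumed subset of available quants

def run_eval(model: str, quant: str) -> str:
    """Placeholder: a real version would download the GGUF from HF,
    load it with llama.cpp, and return a score (pass@1, failures, etc.)."""
    return "pending"

def results_table() -> list[tuple[str, str, str]]:
    # One row per (model, quantization) combination.
    return [(m, q, run_eval(m, q)) for m, q in product(MODELS, QUANTS)]

for model, quant, score in results_table():
    print(f"{model:<40} {quant:<8} {score}")
```

With 5 models and 3 quants that's already 15 full benchmark runs, which hints at how expensive the "every quantization" version would get.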
This can kinda be done, but it's not as simple as just that. In many cases you would also need to infer the prompt templates. Also, many/most benchmarks are designed with untuned models in mind, meaning you typically need to add a system prompt/instructions, and that adds complexity too, because the best prompt for one model is likely different from the next. Mixing chat vs. instruct vs. base models in the same eval would also be... meh. That said, I think there is value in this, and I'm working on it as part of my CLI tool, with some warnings that the results might be less than quantitative.
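To show why template inference is the annoying part: the same benchmark prompt has to be wrapped differently per model family before it ever reaches llama.cpp. A minimal sketch, where the two template strings are illustrative formats (ChatML-style and Alpaca-style), not values pulled from any particular model card:

```python
# Two different chat template styles wrapping the same eval prompt.
# Picking the wrong one silently degrades a model's benchmark score.
TEMPLATES = {
    "chatml": (
        "<|im_start|>system\n{system}<|im_end|>\n"
        "<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    ),
    "alpaca": "{system}\n\n### Instruction:\n{prompt}\n\n### Response:\n",
}

def render(template_name: str, system: str, prompt: str) -> str:
    """Wrap a raw benchmark prompt in the given model's expected format."""
    return TEMPLATES[template_name].format(system=system, prompt=prompt)

# Same task, two very different raw inputs:
task = "Write a function that adds two numbers."
print(render("chatml", "You are a coding assistant.", task))
print(render("alpaca", "You are a coding assistant.", task))
```

A tool like this would need a per-model mapping (or a way to read the template from the model's metadata) before any cross-model table is comparable.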