this post was submitted on 29 Nov 2023

LocalLLaMA

Community for discussing Llama, the family of large language models created by Meta AI.

When we benchmark different LLMs on different datasets (MMLU, TriviaQA, MATH, HellaSwag, etc.), what do the scores actually mean? Is it accuracy, or some other metric? And how can I find out which metric each dataset (MMLU, etc.) uses?

https://preview.redd.it/5glmddnwsb3c1.png?width=2158&format=png&auto=webp&s=fcaf6e55d62445f3007380f06649455b29f8b2ec

[–] RexRecruiting@alien.top 1 points 11 months ago (1 children)

My understanding is that they are basically datasets the model is tested against. Say you wanted to see how well you knew math: you would take a math test, and then your answers would be compared against an answer key...
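To make the answer-key idea concrete, here is a minimal sketch of how an accuracy-style score could be computed. The function name, the normalization (trim and lowercase), and the sample answers are all made up for illustration; real benchmarks each define their own scoring rules, so check the dataset's paper or the evaluation tool you use.

```python
def exact_match_accuracy(predictions, answer_key):
    """Fraction of predictions that equal the reference answer.

    Hypothetical helper: comparison ignores case and surrounding whitespace.
    """
    if len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must have the same length")
    if not answer_key:
        raise ValueError("need at least one question")
    correct = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, answer_key)
    )
    return correct / len(answer_key)


# Made-up model outputs for four multiple-choice questions
model_answers = ["B", "c", "A", "D"]
answer_key = ["B", "C", "D", "D"]

score = exact_match_accuracy(model_answers, answer_key)
print(f"accuracy = {score:.2%}")
```

Three of the four answers match the key here, so the score is 0.75. Multiple-choice benchmarks such as MMLU and HellaSwag are usually reported as this kind of accuracy, while free-form ones can use other scoring rules.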

Some of my notes about those benchmarks:

GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers.

HellaSwag is a benchmark for commonsense reasoning: the model has to pick the most plausible continuation of a short scenario.

TruthfulQA is a benchmark that measures whether a language model is truthful when generating answers to questions.

WinoGrande: commonsense reasoning, tested through fill-in-the-blank pronoun resolution problems.

[–] shaman-warrior@alien.top 1 points 11 months ago

Everything is common sense reasoning; we need better definitions.

[–] ThisGonBHard@alien.top 1 points 11 months ago

Nothing, sadly.

Models are trained on the test questions to improve their scores, which makes the tests moot.
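One rough way to look for this kind of leakage, assuming you can get a sample of the training text (often you can't), is to check how many of a benchmark question's word n-grams appear verbatim in that text. This is only a sketch: the function names, the choice of n=5, and the example strings are all invented, and real contamination checks are more involved.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in text (split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def ngram_overlap(question, corpus_text, n=5):
    """Share of the question's word n-grams that also occur in corpus_text."""
    question_grams = word_ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n words: nothing to compare
    shared = question_grams & word_ngrams(corpus_text, n)
    return len(shared) / len(question_grams)


# Invented examples: one text that contains the question verbatim, one unrelated
question = "What is the capital of France and when was it founded"
leaked = "quiz dump: what is the capital of france and when was it founded answer paris"
clean = "a short article about growing tomatoes in a small garden"

print(ngram_overlap(question, leaked))
print(ngram_overlap(question, clean))
```

An overlap near 1.0 suggests the question appears almost word for word in the text; a value near 0.0 only says this exact wording was not found, not that the model never saw a paraphrase.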