My understanding is basically that they are datasets the model is compared against. Say you wanted to see how well you knew math: you'd take a math test, and your answers would be compared to an answer key...
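In code terms, the "answer key" idea looks something like this. This is a minimal sketch with made-up questions and simple exact-match scoring, not any benchmark's actual harness:

```python
# Minimal sketch of how benchmark scoring works in principle:
# the model's answers are compared against a reference "answer key".
# The questions and answers here are made up for illustration.

answer_key = {
    "What is 7 * 8?": "56",
    "What is 144 / 12?": "12",
    "What is 13 + 29?": "42",
}

model_answers = {
    "What is 7 * 8?": "56",
    "What is 144 / 12?": "12",
    "What is 13 + 29?": "41",  # the model gets this one wrong
}

# Exact-match scoring: one point per answer that matches the key.
correct = sum(
    model_answers[q].strip() == a.strip() for q, a in answer_key.items()
)
accuracy = correct / len(answer_key)
print(f"Score: {correct}/{len(answer_key)} ({accuracy:.0%})")  # Score: 2/3 (67%)
```

Real benchmarks differ mainly in what counts as a "match" (exact string, multiple-choice pick, extracted number, human judgment), but the comparison-to-a-key idea is the same.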
Some of my notes on these benchmarks:
GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems created by human problem writers (there's a scoring sketch after these notes).
HellaSwag is a benchmark for commonsense reasoning in large language models: the model has to pick the most plausible continuation of a short passage.
TruthfulQA is a benchmark that measures whether a language model is truthful in generating answers to questions, i.e., whether it avoids repeating common misconceptions.
WinoGrande is another commonsense reasoning benchmark, built from fill-in-the-blank pronoun-resolution sentences with two answer options (a scaled-up version of the Winograd Schema Challenge).
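For a concrete feel of how one of these is scored: GSM8K reference solutions end with the final answer after a "####" marker, and grading typically extracts that number and compares it to the number the model produced. A minimal sketch of that idea (the question and solution texts here are made up for illustration):

```python
import re

# Sketch of GSM8K-style grading: pull the final number out of the
# reference solution (after the "####" marker) and out of the model's
# solution, then compare them.

reference_solution = (
    "Each box holds 6 eggs and there are 4 boxes, "
    "so 6 * 4 = 24 eggs in total.\n#### 24"
)
model_solution = "4 boxes * 6 eggs = 24 eggs, so the answer is 24."

def final_number(text: str) -> str | None:
    """Return the last number in the text, with commas stripped."""
    numbers = re.findall(r"-?\d+\.?\d*", text.replace(",", ""))
    return numbers[-1] if numbers else None

gold = final_number(reference_solution.split("####")[-1])
pred = final_number(model_solution)
print("correct" if gold == pred else "wrong")  # correct
```

So even for a free-form benchmark like GSM8K, it still boils down to comparing against an answer key; the extraction step just makes the comparison tolerant of the model's wording.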