This suggests that judging a model based on a single benchmark might not provide the full picture.
Duh... This has been a recurring problem with all these "benchmark leaderboards". It turns out that "training on the testing set is all you need"...
I haven't got round to trying the Xwin coder models, but the precursor 70B chat model was extremely impressive when compared against both GPT-3.5 and GPT-4.
If you look at something like Evol-Instruct data, it's so similar to HumanEval that it'd be a surprise if models trained on it (or other synthetic data) didn't perform well. A crude overlap check is sketched below.
As a rule of thumb, I generally only trust benchmarks for base models (and even then it's iffy); for fine-tuned models, the only thing I trust is actually using them.
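The "similar to HumanEval" point can be made concrete with an n-gram overlap check, the same family of heuristic commonly used for benchmark decontamination (the GPT-3 paper, for instance, used 13-gram matching). This is a minimal sketch, not how anyone actually audited Evol-Instruct; the `humaneval_prompts` and `train_samples` lists are hypothetical placeholders:

```python
# Crude contamination heuristic: flag training samples that share any long
# word-level n-gram with a benchmark's prompts.

def ngrams(text: str, n: int = 13) -> set:
    """Return the set of word-level n-grams in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(sample: str, benchmark_grams: set, n: int = 13) -> float:
    """Fraction of `sample`'s n-grams that also appear in the benchmark."""
    grams = ngrams(sample, n)
    return len(grams & benchmark_grams) / len(grams) if grams else 0.0

# Hypothetical placeholder data -- in practice you'd load the real
# HumanEval prompts and your synthetic training set here.
humaneval_prompts = ["placeholder HumanEval prompt text"]
train_samples = ["placeholder synthetic training sample"]

benchmark_grams = set()
for prompt in humaneval_prompts:
    benchmark_grams |= ngrams(prompt)

flagged = [s for s in train_samples if overlap_ratio(s, benchmark_grams) > 0]
print(f"{len(flagged)} of {len(train_samples)} samples share a 13-gram with the benchmark")
```

Note that near-duplicates with light paraphrasing slip past exact n-gram matching, which is part of why synthetic data like Evol-Instruct can inflate scores without tripping naive decontamination filters.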