I used Charles Goddard's mergekit.
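If anyone wants to see roughly what that looks like, here's a minimal sketch of a passthrough-style layer-interleave merge driven from Python. The model names and layer ranges are placeholders, not the actual goliath recipe.

```python
# Minimal sketch of a mergekit passthrough (layer-stacking) merge.
# The model names and layer ranges here are placeholders, not the real recipe.
import subprocess
import yaml

config = {
    "merge_method": "passthrough",  # stack layers as-is instead of averaging weights
    "dtype": "float16",
    "slices": [
        # alternate blocks of layers from the two donor models
        {"sources": [{"model": "donor-model-A-70b", "layer_range": [0, 20]}]},
        {"sources": [{"model": "donor-model-B-70b", "layer_range": [10, 30]}]},
        # ...continue interleaving until both models' layers are covered...
    ],
}

with open("merge.yaml", "w") as f:
    yaml.safe_dump(config, f)

# mergekit's CLI reads the YAML and writes the merged model to the output dir.
subprocess.run(["mergekit-yaml", "merge.yaml", "./merged-model"], check=True)
```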
AlpinDale
It's up on Kobold Horde, so you can give it a try yourself. Select the model from the AI menu. I think it's gonna be up for the weekend.
As I mentioned here, it'd perform poorly on benchmarks until it's gone through a few steps of full finetuning so the weight disagreement is ironed out.
Makes sense the benchmark results would be surprisingly low for goliath. After playing around with it for a few days, I've noticed two glaring issues:
- it tends to make slight spelling mistakes
- it hallucinates words

They happen rarely, but frequently enough to throw off benchmarks. I'm quite positive this can be solved by a quick full finetune over 100 or so steps, which would align the layers to work better together.
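To make that concrete, this is roughly the kind of short healing run I have in mind, sketched with the HF Trainer. The dataset, hyperparameters, and paths are placeholders, and at this scale you'd obviously need FSDP or DeepSpeed sharding on top.

```python
# Rough sketch of a short "healing" full finetune (~100 steps) to let the
# interleaved layers adapt to each other. Dataset, hyperparameters, and the
# model path are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_path = "./merged-model"  # output of the merge step
tokenizer = AutoTokenizer.from_pretrained(model_path)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")

raw = load_dataset("text", data_files={"train": "finetune_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train = raw.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="./merged-model-healed",
    max_steps=100,                   # the "quick finetune over 100 or so steps"
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,
    bf16=True,
    logging_steps=10,
    save_strategy="no",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```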
The shearing process would likely need close to 1 billion tokens of data, so I'd guess a few days on ~24x A100-80G/H100s. And if we get a ~50B model out of it, we'd need to train that on around ~100B tokens, which would need at least 10x H100s for a few weeks. Overall, very expensive.
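For anyone who wants to sanity-check those numbers, here's the usual 6 * params * tokens back-of-envelope. The peak throughput and the 40% utilization figure are assumptions, and shearing has overhead beyond plain training, so treat the output as order-of-magnitude only.

```python
# Back-of-envelope compute estimate using FLOPs ≈ 6 * params * tokens.
# Peak TFLOPS are published bf16 dense numbers; the 40% MFU is an assumption.

def training_days(params, tokens, n_gpus, peak_tflops, mfu=0.4):
    """Rough wall-clock days: total FLOPs / sustained cluster FLOP rate."""
    total_flops = 6 * params * tokens
    sustained = n_gpus * peak_tflops * 1e12 * mfu
    return total_flops / sustained / 86400

# Shearing pass: roughly like training a ~120B model on ~1B tokens.
print(f"shear, 24x A100 (312 TFLOPS bf16): {training_days(120e9, 1e9, 24, 312):.1f} days")

# Continued pretraining of the resulting ~50B model on ~100B tokens.
# Scales inversely with GPU count, so more GPUs bring it down to weeks.
print(f"50B on 100B tokens, 10x H100 (989 TFLOPS bf16): {training_days(50e9, 100e9, 10, 989):.0f} days")
```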
And yes, princeton-nlp did a few shears of Llama2 7B/13B. They're up on their HuggingFace.
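For reference, those Sheared-LLaMA checkpoints load like any other causal LM; the repo id below is one of the released sizes, check their Hub page for the rest.

```python
# Load one of the publicly released Sheared-LLaMA checkpoints and sample from it.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "princeton-nlp/Sheared-LLaMA-2.7B"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype="auto")

prompt = "Structured pruning of large language models"
out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```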
>confirmation bias
That's true. The model is up on the Kobold Horde if anyone wants to give it a try.
Yes, well, it should perform much higher than that. Turboderp ran MMLU at 3.25bpw and it performed worse than other 70B models. I assume quantization further degrades the spelling consistency.
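For context on what 3.25bpw means for the weight footprint alone, here's some quick arithmetic (using the nominal ~120B parameter count, so approximate):

```python
# Rough weight-memory footprint at different bits-per-weight settings.
# The 120B parameter count is the nominal figure, not exact.
params = 120e9
for bpw in (16, 4.0, 3.25):
    gib = params * bpw / 8 / 2**30
    print(f"{bpw:>5} bpw -> ~{gib:.0f} GiB of weights")
```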