AlpinDale

joined 1 year ago
[–] AlpinDale@alien.top 1 points 1 year ago

Yes, well, it should perform much better than that. Turboderp ran MMLU at 3.25bpw and it was performing worse than other 70B models. I assume quantization further degrades the spelling consistency.

[–] AlpinDale@alien.top 1 points 1 year ago

I used Charles Goddard's mergekit.

[–] AlpinDale@alien.top 1 points 1 year ago (1 children)

It's up on Kobold Horde, you can give it a try yourself. Select the model from the AI menu. I think it's gonna be up for the weekend.

[–] AlpinDale@alien.top 1 points 1 year ago

As I mentioned here, it'd perform poorly on benchmarks until it's gone through a few steps of full finetuning so the weight disagreement is ironed out.

[–] AlpinDale@alien.top 1 points 1 year ago (3 children)

Makes sense the benchmark results would be surprisingly low for goliath. After playing around with it for a few days, I've noticed two glaring issues:

  • it tends to make slight spelling mistakes
  • it hallucinates words

They happen rarely, but frequently enough to throw off benchmarks. I'm quite confident this can be solved by a quick full finetune over 100 or so steps, which would align the layers to work together better.

[–] AlpinDale@alien.top 1 points 1 year ago (1 children)

The shearing process would likely need close to 1 billion tokens of data, so I'd guess a few days on ~24x A100-80G/H100s. And if we get a ~50B model out of it, we'd need to train that on around 100B tokens, which would need at least 10x H100s for a few weeks. Overall, very expensive.
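
For anyone who wants to sanity-check estimates like these, here is a minimal back-of-the-envelope sketch. It assumes the common "about 6 x parameters x tokens" rule of thumb for training FLOPs, vendor dense BF16 peak figures, and a guessed 40% hardware utilization. The function name and the utilization value are illustrative assumptions, not measurements, and the 6ND rule is only an approximation for a pruning run.

```python
def training_days(params, tokens, num_gpus, peak_flops_per_gpu, utilization=0.4):
    """Rough wall-clock days from the ~6 * params * tokens FLOPs rule of thumb."""
    total_flops = 6 * params * tokens
    # Sustained cluster throughput: peak per GPU scaled by an assumed utilization.
    cluster_flops_per_sec = num_gpus * peak_flops_per_gpu * utilization
    return total_flops / cluster_flops_per_sec / 86_400  # seconds per day

# Vendor-quoted dense BF16 peaks (FLOP/s); real throughput is lower.
A100_BF16_PEAK = 312e12
H100_BF16_PEAK = 989e12  # SXM variant

# Shearing pass: ~120B parameters, ~1B tokens, 24 A100s (assumed setup).
shear_days = training_days(120e9, 1e9, 24, A100_BF16_PEAK)

# Continued pretraining: ~50B parameters, ~100B tokens, 10 H100s (assumed setup).
pretrain_days = training_days(50e9, 100e9, 10, H100_BF16_PEAK)

print(f"shearing: {shear_days:.1f} days, continued pretraining: {pretrain_days:.1f} days")
```

The result is very sensitive to the assumed utilization and GPU count, so treat it as an order-of-magnitude check only.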

And yes, princeton-nlp did a few shears of Llama2 7B/13B. It's up on their HuggingFace.

[–] AlpinDale@alien.top 1 points 1 year ago

>confirmation bias
That's true. The model is up on the Kobold Horde if anyone wants to give it a try.

 

A few people here tried the Goliath-120B model I released a while back, and it looks like TheBloke has released the quantized versions now. So far, the reception has been largely positive.

https://huggingface.co/TheBloke/goliath-120b-GPTQ

https://huggingface.co/TheBloke/goliath-120b-GGUF

https://huggingface.co/TheBloke/goliath-120b-AWQ

The fact that the model turned out well is completely unexpected. Every LM researcher I've spoken to about this in the past few days has been completely baffled. The plan moving forward, in my opinion, is to finetune this model (preferably a full finetune) so that the stitched layers get to know each other better. Hopefully I can find the compute to do that soon :D
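
To make "stitched layers" concrete, here is a toy sketch of how a stacked model's layer order can be built from overlapping slices of two donor models. The model names and layer ranges are hypothetical placeholders, not the actual Goliath recipe, and this only builds the layer schedule rather than touching any weights.

```python
def stitch_schedule(slices):
    """Expand (model_name, start, end) slices into a flat list of (model, layer_index)."""
    schedule = []
    for model, start, end in slices:
        schedule.extend((model, i) for i in range(start, end))
    return schedule

# Hypothetical overlapping slices from two 40-layer donor models.
slices = [
    ("model_a", 0, 16),
    ("model_b", 8, 24),
    ("model_a", 16, 32),
    ("model_b", 24, 40),
]

schedule = stitch_schedule(slices)

# Seams: positions where consecutive layers come from different donors.
# These are the places a finetune would need to smooth over.
seams = [i for i in range(1, len(schedule)) if schedule[i][0] != schedule[i - 1][0]]

print(len(schedule), seams)
```

Each seam is a point where a layer receives activations from a layer it was never trained next to, which is the kind of mismatch a full finetune is meant to reduce.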

On a related note, I've been working on LLM-Shearing lately, which would essentially enable us to shear a transformer down to much smaller sizes while preserving accuracy. The reason goliath-120b came to be was an experiment in moving in the opposite direction of shearing. I'm now wondering if we can shear a finetuned Goliath-120B back down to ~70B and end up with a much better 70B model than the existing ones. This would of course be prohibitively expensive, as we'd need to do continued pretraining after the shearing/pruning process. A more likely approach, I believe, is shearing Mistral-7B to ~1.3B and performing continued pretraining on about 100B tokens.

If anyone has suggestions, please let me know. Cheers!