LocalLLaMA


Community for discussing Llama, the family of large language models created by Meta AI.


Goliath-120B - quants and future plans (alien.top)

submitted 2 years ago by AlpinDale@alien.top to c/localllama@poweruser.forum


A few people here tried the Goliath-120B model I released a while back, and it looks like TheBloke has now released quantized versions. So far, the reception has been largely positive.

https://huggingface.co/TheBloke/goliath-120b-GPTQ

https://huggingface.co/TheBloke/goliath-120b-GGUF

https://huggingface.co/TheBloke/goliath-120b-AWQ

The fact that the model turned out good is completely unexpected. Every LM researcher I've spoken to about this in the past few days has been completely baffled. The plan moving forward, in my opinion, is to finetune this model (preferably a full finetune) so that the stitched layers get to know each other better. Hopefully I can find the compute to do that soon :D

On a related note, I've been working on LLM-Shearing lately, which would essentially let us shear a transformer down to much smaller sizes while preserving accuracy. Goliath-120B came out of an experiment in moving in the opposite direction from shearing. I'm now wondering if we can shear a finetuned Goliath-120B back down to ~70B and end up with a much better 70B model than the existing ones. This would of course be prohibitively expensive, as we'd need to do continued pretraining after the shearing/pruning process. A more likely approach, I believe, is shearing Mistral-7B to ~1.3B and performing continued pretraining on about 100B tokens.

If anyone has suggestions, please let me know. Cheers!


[–] multiverse_fan@alien.top 1 points 2 years ago

Goliath was created by merging layers of Xwin and Euryale. (from their model card)

The layer ranges used are as follows:
- range 0, 16 Xwin 
- range 8, 24 Euryale 
- range 17, 32 Xwin 
- range 25, 40 Euryale 
- range 33, 48 Xwin 
- range 41, 56 Euryale 
- range 49, 64 Xwin 
- range 57, 72 Euryale 
- range 65, 80 Xwin
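
As a quick sanity check on where the "120B" comes from, here is a minimal sketch that adds up the layers in the ranges above. It rests on two assumptions that are not stated in the list itself: each range is read as half-open (start included, end excluded), and parameter count is treated as scaling linearly with layer count from an 80-layer, roughly 70B base, ignoring embeddings. The function names are made up for illustration.

```python
# Layer ranges as listed above: (start, end, source model).
# Assumption: each range is half-open, covering layers start .. end-1.
SLICES = [
    (0, 16, "Xwin"),
    (8, 24, "Euryale"),
    (17, 32, "Xwin"),
    (25, 40, "Euryale"),
    (33, 48, "Xwin"),
    (41, 56, "Euryale"),
    (49, 64, "Xwin"),
    (57, 72, "Euryale"),
    (65, 80, "Xwin"),
]


def total_layers(slices):
    """Sum the number of layers contributed by every slice."""
    return sum(end - start for start, end, _ in slices)


def rough_params_billions(n_layers, base_params_b=70.0, base_layers=80):
    """Crude estimate: assumes parameters scale linearly with layer count
    and ignores embeddings and the output head."""
    return base_params_b * n_layers / base_layers


n = total_layers(SLICES)
print(n)                                # 137
print(round(rough_params_billions(n)))  # 120
```

Under those assumptions the stack comes to 137 layers, versus 80 in each 70B parent, which is roughly where the 120B figure lands.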

I'm not sure how the model would be reduced to 70B unless it's by removing layers. Is that what "shearing" is? I don't understand what is being pruned there. Is it layers?