Posted 10 Nov 2023 to LocalLLaMA, a community to discuss Llama, the family of large language models created by Meta AI.

A few people here tried the Goliath-120B model I released a while back, and it looks like TheBloke has now released the quantized versions. So far, the reception has been largely positive.

https://huggingface.co/TheBloke/goliath-120b-GPTQ

https://huggingface.co/TheBloke/goliath-120b-GGUF

https://huggingface.co/TheBloke/goliath-120b-AWQ

The fact that the model turned out well is completely unexpected. Every LM researcher I've spoken to about this in the past few days has been completely baffled. The plan moving forward, in my opinion, is to finetune this model (preferably a full finetune) so that the stitched layers get to know each other better. Hopefully I can find the compute to do that soon :D
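
For anyone wondering what the stitching looks like mechanically, here's a rough Python sketch of a layer-interleaving plan. The layer ranges below are made up for illustration; this is not the actual Goliath-120B recipe.

```python
# Hypothetical sketch of a "frankenmerge" layer plan: alternate slices of
# decoder layers from two donor models into one deeper stack. The ranges
# below are invented for illustration and are NOT the real Goliath recipe.

def interleave_layers(slices_a, slices_b):
    """Build an ordered layer plan from two donors.

    slices_a / slices_b: lists of (start, end) layer indices (end exclusive)
    to take from donor A and donor B, stacked in alternating order.
    """
    plan = []
    for (a0, a1), (b0, b1) in zip(slices_a, slices_b):
        plan += [("donor_a", i) for i in range(a0, a1)]
        plan += [("donor_b", i) for i in range(b0, b1)]
    return plan

# Two 80-layer 70B donors (e.g. Xwin and Euryale) stitched into a deeper stack.
plan = interleave_layers(
    slices_a=[(0, 20), (20, 40), (40, 60), (60, 80)],
    slices_b=[(10, 30), (30, 50), (50, 70), (70, 80)],
)
print(len(plan), "layers in the merged stack")  # 150 with these made-up ranges
```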

On a related note, I've been working on LLM-Shearing lately, which would essentially enable us to shear a transformer down to much smaller sizes while preserving accuracy. Goliath-120B came to be as an experiment in moving in the opposite direction of shearing. I'm now wondering if we can shear a finetuned Goliath-120B back down to ~70B and end up with a much better 70B model than the existing ones. This would of course be prohibitively expensive, as we'd need to do continued pretraining after the shearing/pruning process. A more likely approach, I believe, is shearing Mistral-7B to ~1.3B and performing continued pretraining on about 100B tokens.
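
To give a sense of scale for that last idea, here's a back-of-the-envelope parameter count for a decoder-only transformer. The formula is approximate (it ignores biases, norms and grouped-query attention), and the ~1.3B target shape is an assumed example for illustration, not necessarily the config the Sheared-LLaMA authors used.

```python
# Rough decoder-only transformer parameter count (ignores biases, norms, GQA).

def approx_params(n_layers, d_model, d_ff, vocab_size):
    attn = 4 * d_model * d_model   # Q, K, V and output projections
    ffn = 3 * d_model * d_ff       # gated MLP (SwiGLU-style: up, gate, down)
    emb = vocab_size * d_model     # token embeddings (an untied lm_head would add the same again)
    return n_layers * (attn + ffn) + emb

# Mistral-7B-like shape (32 layers, d_model 4096, d_ff 14336, 32k vocab)
print(f"{approx_params(32, 4096, 14336, 32000) / 1e9:.2f}B")  # ~7.9B with this crude formula

# A hypothetical sheared ~1.3B shape (24 layers, d_model 2048, d_ff 5504)
print(f"{approx_params(24, 2048, 5504, 32000) / 1e9:.2f}B")   # ~1.28B
```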

If anyone has suggestions, please let me know. Cheers!

[–] noeda@alien.top 1 points 1 year ago (2 children)

Just finished the Hellaswag trial runs. First, here's a table from best to worst:

Model name                           0-shot Hellaswag, 400 tests (%)
goliath-120b_q6_k.gguf               69.75
euryale-1.3-l2-70b.Q6_K.gguf         66.5
airoboros-l2-70b-2.2.Q4_K_M.gguf     63.25
xwin-lm-70b-v0.1.Q6_K.gguf           63.0
yi_200k_Q8_0.gguf                    59.21
openchat-3.5_Q8_0.gguf               53.25

The euryale and xwin models are the ones used to Frankenstein together the Goliath model.

The Goliath .gguf was quantized by myself, as was the Yi model. The rest are downloaded from TheBloke.

Even though Goliath shows up as the top model, here is why I don't think you should run off and tell everyone Goliath's the best model ever:

  1. The trials ran 400 random tests from the Hellaswag set, so there is a random element in the final score. When I plugged in the Goliath and Euryale results for 400 trials to compute the probability that Goliath is better at 0-shot Hellaswag than Euryale, I got 84% as the result (97.83% vs. Xwin); a sketch of one way to compute this is shown after this list. 84% is good, but I was hoping it would be more like 99%. In other words, it's possible I randomly got a better result for Goliath simply because it got lucky in the choice of which Hellaswag tests it was asked to complete.

  2. This was the first time I ever tried running more rigorous tests on LLMs rather than eyeballing it so I may have made mistakes.

  3. The numbers can't be compared with the OpenLLM leaderboard (they use N-shot Hellaswag; I forget what N was), and I noticed they also don't line up with the llama.cpp discussion linked there. I expected them not to match the OpenLLM leaderboard, but I can't explain why they don't match the llama.cpp discussion.

  4. Hellaswag is just one benchmark. I looked at the examples inside the tests to see what it's actually asking the models, and I think 0-shot testing is a bit brutal for these models; it might be a bit unfair to them. I thought the Yi model, for example, was supposed to be really good.
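
For point 1: I didn't paste the actual calculation, but below is a minimal sketch of one way to estimate that kind of probability from just the correct counts (69.75% and 66.5% of 400 tests, i.e. 279 vs. 266 correct), assuming independent tests and a flat Beta prior. Treat it as an illustration, not necessarily the exact method I used.

```python
import random

# Estimate P(model A's true 0-shot Hellaswag accuracy > model B's) given
# correct counts out of n independent tests, by sampling from the Beta
# posterior of each accuracy under a uniform Beta(1, 1) prior.

def prob_a_beats_b(correct_a, correct_b, n_tests, draws=100_000, seed=0):
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_a = rng.betavariate(1 + correct_a, 1 + n_tests - correct_a)
        p_b = rng.betavariate(1 + correct_b, 1 + n_tests - correct_b)
        wins += p_a > p_b
    return wins / draws

# Goliath 69.75% vs. Euryale 66.5% over 400 tests -> 279 vs. 266 correct
print(prob_a_beats_b(279, 266, 400))  # roughly 0.84
```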

I would wait until people with more resources can run proper benchmarks to test this out. I don't plan on updating these numbers myself.

BUT. It does look promising. I'm hoping more rigorous benchmarks will give some more confidence.

[–] AlpinDale@alien.top 1 points 1 year ago (2 children)

It makes sense that the benchmark results would be surprisingly low for Goliath. After playing around with it for a few days, I've noticed two glaring issues:

  • it tends to make slight spelling mistakes
  • it hallucinates words

These happen rarely, but frequently enough to throw off benchmarks. I'm very positive this can be solved by a quick full finetune over 100 or so steps, which would align the layers to work better together.

[–] polawiaczperel@alien.top 1 points 11 months ago

Your model is a little breakthrough in local LLMs. What plans do you have right now? Could you try to merge some big models, for example DeepSeek-Coder or Phind? It would be awesome.

[–] noeda@alien.top 1 points 1 year ago (1 children)

Not sure if you misread, but it's actually high, i.e. it's better than the Xwin and Euryale models it's made out of (in this particular quick test).

It beat all the 70B models I tested there, although the gap is not super high.

[–] AlpinDale@alien.top 1 points 1 year ago

Yes, well, it should perform much higher than that. Turboderp ran MMLU at 3.25bpw and it performed worse than other 70B models. I assume quantization further degrades the spelling consistency.

[–] a_beautiful_rhind@alien.top 1 points 1 year ago (1 children)

Surprise!.. Xwin doing poorly among the 70Bs. It does badly when I test it against chat logs too, showing higher perplexity than yi-34b and a gaggle of other models, including base.

[–] randomfoo2@alien.top 1 points 1 year ago (1 children)

It depends on the use case; each model may have its own strengths. I picked Xwin and Airoboros as baseline 70B models for second-language conversational testing, and Xwin outperformed (in human-evaluated testing with a native speaker) a 70B model that had been pre-trained on an additional 100B tokens of said second language. Shocking, to say the least.

[–] a_beautiful_rhind@alien.top 1 points 1 year ago

My test was logs of chats with characters, something that isn't widely publicly available, so it can't be gamed. Xwin has very bad perplexity on those; worse even than codellama-34b.

xwin: 4.876139163970947

Codellama: 4.689054489135742

Same quantization.

70b-base: 3.69110918045044

Euryale-1.3: 3.8607137203216553

Dolphin 2.2 did surprisingly badly: 4.39600133895874, but not as bad as Xwin.
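
(For context on these numbers: perplexity is just the exponential of the mean negative log-likelihood per token, so gaps that look small are still meaningful. The log-probabilities in the sketch below are made-up placeholders.)

```python
import math

# Perplexity = exp(mean negative log-likelihood per token).
def perplexity(token_logprobs):
    """token_logprobs: natural-log probabilities the model assigned to each token."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

print(perplexity([-1.2, -0.4, -2.0, -1.7]))  # ~3.76 on these placeholder values
```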

Obviously perplexity doesn't 100% track with being a good model, but everything about Xwin combined (refusals, the repetition issue, perplexity) puts me off it in a big way.