this post was submitted on 10 Nov 2023

LocalLLaMA

A few people here tried the Goliath-120B model I released a while back, and it looks like TheBloke has released the quantized versions now. So far, the reception has been largely positive.

https://huggingface.co/TheBloke/goliath-120b-GPTQ

https://huggingface.co/TheBloke/goliath-120b-GGUF

https://huggingface.co/TheBloke/goliath-120b-AWQ

The fact that the model turned out well is completely unexpected. Every LM researcher I've spoken to about this in the past few days has been completely baffled. The plan moving forward, in my opinion, is to finetune this model (preferably a full finetune) so that the stitched layers get to know each other better. Hopefully I can find the compute to do that soon :D

On a related note, I've been working on LLM-Shearing lately, which would essentially enable us to shear a transformer down to much smaller sizes while preserving accuracy. The reason goliath-120b came to be was an experiment in moving in the opposite direction of shearing. I'm now wondering if we can shear a finetuned Goliath-120B back down to ~70B and end up with a much better 70B model than the existing ones. This would of course be prohibitively expensive, as we'd need to do continued pretraining after the shearing/pruning process. A more likely approach, I believe, is shearing Mistral-7B to ~1.3B and performing continued pretraining on about 100B tokens.
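
For intuition only, here's a minimal sketch of what naive depth pruning looks like with the transformers library; the actual LLM-Shearing method learns structured pruning masks over layers, heads and hidden dimensions rather than dropping whole layers, and the model name and keep-every-other-layer rule below are just placeholders.

```python
# Hedged illustration: naive depth pruning of a decoder-only transformer.
# LLM-Shearing learns which substructures to remove; this only shows the
# general shape of "shrink the model, then do continued pretraining".
import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

keep = [i for i in range(len(model.model.layers)) if i % 2 == 0]  # toy criterion
model.model.layers = nn.ModuleList(model.model.layers[i] for i in keep)
model.config.num_hidden_layers = len(keep)

# The pruned model then needs continued pretraining (e.g. ~100B tokens)
# to recover the quality lost by removing parameters.
model.save_pretrained("mistral-7b-depth-pruned")
```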

If anyone has suggestions, please let me know. Cheers!

top 41 comments
[–] those2badguys@alien.top 1 points 1 year ago (1 children)

I'm just a lowly end user and spectator. Can someone ballpark how much it'd cost to shear Goliath-120B to 70B, so I can wake up, sip my coffee, then spray it on my monitor and say "good lord, that's rather expensive!"?

Also, how much for a 7B to 1.3B? And has it been done before? How bad is the drop in quality? I mean, older 7B models are not so great to begin with, so the idea of seeing Mistral-7B downsized to 1.3B would be kind of fun and definitely something I want to play with.

[–] AlpinDale@alien.top 1 points 1 year ago (1 children)

The shearing process would likely need close to 1 billion tokens of data, so I'd guess a few days on ~24x A100-80G/H100s. And if we get a ~50B model out of it, we'd need to train that on around ~100B tokens, which would need at least 10x H100s for a few weeks. Overall, very expensive.

And yes, princeton-nlp did a few shears of Llama2 7B/13B. It's up on their HuggingFace.

[–] those2badguys@alien.top 1 points 1 year ago

Thank you kindly for the response.

> a few days on ~24x A100-80G/H100s

I looked at some pricing and did some two handed 10 finger math and estimated it at 12-15 grand?

> 10x H100s for a few weeks

again, just looking at some retail cloud GPU renters, 20-25 grand?

I'm sure you have better things to do with your time so without doing too much on your end, how far off am I on these guesses?
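
(For reference, a hedged back-of-envelope under assumed retail on-demand rates of roughly $3-4 per GPU-hour, which is a guess rather than a quote, lands in the same ballpark:)

```python
# Back-of-envelope cost check; GPU-hour prices and durations are assumptions.
a100_rate, h100_rate = 3.5, 4.0          # assumed USD per GPU-hour, retail cloud
shearing = 24 * a100_rate * 24 * 6       # 24 GPUs for ~6 days  -> ~$12k
pretrain = 10 * h100_rate * 24 * 21      # 10 GPUs for ~3 weeks -> ~$20k
print(f"${shearing:,.0f} + ${pretrain:,.0f} = ${shearing + pretrain:,.0f}")
```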

[–] HollowGalaxy@alien.top 1 points 1 year ago

What hardware do you use to run the training?
and what hardware do you use to run inference?

[–] ReMeDyIII@alien.top 1 points 1 year ago (1 children)

If I were to attempt to fit this on a Runpod instance at $0.79/hr (A6000 48GB VRAM, 50GB RAM, 8 vCPU), what's my best option? Is it even possible?

[–] panchovix@alien.top 1 points 1 year ago

You can run a 3bpw exl2 quant. I did some quants here: https://huggingface.co/Panchovix/goliath-120b-exl2

[–] noeda@alien.top 1 points 1 year ago (2 children)

I've done a bunch of D&D character sheets with this, and yeah, I think it's pretty good. (Still not sure if it's just Euryale though, which looks like it has been trained on that kind of data.)

I would love to see where Goliath ranks in the traditional benchmarks, Hellaswag, Winogrande etc. (has anyone run them yet?) Very curious if this model is strictly better than the two models it was made out of in a more rigorous test.

I'm really hoping the frankensteining method can be shown to genuinely improve the smarts compared to the models it is made out of.

I've been using a Q6 gguf quant I made myself on day 1 and it works well. 1.22 tokens per second on pure CPU with DDR5 memory, using I think around 90GB of RAM.
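
(For anyone who wants to reproduce a pure-CPU run like this, here's a minimal sketch using the llama-cpp-python bindings; the file name, prompt and thread count are placeholders, not the exact setup I used.)

```python
# Minimal CPU inference sketch with llama-cpp-python; paths/params are examples.
from llama_cpp import Llama

llm = Llama(
    model_path="goliath-120b.Q6_K.gguf",  # Q6_K needs on the order of 90GB+ RAM
    n_ctx=4096,
    n_threads=16,       # set to your physical core count
    n_gpu_layers=0,     # pure CPU
)
out = llm("Write one sentence about llamas.", max_tokens=64)
print(out["choices"][0]["text"])
```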

[–] noeda@alien.top 1 points 1 year ago (2 children)

Just finished the Hellaswag trial runs. First, here's a table from best to worst:

Model name                          0-shot Hellaswag, 400 tests (%)
goliath-120b_q6_k.gguf              69.75
euryale-1.3-l2-70b.Q6_K.gguf        66.5
airoboros-l2-70b-2.2.Q4_K_M.gguf    63.25
xwin-lm-70b-v0.1.Q6_K.gguf          63.0
yi_200k_Q8_0.gguf                   59.21
openchat-3.5_Q8_0.gguf              53.25

The euryale and xwin models are the ones used to Frankenstein together the Goliath model.

The Goliath .gguf was quantized by myself, as was the Yi model. The rest are downloaded from TheBloke.

Even though Goliath shows up as the top model, here is why I don't think you should run off and tell everyone Goliath's the best model ever:

  1. The trials ran 400 random tests from the Hellaswag set, so there is a random element in the final score. When I plugged in the Goliath and Euryale results for 400 trials to compute the probability that Goliath is better at 0-shot Hellaswag than Euryale, I got 84% as the result (97.83% vs. Xwin); a quick sketch of that calculation follows this list. 84% is good, but I was hoping it would be more like 99%. In other words, it's possible I randomly got a better result for Goliath simply because it got lucky in the choice of which Hellaswag tests it was asked to complete.

  2. This was the first time I ever tried running more rigorous tests on LLMs rather than eyeballing it so I may have made mistakes.

  3. The numbers can't be compared with the OpenLLM leaderboard (they use N-shot Hellaswag; I forget what N was), and I noticed they also don't line up with the llama.cpp discussion linked there. I expected the leaderboard numbers to differ, but I can't explain why mine don't match the llama.cpp discussion.

  4. Hellaswag is just one benchmark, and looking at examples of what the tests actually ask the models, I think 0-shot testing is a bit brutal for them; it might be a bit unfair. I thought the Yi model, for example, was supposed to be really good.
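
(For the curious, the "84%" figure above can be reproduced with a simple normal approximation to the difference of two proportions; the sketch below just plugs in the table scores and is only an approximation of the comparison.)

```python
# Rough sketch: probability that Goliath's true 0-shot Hellaswag accuracy
# exceeds Euryale's, given 400 tests each (normal approximation).
from math import sqrt
from statistics import NormalDist

n = 400
goliath, euryale = 0.6975, 0.665                 # observed accuracies from the table
se = sqrt(goliath * (1 - goliath) / n + euryale * (1 - euryale) / n)
p_better = NormalDist().cdf((goliath - euryale) / se)
print(f"P(Goliath > Euryale) ≈ {p_better:.0%}")  # ≈ 84%
```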

I would wait until proper benchmarks are run by people with more resources to test this out. I don't plan on updating these numbers myself.

BUT. It does look promising. I'm hoping more rigorous benchmarks will give some more confidence.

[–] a_beautiful_rhind@alien.top 1 points 1 year ago (1 children)

Surprise!.. Xwin doing poorly among the 70Bs. It does badly when I test it against chat logs too; it shows higher perplexity than Yi-34B and a gaggle of other models, including the base model.

[–] randomfoo2@alien.top 1 points 1 year ago (1 children)

It depends on the use case. Each model may have their own strengths. I picked XWin and Airoboros as baseline 70B models for 2nd language conversational testing, and XWin outperformed (in human-evaled testing with a native speaker) a 70B model that had been pre-trained on an additional 100B tokens of said 2nd language. Shocking to say the least.

[–] a_beautiful_rhind@alien.top 1 points 1 year ago

My test was logs of chats with characters, something that isn't widely available publicly, so it can't be gamed. Xwin has very bad perplexity on those, worse than even codellama-34b (same quantization):

xwin: 4.876139163970947

codellama-34b: 4.689054489135742

70b base: 3.69110918045044

Euryale-1.3: 3.8607137203216553

Dolphin 2.2 did surprisingly badly: 4.39600133895874, but not as bad as Xwin.

Obviously perplexity doesn't 100% track with a good model, but all things combined about Xwin (refusals, the repetition issue, perplexity) put me off it in a big way.
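
(For anyone curious how a perplexity comparison like this can be computed, here's a hedged sketch with the transformers library; the model id, log file and non-overlapping 2048-token windows are placeholders and assumptions, not necessarily how the numbers above were produced.)

```python
# Hedged sketch: perplexity of a causal LM over a private text file.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "some-70b-model"                                  # placeholder model id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, device_map="auto", torch_dtype=torch.float16)

ids = tok(open("chat_logs.txt").read(), return_tensors="pt").input_ids
losses = []
for i in range(0, ids.size(1) - 1, 2048):                # 2048-token windows
    chunk = ids[:, i:i + 2048].to(model.device)
    with torch.no_grad():
        losses.append(model(chunk, labels=chunk).loss)   # mean NLL per token
# Averaging per-window means is close enough for a rough comparison.
print("perplexity:", math.exp(torch.stack(losses).mean().item()))
```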

[–] AlpinDale@alien.top 1 points 1 year ago (2 children)

Makes sense that the benchmark results would be surprisingly low for Goliath. After playing around with it for a few days, I've noticed two glaring issues:

  • it tends to make slight spelling mistakes
  • it hallucinates words

These happen rarely, but frequently enough to throw off benchmarks. I'm very positive this can be solved by a quick full finetune over a hundred or so steps, which would align the layers to work better together.

[–] noeda@alien.top 1 points 1 year ago (1 children)

Not sure if you misread, but it's actually high, i.e. better than the Xwin and Euryale models it's made out of (in this particular quick test).

It beat all the 70B models I tested there, although the gap is not super high.

[–] AlpinDale@alien.top 1 points 1 year ago

Yes well it should perform much higher than that. Turboderp ran MMLU at 3.25bpw and it was performing worse than other 70B models. I assume quantization further degrades the spelling consistency.

[–] polawiaczperel@alien.top 1 points 11 months ago

Your model is a little breakthrough in local LLMs. What plans do you have right now? Could you try merging some big models, for example DeepSeek-Coder or Phind? It would be awesome.

[–] FPham@alien.top 1 points 1 year ago

I suspect that it behaves sort of as if you had (fictitious) Xwin and Euryale adapters and applied them as a concatenation ("cat") merge, which sums the ranks (so 2x 256 rank would become 512 rank!) but improves the response only a tiny bit.

But in this case we are summing the "virtual" rank of two 70B models. The model could be a smidgen smarter, but not by much, because huge chunks of the weights overlap. We are probably wasting 80B parameters :) that do not contribute.

A proper test has to compare the merge against both Xwin and Euryale to see the actual result. I've seen it many times with fine-tuning: I attributed a good response to the fine-tune, but once I did an A/B it was mostly due to the prior model, and the fine-tune really added only a tiny bit.

I'm honestly more interested in the opposite direction: making models smaller while maybe losing only a smidgen of knowledge.

[–] silenceimpaired@alien.top 1 points 1 year ago

Please try shearing a Llama-2 70B down to ~30B. Thanks!

[–] ReturningTarzan@alien.top 1 points 1 year ago (1 children)

> If anyone has suggestions, please let me know. Cheers!

The suggestion I'd give, apart from finetuning, would just be to do some actual tests. Construct some scenarios that test the model's ability to "show not tell" and so on, and contrast with smaller models and/or with a "null hypothesis" Frankenstein model where the added layers are just random matrices, etc.

Ideally, if there's nothing you can do to objectively measure the model's performance, try to set up a blind test of some sort to see if users actually prefer the Frankenstein model over the two models it was spliced together from.

Not to disparage the project or anything, but confirmation bias is a real thing, and it's especially rampant in the LLM space.

[–] AlpinDale@alien.top 1 points 1 year ago

> confirmation bias

That's true. The model is up on the Kobold Horde if anyone wants to give it a try.

[–] Cybernetic_Symbiotes@alien.top 1 points 1 year ago (1 children)

This is highly interesting and unintuitive. Have you written down the details of your approach anywhere? Why did you interleave in the manner you did?

Have you tested on GSM8K or DROP? Something I noticed in the recent HFLB update is that a lot of high flying Mistral merges scored poorly on those two benchmarks. DROP scores in particular, plummeted.

[–] AlpinDale@alien.top 1 points 1 year ago

As I mentioned here, it'd perform poorly on benchmarks until it's gone through a few steps of full finetuning so the weight disagreement is ironed out.

[–] Emotional-Dust-1367@alien.top 1 points 1 year ago

Does it perform any different, better or worse, than the source 70B? What’s the advantage?

[–] OnurCetinkaya@alien.top 1 points 1 year ago

Waiting for someone to stitch 100 phi-1.3Bs together :D

[–] andrewlapp@alien.top 1 points 1 year ago (1 children)

"An auto-regressive causal LM created by combining 2x finetuned Llama-2 70B into one."

Wow, fascinating. Could you share the code you used to do this?

[–] AlpinDale@alien.top 1 points 1 year ago

I used Charles Goddard's mergekit.

[–] Distinct-Target7503@alien.top 1 points 1 year ago

I'm wondering what that approach could generate if applied to codellama 34b.

A Frankenstein 2x 34B model might be easier to test, and we have 70B models for reference... Also, imo code generation is a good way to test the behavior of the models and to weed out lucky results that merely "sound right".

[–] Available-Appeal6460@alien.top 1 points 1 year ago (2 children)

Who are you talking about? The main repo has like 15 downloads and TheBloke's quants have 0. We're talking about maybe 2-5 people who downloaded it.

I've seen this model talked about a few times here, and I feel like the person who created it uses smurf accounts to promote it for some reason.

It doesn't have any official benchmarks done either, so it's not like people can even say it's better than anything.

I've seen a few franken-models and every one of them was a marginal upgrade due to just the sheer amount of parameters. A proper finetune should beat it easily.

[–] FullOf_Bad_Ideas@alien.top 1 points 1 year ago (1 children)

Don't trust huggingface download stats at all, they are garbage.

[–] Aaaaaaaaaeeeee@alien.top 1 points 1 year ago

Agreed, it depends on how you download: whether you're logged in, whether you use the official download tool, etc.

[–] AlpinDale@alien.top 1 points 1 year ago (1 children)

It's up on Kobold Horde, you can give it a try yourself. Select the model from the AI menu. I think it's gonna be up for the weekend.

[–] P00PY-PANTS@alien.top 1 points 1 year ago

I've been using it all morning on the Horde since there are a ton of slots open. So far it's been giving me awesome results across a dozen or so different character templates I've tried.

The only exception is that it sometimes gets stuck repeating the same response even if you refresh a dozen times, or repeatedly tacks the same description of a person/scene onto the end of its replies.

For reference, though, both the Xwin and Euryale 70B models do that to me sometimes too, so it might be my settings or something.

I tried the model. Since I do a lot of historical research, I felt like checking it out. I was very disappointed: 1) it answers coldly and succinctly, often getting nervous; 2) even simple answers were 99 percent wrong. For example, I asked questions about Hannibal during the Punic War in Spain, and the results were demoralising.

[–] a_beautiful_rhind@alien.top 1 points 1 year ago

Started downloading the 3.0bpw quant since it fits on 2 GPUs. Now that there are GGUFs, I'll try some bigger ones if the 3-bit does well.

[–] tenmileswide@alien.top 1 points 1 year ago (1 children)

I'm a huge fan of this model. It writes conversationally in a way I've not seen any model do before. More than anything, it's *funny*. It can be sarcastic, it can be witty, it can do alliteration and meter, and most incredibly, it does so with the illusion of being under its own free will. And it shows rather than tells better than any model I've seen before.

[–] Additional-Box-6814@alien.top 1 points 1 year ago

It's like the fucking goguetta of the LLMs 🤣

I am running some standard queries through the Q3 version of this model, and compared to GPT-4 and the Q4 versions of Wizard-70B and Dolphin-2.2-70B, I can definitely see it is doing very well.

[–] Additional-Box-6814@alien.top 1 points 1 year ago

Hi, recently I've been testing several models on OpenRouter, and I have to say that, so far, Goliath-120B is the best non-GPT LLM I've had the opportunity to play with. I would say that, forgiving some issues when using it, it seems close to GPT-3. Of course, I've not been doing in-depth assessments; my opinion is from the perspective of a curious generative AI user.

So dude, congratulations on your work.

By the way, how much time does it take to train a model like this on your machine? $5500 is so cheap compared to what I thought it could be.

[–] Aspie96@alien.top 1 points 1 year ago

Would you consider adding a license to your repo?

Upstream is MIT-licensed.

[–] Illustrious_Sand6784@alien.top 1 points 11 months ago

I'm quite impressed with Goliath so far, so thank you and everyone who helped you make it.

> The plan moving forward, in my opinion, is to finetune this model (preferably a full finetune) so that the stitched layers get to know each other better. Hopefully I can find the compute to do that soon :D

My suggestion is that it would be best to try out a 4-bit QLoRA fine-tune first and see how it performs before spending the money/compute required to do a full fine-tune of such a massive model and having it possibly turn out to be mediocre.
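
(For what it's worth, a minimal sketch of what such a 4-bit QLoRA setup could look like with the peft + bitsandbytes stack; the rank, target modules and other hyperparameters are illustrative guesses, not a tested recipe.)

```python
# Hedged QLoRA sketch: 4-bit NF4 base weights + trainable LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "alpindale/goliath-120b", quantization_config=bnb, device_map="auto")

lora = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the adapters train; the base stays 4-bit
```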

> On a related note, I've been working on LLM-Shearing lately, which would essentially enable us to shear a transformer down to much smaller sizes while preserving accuracy. The reason goliath-120b came to be was an experiment in moving in the opposite direction of shearing. I'm now wondering if we can shear a finetuned Goliath-120B back down to ~70B and end up with a much better 70B model than the existing ones. This would of course be prohibitively expensive, as we'd need to do continued pretraining after the shearing/pruning process. A more likely approach, I believe, is shearing Mistral-7B to ~1.3B and performing continued pretraining on about 100B tokens.

I do think small models have a lot of good uses already and still have plenty of potential, especially ones that were created or fine-tuned for a specific task. I'm also sure we'll get a general purpose 1-10B parameter model that's GPT-4 level within 5 years, but I really don't see any 7B parameter model outperforming a good Llama-2-70B fine-tune before the second half of 2024 unless there's some big breakthrough. So I'd really encourage you to do some more research in this direction, as there's plenty of others working on improving small models, but barely anyone doing research on improving large models.

I know that it requires a lot of money and compute to fine-tune large models, that the time and cost grow with model size, and that a majority of people can't run large models locally. I know those are the main reasons why large models don't get as much attention as smaller ones. But come on: there's a new small base model every week now, while I was stuck with LLaMA-65B for like half a year, then stuck with Llama-2-70B for months, and now the only better model (which might not actually be much better, still waiting for the benchmarks...) was only very recently released and isn't really even a base model, as it's a merge of two fine-tuned Llama-2-70B models. Mistral-70B may not even be available to download and won't be under a free license, and Yi-100B will be completely proprietary and unavailable to download, which leaves no upcoming models besides Llama-3-70B that are likely to outperform Llama-2-70B.

[–] audioen@alien.top 1 points 11 months ago

I tested this model a little bit. It sometimes writes nonsense, like bad words, but I guess that's to be expected given that we have kind of copy-pasted and bunched a couple of closely related models together. It's not too surprising if it sometimes predicts badly.

However, I liked the writing quality a lot. Even after a simple trial run, it is clearly better than the base Llama-2-70B, and by a lot. The base model is kind of tight-lipped and extremely conservative/boring, and it is hard to coax it into writing creative output. However, it might actually be a regression relative to either of the finetuned models it is based on.

[–] multiverse_fan@alien.top 1 points 11 months ago

Goliath was created by merging layers of Xwin and Euryale. (from their model card)

The layer ranges used are as follows:
- range 0, 16 Xwin 
- range 8, 24 Euryale 
- range 17, 32 Xwin 
- range 25, 40 Euryale 
- range 33, 48 Xwin 
- range 41, 56 Euryale 
- range 49, 64 Xwin 
- range 57, 72 Euryale 
- range 65, 80 Xwin

I'm not sure how the model would be reduced to 70B unless it's through removing layers. Is that what "shearing" is? I don't understand what is being pruned there; is it layers?
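
(For intuition about what the merge itself does, here is a hedged conceptual sketch of the interleave listed above, assuming the ranges are Python-style half-open slices; the real merge was done with mergekit's passthrough method and a config file, not code like this, and the repo ids below are my best guesses.)

```python
# Conceptual sketch only: the layer interleave from the model card, expressed as
# concatenating decoder layers from the two source models into one long stack.
import torch.nn as nn
from transformers import AutoModelForCausalLM

xwin = AutoModelForCausalLM.from_pretrained("Xwin-LM/Xwin-LM-70B-V0.1")
eury = AutoModelForCausalLM.from_pretrained("Sao10K/Euryale-1.3-L2-70B")

slices = [(xwin, 0, 16), (eury, 8, 24), (xwin, 17, 32), (eury, 25, 40),
          (xwin, 33, 48), (eury, 41, 56), (xwin, 49, 64), (eury, 57, 72),
          (xwin, 65, 80)]

merged_layers = nn.ModuleList(
    layer
    for model, start, end in slices
    for layer in model.model.layers[start:end]
)
print(len(merged_layers))  # ~137 decoder layers, vs. 80 in a single Llama-2-70B
```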