this post was submitted on 13 Nov 2023

LocalLLaMA


I am talking about this particular model:

https://huggingface.co/TheBloke/goliath-120b-GGUF

I specifically use: goliath-120b.Q4_K_M.gguf

I can run it on runpod.io on this A100 instance at a "humane" speed, but it is way too slow for creating long-form text.

https://preview.redd.it/fz28iycv860c1.png?width=350&format=png&auto=webp&s=cd034b6fb6fe80f209f5e6d5278206fd714a1b10

These are my settings in text-generation-webui:

https://preview.redd.it/vw53pc33960c1.png?width=833&format=png&auto=webp&s=0fccbeac0994447cf7b7462f65d79f2e8f8f1969

Any advice? Thanks

top 13 comments
[–] Good-Biscotti957@alien.top 1 points 2 years ago

Which mode do you use? Chat, chat-instruct or instruct?

[–] whtne047htnb@alien.top 1 points 2 years ago (2 children)

The GGUF one has 140 layers, more than what the textgen UI supports (128). So the slowness may be because you are using CPU for some layers (check your terminal output when loading the model). But you can manually change the source code and set the max value of the n_gpu_layers slider to a higher value (just grep for it).
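To make the effect of the slider cap concrete, here is a minimal sketch of the arithmetic. The layer count (140) and the slider maximum (128) are taken from this comment, not read from the model file, and `split_layers` is a hypothetical helper, not part of text-generation-webui or llama.cpp.

```python
# Illustrative arithmetic only; the numbers come from the comment above.
def split_layers(model_layers: int, requested_gpu_layers: int) -> tuple[int, int]:
    """Return (layers_on_gpu, layers_on_cpu) for a requested offload count."""
    # The request is clamped to the range [0, model_layers].
    on_gpu = max(0, min(requested_gpu_layers, model_layers))
    return on_gpu, model_layers - on_gpu


MODEL_LAYERS = 140  # as reported in the comment above
SLIDER_MAX = 128    # UI slider cap reported in the comment above

for requested in (SLIDER_MAX, 256):
    gpu, cpu = split_layers(MODEL_LAYERS, requested)
    print(f"n_gpu_layers={requested}: {gpu} layers on GPU, {cpu} on CPU")
```

With the slider stuck at 128, some layers stay on the CPU, which would explain the slowdown. Any requested value at or above the model's layer count puts everything on the GPU.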

[–] Evening_Ad6637@alien.top 1 points 2 years ago

This is the only helpful answer, because it is the right one.

[–] kruk2@alien.top 1 points 2 years ago (1 children)

or open the UI, go to model page, right click on the layers slider -> inspect element
and update max value for the input field from 128 to 256

[–] abandonedexplorer@alien.top 0 points 2 years ago (1 children)

Can't believe that worked lol! Thank you so much. The speed increased significantly!

[–] MINIMAN10001@alien.top 1 points 2 years ago

I mean, it makes sense. The values were simply chosen for being a reasonable range at the time.

There was nothing hard-coded about them; they were simply a range of values that had been set for the UI.

It certainly is interesting though.

[–] panchovix@alien.top 1 points 2 years ago (3 children)

Why don't you use exl2? Assuming it's the A100 80GB, you can run up to 5bpw, I think.

I have done quants at 3, 4.5 and 4.85bpw.

https://huggingface.co/Panchovix/goliath-120b-exl2

https://huggingface.co/Panchovix/goliath-120b-exl2-rpcal

I have 2x4090+1x3090, I get 2 t/s on GGUF (all layers on GPU) vs 10 t/s on exllamav2.

[–] abandonedexplorer@alien.top 1 points 2 years ago

Thanks. Will try this. I have no idea how these really work, which is why I am asking :)

[–] Worldly-Mistake-8147@alien.top 1 points 2 years ago (1 children)

Sorry for the slight side-track, but how much context are you able to squeeze into your 3 GPUs with Goliath's 4-bit quant?
I'm considering adding another 3090 to my own double-GPU setup just to run this model.

[–] panchovix@alien.top 1 points 2 years ago (1 children)

I tested 4K and it worked fine at 4.5bpw. Max will prob be about 6K. I didn't use the 8-bit cache.

Now 4.5bpw is kinda overkill, 4.12~ bpw is like 4bit 128g gptq, and that would let you use a lot more context.
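A rough back-of-the-envelope sketch of why a lower bpw frees room for context: weight memory scales as parameters × bits per weight / 8. The parameter count (about 118 billion) is an assumption, as is treating the 2x4090 + 1x3090 setup as 72 GB of usable VRAM. Real usage also includes the KV cache, activations, and per-GPU overhead, none of which is modeled here.

```python
# Rough estimate only; N_PARAMS and VRAM_GB are assumptions, not measured values.
def weight_gb(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bpw / 8 / 1e9


N_PARAMS = 118e9  # assumed parameter count for Goliath 120B
VRAM_GB = 72.0    # assumed: 3 cards at 24 GB each, ignoring overhead

for bpw in (3.0, 4.12, 4.5, 4.85):
    weights = weight_gb(N_PARAMS, bpw)
    print(f"{bpw:>4} bpw: ~{weights:.1f} GB weights, ~{VRAM_GB - weights:.1f} GB left over")
```

Under these assumptions, dropping from 4.5 to about 4.12 bpw frees several GB, which is the headroom that would go to a longer context.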

[–] Dead_Internet_Theory@alien.top 1 points 2 years ago

That is awesome. What kind of platform do you use for that 3-GPU setup?

[–] nero10578@alien.top 1 points 2 years ago

Wait what? I am getting 2-3t/s on 3x P40 running Goliath GGUF Q4KS.

[–] Additional-Box-6814@alien.top 1 points 2 years ago

I use it through openrouterai, at around 200k t/$.