this post was submitted on 13 Nov 2023

LocalLLaMA


Community to discuss about Llama, the family of large language models created by Meta AI.


I am talking about this particular model:

https://huggingface.co/TheBloke/goliath-120b-GGUF

I specifically use: goliath-120b.Q4_K_M.gguf

I can run it on runpod.io on this A100 instance at a tolerable speed, but it is way too slow for generating long-form text.

https://preview.redd.it/fz28iycv860c1.png?width=350&format=png&auto=webp&s=cd034b6fb6fe80f209f5e6d5278206fd714a1b10

These are my settings in text-generation-webui:

https://preview.redd.it/vw53pc33960c1.png?width=833&format=png&auto=webp&s=0fccbeac0994447cf7b7462f65d79f2e8f8f1969

Any advice? Thanks
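
For reference, here is a minimal llama-cpp-python sketch of what full GPU offload of this GGUF looks like outside the webui (the model path and prompt are placeholders; in text-generation-webui the equivalent is setting n-gpu-layers high enough to cover every layer):

```python
from llama_cpp import Llama

# Minimal sketch: load the Q4_K_M GGUF with every layer offloaded to the GPU.
# model_path is a placeholder -- point it at your local copy of the file.
llm = Llama(
    model_path="goliath-120b.Q4_K_M.gguf",
    n_gpu_layers=-1,   # -1 = offload all layers
    n_ctx=4096,        # context window to allocate
)

out = llm("Write the opening paragraph of a short story.", max_tokens=256)
print(out["choices"][0]["text"])
```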

[–] panchovix@alien.top 1 points 1 year ago (5 children)

Why don't you use exl2? Assuming it's the 80GB A100, you can run up to about 5bpw, I think.

I have done quants at 3, 4.5 and 4.85bpw.

https://huggingface.co/Panchovix/goliath-120b-exl2

https://huggingface.co/Panchovix/goliath-120b-exl2-rpcal

On 2x4090 + 1x3090 I get 2 t/s with GGUF (all layers on GPU) vs. 10 t/s with exllamav2.
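
For anyone trying this, a minimal exllamav2 sketch along the lines of its bundled example scripts, assuming the exl2 quant has been downloaded locally (the directory path and sampling values are placeholders):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Placeholder path to a local copy of e.g. Panchovix/goliath-120b-exl2 (pick a bpw branch).
config = ExLlamaV2Config()
config.model_dir = "/models/goliath-120b-exl2-4.5bpw"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)   # lazy cache so autosplit can place layers
model.load_autosplit(cache)                # spread the weights across all visible GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("Once upon a time,", settings, 200))
```

In text-generation-webui, this is roughly what the ExLlamav2 loader does when you point it at the exl2 folder and let it split the model across GPUs.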

[–] Worldly-Mistake-8147@alien.top 1 points 1 year ago (1 children)

I'm sorry for a little side-track, but how much context are you able to squeeze into your 3 GPUs with Goliath's 4-bit quant?
I'm considering adding another 3090 to my own double-GPU setup just to run this model.

[–] panchovix@alien.top 1 points 1 year ago (1 children)

I tested 4K context and it worked fine at 4.5bpw; the max will probably be around 6K. I didn't use the 8-bit cache.

That said, 4.5bpw is kind of overkill. ~4.12bpw is about equivalent to 4-bit 128g GPTQ, and that would let you use a lot more context.
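
A back-of-the-envelope way to see why a lower bpw frees up context: the weights plus the KV cache have to fit in VRAM, and the KV cache grows linearly with context length. A rough sketch, where the layer/head numbers are assumptions chosen to illustrate the formula rather than values read from the Goliath config:

```python
# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes/element.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical Llama-2-style frankenmerge numbers -- check the model's config.json.
n_layers, n_kv_heads, head_dim = 137, 8, 128

for ctx in (4096, 6144, 8192):
    gib = kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx) / 2**30
    print(f"{ctx:>5} tokens: ~{gib:.1f} GiB FP16 cache, ~{gib / 2:.1f} GiB with 8-bit cache")
```

Whatever VRAM the lower-bpw weights leave free is what you have to spend on a bigger cache.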

[–] Dead_Internet_Theory@alien.top 1 points 11 months ago

That is awesome. What kind of platform do you use for that 3-GPU setup?
