LocalLLaMA

A community for discussing Llama, the family of large language models created by Meta AI.

submitted 10 months ago by abandonedexplorer@alien.top to c/localllama@poweruser.forum

I am talking about this particular model:

I specifically use: goliath-120b.Q4_K_M.gguf

I can run it on runpod.io on this A100 instance at a "humane" speed, but it is way too slow for generating long-form text.

These are my settings in text-generation-webui:

Any advice? Thanks

[–] panchovix@alien.top 1 points 10 months ago (5 children)

Why don't you use exl2? Assuming it's the A100 80GB, I think you can run up to 5bpw.

I have done quants at 3, 4.5 and 4.85bpw.

I have 2x4090 + 1x3090 and get 2 t/s on GGUF (all layers on GPU) vs 10 t/s on exllamav2.
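As a rough sanity check on the "up to 5bpw on 80GB" estimate, here is a back-of-the-envelope sketch in Python. It assumes roughly 120 billion parameters (taken from the "120b" in the model name, not a measured count) and a nominal 80 GB of VRAM. It only counts the quantized weights, so the KV cache, activations and loader overhead still have to fit in whatever is left. The function name `weight_size_gb` is made up for illustration.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 120e9  # assumption: read off the "120b" in the model name
VRAM_GB = 80.0    # assumption: nominal capacity of an A100 80GB

# The bpw values are the quant sizes mentioned in this comment, plus 5.0.
for bpw in (3.0, 4.5, 4.85, 5.0):
    size = weight_size_gb(N_PARAMS, bpw)
    headroom = VRAM_GB - size
    print(f"{bpw:>4} bpw: ~{size:.1f} GB weights, ~{headroom:.1f} GB headroom")
```

Under these assumptions 5bpw leaves only a few GB for context, so a lower bpw may be the more practical choice if you want a long context window.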

[–] abandonedexplorer@alien.top 1 points 10 months ago

Thanks, will try this. I have no idea how these really work, which is why I am asking :)
