this post was submitted on 13 Nov 2023
LocalLLaMA
Community to discuss Llama, the family of large language models created by Meta AI.
Sorry for the small side-track, but how much context are you able to squeeze into your 3 GPUs with Goliath's 4-bit quant?
I'm considering adding another 3090 to my own dual-GPU setup just to run this model.
I tested 4K context and it worked fine at 4.5 bpw. The max will probably be around 6K. I didn't use the 8-bit cache.
That said, 4.5 bpw is kind of overkill; ~4.12 bpw is roughly equivalent to 4-bit 128g GPTQ, and that would let you fit a lot more context.
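For anyone wondering why lower bpw frees up context, here's a back-of-the-envelope sketch in Python. The config values (≈118B params, 137 layers, GQA with 8 KV heads, head dim 128) are my assumptions about Goliath-120B, and the totals ignore activation/overhead memory, so treat the numbers as illustrative rather than measured:

```python
# Rough VRAM estimate for weights + KV cache across 3x 24 GB cards.
# Model config values below are assumptions, not taken from the model card.
GB = 1024 ** 3

params      = 118e9   # assumed total parameter count
n_layers    = 137     # assumed layer count of the frankenmerge
n_kv_heads  = 8       # Llama-2-70B-style grouped-query attention
head_dim    = 128
cache_bytes = 2       # fp16 KV cache (would be 1 with an 8-bit cache)

def weight_gb(bpw: float) -> float:
    """Weight memory at a given bits-per-weight."""
    return params * bpw / 8 / GB

def kv_gb(n_tokens: int) -> float:
    """KV cache memory for a given context length (K and V per layer)."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * cache_bytes
    return n_tokens * per_token / GB

total_vram = 3 * 24  # three 24 GB GPUs

for bpw in (4.5, 4.12):
    for ctx in (4096, 6144):
        used = weight_gb(bpw) + kv_gb(ctx)
        print(f"{bpw} bpw, {ctx} ctx: ~{used:.1f} GB of {total_vram} GB "
              f"(before activations/overhead)")
```

Under those assumptions, dropping from 4.5 to ~4.12 bpw saves a few GB of weight memory, which is exactly the headroom the KV cache needs for longer context.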
That is awesome. What kind of platform do you use for that 3-GPU setup?