this post was submitted on 01 Dec 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


I have tried a single 4090 or 3090 to run 13B GGUF at q8 and get 40-45 t/s. It's so fun to play at that speed. When I run a 70B GGUF, I have to activate both cards and only get 5 t/s. Is that a multi-GPU penalty? I know exllamav2 can be a lot faster, but for some unknown reason I can't run exllamav2 with the latest Chinese models in the Oobabooga UI. So upset!
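For reference, here is a minimal sketch of one way to split a GGUF model across two cards with llama-cpp-python; the model filename and the 50/50 split are placeholders, so adjust them for your own setup:

```python
# Minimal sketch: splitting a 70B GGUF across two GPUs with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="llama2-70b-chat.Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=-1,          # offload all layers to the GPUs
    tensor_split=[0.5, 0.5],  # fraction of the model per card (GPU0, GPU1)
    n_ctx=4096,
)
```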

So for those who know and have been using NVLinked 2x3090s: how fast is a 70B GGUF at q4-q8 in tokens/s? Does it simply behave like a single 48GB 3090?
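If anyone wants to report numbers, a rough way to measure tokens/s with llama-cpp-python (this assumes the `llm` object from the sketch above; the prompt is arbitrary):

```python
# Rough tokens/s benchmark for a loaded llama-cpp-python model.
import time

prompt = "Write a short paragraph about NVLink."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} t/s")
```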
