this post was submitted on 01 Dec 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


I have tried a single 4090 or 3090 to run 13B GGUF at q8 and get 40-45 t/s. It's so fun to play at that speed. When I run a 70B GGUF, I have to activate both cards and only get 5 t/s. Is that a multi-GPU penalty? I know exllamav2 can be a lot faster, but for some unknown reason I can't run exllamav2 with the latest Chinese models in the Oobabooga UI. So upset!
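For reference, here is a minimal sketch of one way to split a GGUF model across two cards with llama-cpp-python; the model filename and the 50/50 split are placeholders, so adjust them for your own setup:

```python
# Minimal sketch: splitting a 70B GGUF across two GPUs with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="llama2-70b-chat.Q4_K_M.gguf",  # hypothetical filename
    n_gpu_layers=-1,          # offload all layers to the GPUs
    tensor_split=[0.5, 0.5],  # fraction of the model per card (GPU0, GPU1)
    n_ctx=4096,
)
```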

So for those who know and have been using NVLinked 2x3090s: how fast is a 70B GGUF at q4-q8 in tokens/s? Does it simply behave like a single 48GB 3090?
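If anyone wants to report numbers, a rough way to measure tokens/s with llama-cpp-python (this assumes the `llm` object from the sketch above; the prompt is arbitrary):

```python
# Rough tokens/s benchmark for a loaded llama-cpp-python model.
import time

prompt = "Write a short paragraph about NVLink."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} t/s")
```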
