easyllaama

I have tried a single 4090 or 3090 to run 13B GGUF q8, getting 40-45 t/s. It's so fun to play at that speed. When running a 70B GGUF, I have to activate both cards and only get 5 t/s. Multi-GPU penalty? I know exllamav2 can be a lot better, but it seems I can't run exllamav2 with the latest Chinese models for some unknown reason in the oobabooga UI. So upset!

So for those who know and have been using NVLinked 2x 3090s, how fast is it to run 70B GGUF at q4-q8 in tokens/s? Does it simply behave like a single 48GB 3090?

[–] easyllaama@alien.top 1 points 9 months ago

Try using GGUF; the format likes a single GPU, especially when you have 80GB of VRAM. I think you can run a 70B GGUF with all layers on the GPU.
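For reference, a minimal sketch of what "all layers on the GPU" looks like, assuming the llama-cpp-python bindings as the backend (the model path and prompt are placeholders):

    # Sketch only: assumes llama-cpp-python; the model path is hypothetical.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/llama-70b.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,  # -1 offloads every layer to the GPU
        n_ctx=4096,
    )
    out = llm("Q: Why offload all layers? A:", max_tokens=64)
    print(out["choices"][0]["text"])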

[–] easyllaama@alien.top 1 points 10 months ago

My AMD 7950X3D (16 cores / 32 threads), 64GB DDR5, and a single RTX 4090 can run 13B Xwin GGUF q8 at 45 t/s. With exllamav2, 2x 4090 can run 70B q4 at 15 t/s. The motherboard is an Asus ProArt AM5. For local LLaMA inference, I think you can get similar speeds with RTX 3090s, though in SD the 4090 is about 70% faster.

[–] easyllaama@alien.top 1 points 10 months ago

‘The performance hit is too much on multigpu systems when using GGUF’

I agree. GGUF has a multi-GPU penalty, but it's the friendliest format for Apple silicon. I have the same setup as you: one 4090 can run Xwin 13B at 40 t/s, but with 2 cards present it gets only a quarter of that speed, about 10 t/s. So to keep it fast, I have to pin the CUDA device to a single card while both cards are present (see the sketch below).
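One way to do that pinning is the CUDA_VISIBLE_DEVICES environment variable, which any CUDA program respects. A minimal sketch, assuming llama-cpp-python as the backend (the model path is a placeholder):

    import os

    # Must be set before any CUDA library initializes,
    # i.e., before importing the inference backend.
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU

    from llama_cpp import Llama

    # Placeholder path; with one GPU visible, GGUF avoids the multi-GPU split.
    llm = Llama(model_path="models/xwin-13b.Q8_0.gguf", n_gpu_layers=-1)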

Since GGUF likes a single GPU, those who have a 3090/4090 will find 34B the sweet spot for the format.