I've switched to ExLlamav2 + EXL2 as that lets me run 120B models entirely in 48 GB VRAM (2x 3090 GPUs) at 20 T/s. And even if it's just 3-bit
Wow, can I ask how you got this? I used the same 2x 3090 setup (x16/x16, no NVLink, fresh text generation webui install, 120B goliath-rpcal 3bpw EXL2) and only got 10 T/s.
I got 6-8 T/s when loading 4.5bpw on 3x 3090, and another person got the same speed. Your speed seems to be almost twice as fast as mine, which blows my mind.
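For anyone comparing the two setups, here is a rough back-of-the-envelope sketch of how much VRAM the quantized weights alone would need. It assumes a nominal 120e9 parameters and 24 GB per 3090, and it ignores the KV cache, activations, and framework overhead, so the real headroom is smaller than what it prints. The function name is just for illustration.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 120e9      # assumption: nominal "120B" parameter count
VRAM_PER_GPU_GB = 24  # assumption: nominal 24 GB per RTX 3090

# (bits per weight, number of GPUs) for the two setups discussed above
for bpw, n_gpus in [(3.0, 2), (4.5, 3)]:
    weights_gb = weight_footprint_gb(N_PARAMS, bpw)
    total_vram_gb = n_gpus * VRAM_PER_GPU_GB
    headroom_gb = total_vram_gb - weights_gb
    print(f"{bpw} bpw on {n_gpus}x 3090: weights ~{weights_gb:.1f} GB "
          f"of {total_vram_gb} GB, ~{headroom_gb:.1f} GB left for cache/overhead")
```

Under these assumptions both setups fit with only a few GB to spare, so a context length or cache setting that pushes past that margin would hit an OOM rather than quietly fit.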
Yes, I have set it up not to swap to system RAM. When I tested different GPU allocations multiple times, the VRAM OOM error was reported immediately, so I guess no data is being swapped to system RAM.
I will try the 531.79 driver. Thanks for the information.