Imaginary_Bench_7294

So that really depends. You're talking about running a multi-GPU setup. If the whole model fits in GPU memory, your processor will not be a bottleneck at all. The clock speed of the PCIe bus is independent of the CPU cores unless you're messing with overclocking; that's why they advertise PCIe 3.0, 4.0, 5.0, etc. The PCIe version dictates the bandwidth per lane.
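As a back-of-the-envelope sketch of what "bandwidth per lane" means, here's the per-lane math for the generations above (rates and line coding straight from the PCIe spec; protocol overhead beyond line coding is ignored):

```python
# Per-lane raw rate (GT/s) and line-code efficiency for each PCIe generation.
# PCIe 3.0 onward use 128b/130b encoding.
PCIE = {
    "3.0": (8.0, 128 / 130),
    "4.0": (16.0, 128 / 130),
    "5.0": (32.0, 128 / 130),
}

def lane_bandwidth_gbps(gen: str) -> float:
    """Usable bandwidth per lane in GB/s, ignoring packet overhead."""
    gt_s, efficiency = PCIE[gen]
    return gt_s * efficiency / 8  # divide by 8: GT/s -> GB/s after encoding

for gen in PCIE:
    per_lane = lane_bandwidth_gbps(gen)
    print(f"PCIe {gen}: {per_lane:.2f} GB/s per lane, {16 * per_lane:.1f} GB/s at x16")
```

So a full x16 slot roughly doubles with each generation: ~15.8 GB/s on 3.0, ~31.5 GB/s on 4.0, ~63 GB/s on 5.0.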

That being said, multi-GPU setups do introduce some overhead. If a model is split between GPUs, the PCIe interface becomes a modest bottleneck as they pass data back and forth, and the more GPUs the model is split across, the greater that bottleneck.

If your goal is to run the model locally, your best option is to increase your VRAM as much as you can. The main things to consider are the card's VRAM bandwidth and its capacity. For a 70B 4-bit model you're looking at needing somewhere around 35-40 GB of VRAM.

The model alone will take roughly 35GB, the loader up to another 3GB, and then the full context length of 4096 could spill it over 40GB.
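That estimate is just (parameters × bits per weight ÷ 8) plus overhead, which you can sketch like this (the 3 GB loader figure is the rough number from above, not a measured constant; KV cache growth with context is not modeled):

```python
def vram_estimate_gb(n_params_billion: float, bits_per_weight: float,
                     loader_overhead_gb: float = 3.0) -> float:
    """Rough VRAM need in GB: quantized weights plus loader overhead.
    Context (KV cache) adds more on top and is not included here."""
    weights_gb = n_params_billion * bits_per_weight / 8  # bits -> bytes
    return weights_gb + loader_overhead_gb

# 70B model at 4-bit: 35 GB of weights plus ~3 GB of loader overhead
print(vram_estimate_gb(70, 4))  # 38.0
```

The full 4096-token context then pushes you past 40 GB, which is why two 24 GB cards are a comfortable fit and one is not.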

I run LZLV 70B at 4.65 bit on 2x 3090s and get 4.5+ T/s using ExllamaV2 and the EXL2 format. That is at full context length in chat mode in Oobabooga.

In the default/notebook modes I can get 7+ T/s at full context length.

Now, your power supply may be on the low side to add another card without putting power limits on things. I'll use stock power settings as a reference.

The 4080 is rated to hit 320 W.

The 13700 is rated at 65 W base power (it can draw considerably more under turbo).

Let's add in another 100 watts for SSDs, HDDs, the motherboard, and cooling.

So you're looking at roughly 485 W of draw. You should always keep a minimum of 10-15% headroom; on an 850 W supply, that cuts your usable budget down to 722-765 watts.

That leaves you 237-280w of possible room to play with.
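The headroom math above, as a quick sketch (the 850 W PSU size is my assumption to make the 722-765 W figures work out; the component draws are the stock numbers listed earlier):

```python
# Assumed PSU size; swap in your actual supply's rating.
psu_w = 850

# Stock draw estimates from the discussion above.
draws_w = {"RTX 4080": 320, "i7-13700": 65, "drives/mobo/cooling": 100}
system_draw = sum(draws_w.values())  # 485 W total

# Keep 10-15% headroom: usable budget is 85-90% of the PSU rating.
budget = [int(psu_w * (1 - h)) for h in (0.15, 0.10)]   # [722, 765]
headroom = [b - system_draw for b in budget]            # [237, 280]
print(system_draw, budget, headroom)
```

Run the same numbers with your real PSU rating before buying a second card.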

So it's possible to add another video card to the computer, but you'll have to use GGUF and llama.cpp to split compute between the video card and CPU. That will probably get you 2, maybe 3 T/s at the start, though I don't know about the full 4096 context.
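A minimal llama.cpp invocation for that mixed CPU/GPU setup might look like the following. The binary name and model filename are placeholders for whatever your build and download produce; the key flag is `--n-gpu-layers` (`-ngl`), which controls how many transformer layers get offloaded to VRAM while the rest run on the CPU:

```shell
# Hypothetical paths; point these at your own build and GGUF file.
# -ngl: layers offloaded to the GPU (raise until you run out of VRAM)
# -c:   context length
./llama-cli -m ./models/lzlv-70b.Q4_K_M.gguf -ngl 40 -c 4096 -p "Hello"
```

The more layers you can fit on the GPU, the closer you get to the upper end of that 2-3 T/s estimate.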