nero10578

joined 11 months ago
[–] nero10578@alien.top 1 points 9 months ago

You don’t NEED 3090/4090s. A 3x Tesla P40 setup still streams at reading speed running 120b models.

[–] nero10578@alien.top 1 points 9 months ago (1 children)

Huh, it's not really faster than Tesla P40s then, for some reason.

[–] nero10578@alien.top 1 points 9 months ago (1 children)

There are no new 3090s, so comparing the cost to a new 3090 is pointless; basically all that's left are scalped, overpriced "new" 3090s.

[–] nero10578@alien.top 1 points 9 months ago

Not sure where they got 694GB/s for the Tesla P40; it only has 347GB/s of memory bandwidth.
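
Rough check, assuming the commonly listed 384-bit bus and ~7.23 Gbps effective GDDR5 on the P40:

echo "384 / 8 * 7.23" | bc -l   # ~347 GB/s; 694 looks like that figure doubled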

[–] nero10578@alien.top 1 points 9 months ago (3 children)

What kind of token/s do you get with 2x3090 for the 70B models?

[–] nero10578@alien.top 1 points 9 months ago (2 children)

Dual CPUs would have terrible performance. The processor reads the whole model every time it generates a token, so if you spread half the model onto the second CPU's memory, the first CPU's cores have to read that half through the slow inter-CPU link, and vice versa for the second CPU's cores. llama.cpp would need a way to split the workload across multiple CPUs, like it already does across multiple GPUs, for this to work.
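
If you just want to dodge the cross-socket penalty in the meantime, the usual workaround is to pin the whole process and its memory to one NUMA node. Rough sketch, assuming numactl is installed and the old ./main binary name; the model path and thread count are just placeholders:

# check how many NUMA nodes you have and which cores belong to each
numactl --hardware

# run llama.cpp with its threads and memory allocations locked to node 0 only
numactl --cpunodebind=0 --membind=0 ./main -m models/model.gguf -t 16 -p "test"

Every thread then only touches its local memory, at the cost of using half the machine.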

[–] nero10578@alien.top 1 points 9 months ago (1 children)

A V100 16GB is like $700 on ebay. RTX 3090 24GB can be had for a similar amount.

[–] nero10578@alien.top 1 points 10 months ago

Wait what? I am getting 2-3t/s on 3x P40 running Goliath GGUF Q4KS.

[–] nero10578@alien.top 1 points 10 months ago

Wonder what card you have that’s 20GB?

 

I updated to the latest commit because ooba said it pulls in the latest llama.cpp with improved performance. What I suspect happened is that it now runs more of the math in FP16, because the tokens/s on my Tesla P40 got halved along with the power consumption and memory controller load (the P40 has almost no FP16 throughput, so anything pushed to FP16 crawls).

You can fix this by doing:

git reset --hard 564d0cde8289a9c9602b4d6a2e970659492ad135

to go back to the last commit I verified doesn't kill performance on the Tesla P40. Not sure how to handle this for future updates, so maybe u/Oobabooga can chime in.
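
Until there's a proper fix, a simple habit that helps (plain git, nothing ooba-specific; the file name is just an example) is to note the commit you're on before every update so you can always roll back:

# inside the text-generation-webui folder, before pulling an update
git rev-parse HEAD >> known_good_commits.txt

# if the new build tanks performance, roll back to the last recorded commit
git reset --hard $(tail -n 1 known_good_commits.txt)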

 

I have an Asus X99-E-10G WS board with an Intel Xeon E5-2679 V4. I know the CPU has 40 PCIe lanes and supports IOMMU, so passing GPUs through in Proxmox is trivial.

This board uses PLX chips to split 32 PCIe 3.0 lanes across its 7 PCIe slots, which can run either all at x8 or 4 of them at x16. I have passed multiple GPUs through to VMs on this board without issues before, but I just got a Mellanox ConnectX-3 FCBT to connect to my NAS, and it seems to be causing issues with passing through a GPU that sits on the same PLX chip as the Mellanox card.

The Tesla P100 I am trying to pass through is plugged into a PCIe slot behind the second PLX chip, and the Mellanox card is plugged into another slot behind that same PLX chip. This causes a Code 10 error in Windows Device Manager saying there are not enough resources to start the API, and the GPU won't start and can't be used by the driver.

I have Above 4G Decoding, Virtualization, VT-d and ACS enabled in the BIOS, and CSM disabled, and it still does not work. It only works if I plug the Tesla P100 into a slot behind the first PLX chip while the Mellanox card stays on the second PLX chip. That's a problem because it effectively reduces the number of PCIe slots available for GPUs on the board.
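
In case it helps to see how the host groups these devices, this is the standard sysfs walk I run on the Proxmox host to list IOMMU groups (stock sysfs and lspci, nothing board-specific):

# list every IOMMU group and the PCI devices inside it
for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    lspci -nns "${d##*/}"
  done
done

Devices behind the same PLX switch often end up in one group unless ACS separates them, so this at least shows whether the P100 and the ConnectX-3 are being lumped together.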

Is this fixable or just an inherent behaviour of Mellanox 40G cards? Thanks for any help.

[–] nero10578@alien.top 1 points 10 months ago

Definitely thought this was for his homelab