nero10578

joined 11 months ago
[–] nero10578@alien.top 1 points 9 months ago

You don’t NEED 3090/4090s. A 3x Tesla P40 setup still streams at reading speed running 120b models.

[–] nero10578@alien.top 1 points 9 months ago (1 children)

Huh, it's not really faster than Tesla P40s then, for some reason.

[–] nero10578@alien.top 1 points 9 months ago (1 children)

There are no new 3090s, so comparing the cost to a new 3090 is pointless; basically all that's left are scalped, overpriced "new" 3090s.

[–] nero10578@alien.top 1 points 9 months ago

Not sure where they got 694GB/s for the Tesla P40; it only has 347GB/s of memory bandwidth.
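
Rough check, assuming the commonly listed 384-bit bus and ~7.23 Gbps effective GDDR5 on the P40:

echo "384 / 8 * 7.23" | bc -l   # ~347 GB/s; 694 looks like that figure doubled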

[–] nero10578@alien.top 1 points 9 months ago (3 children)

What kind of token/s do you get with 2x3090 for the 70B models?

[–] nero10578@alien.top 1 points 9 months ago (2 children)

Dual CPUs would have terrible performance. The processor reads the whole model every time it generates a token, so if you spread half the model onto the second CPU's memory, the first CPU's cores have to read that half through the slow inter-CPU link, and vice versa for the second CPU's cores. llama.cpp would need a way to split the workload across multiple CPUs, like it already does across multiple GPUs, for this to work.
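
If you just want to dodge the cross-socket penalty in the meantime, the usual workaround is to pin the whole process and its memory to one NUMA node. Rough sketch, assuming numactl is installed and the old ./main binary name; the model path and thread count are just placeholders:

# check how many NUMA nodes you have and which cores belong to each
numactl --hardware

# run llama.cpp with its threads and memory allocations locked to node 0 only
numactl --cpunodebind=0 --membind=0 ./main -m models/model.gguf -t 16 -p "test"

Every thread then only touches its local memory, at the cost of using half the machine.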

[–] nero10578@alien.top 1 points 9 months ago (1 children)

A V100 16GB is like $700 on ebay. RTX 3090 24GB can be had for a similar amount.

[–] nero10578@alien.top 1 points 10 months ago

Wait what? I am getting 2-3t/s on 3x P40 running Goliath GGUF Q4KS.

[–] nero10578@alien.top 1 points 10 months ago

Wonder what card you have that’s 20GB?

 

I updated to the latest commit because ooba said it pulls in the latest llama.cpp with improved performance. What I suspect happened is that it now runs more of the math in FP16, because the tokens/s on my Tesla P40 got halved along with the power consumption and memory controller load (the P40 has almost no FP16 throughput, so anything pushed to FP16 crawls).

You can fix this by doing:

git reset --hard 564d0cde8289a9c9602b4d6a2e970659492ad135

to go back to the last commit I verified doesn't kill performance on the Tesla P40. Not sure how to handle this for future updates, so maybe u/Oobabooga can chime in.
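
Until there's a proper fix, a simple habit that helps (plain git, nothing ooba-specific; the file name is just an example) is to note the commit you're on before every update so you can always roll back:

# inside the text-generation-webui folder, before pulling an update
git rev-parse HEAD >> known_good_commits.txt

# if the new build tanks performance, roll back to the last recorded commit
git reset --hard $(tail -n 1 known_good_commits.txt)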

 

I have an Asus X99-E-10G WS board with an Intel Xeon E5-2679 V4. I know the CPU has 40 PCIe lanes and supports IOMMU, so passing GPUs through in Proxmox is trivial.

This board uses PLX chips to split 32 PCIe 3.0 lanes across its 7 PCIe slots, which can run either all at x8 or 4 of them at x16. I have passed multiple GPUs through to VMs on this board without issues before, but I just got a Mellanox ConnectX-3 FCBT to connect to my NAS, and it seems to be causing issues with passing through a GPU that sits on the same PLX chip as the Mellanox card.

The Tesla P100 I am trying to pass through is plugged into a PCIe slot behind the second PLX chip, and the Mellanox card is plugged into another slot behind that same PLX chip. This causes a Code 10 error in Windows Device Manager saying there are not enough resources to start the API, and the GPU won't start and can't be used by the driver.

I have Above 4G Decoding, Virtualization, VT-d and ACS enabled in the BIOS, and CSM disabled, and it still does not work. It only works if I plug the Tesla P100 into a slot behind the first PLX chip while the Mellanox card stays on the second PLX chip. That's a problem because it effectively reduces the number of PCIe slots available for GPUs on the board.
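
In case it helps to see how the host groups these devices, this is the standard sysfs walk I run on the Proxmox host to list IOMMU groups (stock sysfs and lspci, nothing board-specific):

# list every IOMMU group and the PCI devices inside it
for g in /sys/kernel/iommu_groups/*; do
  echo "IOMMU group ${g##*/}:"
  for d in "$g"/devices/*; do
    lspci -nns "${d##*/}"
  done
done

Devices behind the same PLX switch often end up in one group unless ACS separates them, so this at least shows whether the P100 and the ConnectX-3 are being lumped together.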

Is this fixable or just an inherent behaviour of Mellanox 40G cards? Thanks for any help.

[–] nero10578@alien.top 1 points 10 months ago

Definitely thought this was for his homelab