this post was submitted on 18 Nov 2023

LocalLLaMA

Community to discuss Llama, the family of large language models created by Meta AI.

Was wondering if there's any way to use a bunch of old equipment like this to build an at-home crunch center for running your own LLM, and whether it would be worth it.

top 10 comments
[–] Material1276@alien.top 1 points 10 months ago

Another consideration: I was told by someone with multiple cards that if you split your layers across them, the cards don't all process their layers simultaneously.

So, if you are on 3x cards, you don't get a parallel performance benefit of all cards working at the same time. It processes layers on card 1, then card 2, then card 3.

The slowest card will obviously set the floor for speed. I'm not sure what this does to model load times or your electricity bill, and there's also the fact that you need a system big enough to fit them all in.
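
For anyone curious what that kind of split looks like in practice, here's a minimal sketch using llama-cpp-python's tensor_split option. The model path and split ratios are made-up placeholders, not the poster's setup; the point is that the layers are divided across the cards, which still run one after another rather than in parallel.

```python
# Sketch: splitting a GGUF model's layers across three GPUs with
# llama-cpp-python. The model path and split ratios are placeholders.
# With a layer split the cards process their layers in sequence, so the
# slowest card sets the pace, as described above.
from llama_cpp import Llama

llm = Llama(
    model_path="models/some-model.Q4_K_M.gguf",  # hypothetical file
    n_gpu_layers=-1,               # offload every layer to the GPUs
    tensor_split=[0.4, 0.3, 0.3],  # rough share of layers per card
    n_ctx=4096,
)

print(llm("Q: Why is the sky blue? A:", max_tokens=64)["choices"][0]["text"])
```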

[–] Murky-Ladder8684@alien.top 1 points 10 months ago

That series of Nvidia GPUs didn't have tensor cores yet; I believe those started with the 20xx series. I'm not sure how much that matters for inference vs. training/fine-tuning, but it's worth doing more research. From what I gathered, the answer is "no" unless you use a 10xx card for something like monitor output, TTS, or another small co-LLM job that you don't want taking VRAM away from your main LLM GPUs.
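
As a rough illustration of the "keep the 10xx for side jobs" idea, here's a hedged sketch that pins a small helper model to a specific GPU index with Hugging Face transformers. The device numbers and the distilgpt2 stand-in model are assumptions, not a tested setup.

```python
# Sketch: keep a small helper model (TTS, a tiny LLM, etc.) on an old
# 10xx card so it doesn't eat VRAM on the main inference GPUs.
# Device index and model name are illustrative assumptions.
import torch
from transformers import pipeline

# Suppose the main cards are cuda:0/cuda:1 and the old Pascal card is cuda:2.
helper = pipeline(
    "text-generation",
    model="distilgpt2",       # stand-in for any small side model
    device=2,                 # pin it to the spare card
    torch_dtype=torch.float16,
)

print(helper("Summarize: old GPUs can still do light work.",
             max_new_tokens=30)[0]["generated_text"])
```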

[–] BlissfulEternalLotus@alien.top 1 points 10 months ago

I wish they'd come up with some add-on tensor chips that could work with old laptops.

Currently, 7B is the only model size we can run comfortably. Even 13B is slower and needs quite a bit of tweaking.
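
A quick back-of-the-envelope calculation shows why 7B is comfortable and 13B gets tight on older hardware. The bits-per-weight figures below are rough approximations for common quant formats, not measurements.

```python
# Rough sketch of quantized weight sizes; numbers are approximations.
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB for a quantized model."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

for params in (7, 13):
    for bits in (4.5, 8):  # ~Q4_K_M and ~Q8_0 effective bits per weight
        print(f"{params}B @ ~{bits} bits/weight ≈ "
              f"{model_size_gb(params, bits):.1f} GB (plus context/KV-cache overhead)")
```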

[–] 512DuncanL@alien.top 1 points 10 months ago (1 children)

You might as well use the cards if you have them already. I'm currently getting around 5-6 tokens per second running nous-capybara 34B q4_k_m on a 2080 Ti 22GB and a P102 10GB (basically a semi-lobotomized 1080 Ti). The P102 does bottleneck the 2080 Ti, but hey, at least it runs at a near-usable speed! If I try running on the CPU (I have an R9 3900) I get something closer to 1 token per second.
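
For reference, tokens-per-second figures like these can be measured with a simple timing check. This sketch uses llama-cpp-python with a placeholder model path and just divides generated tokens by wall-clock time.

```python
# Sketch: time one generation and report tokens per second.
# Model path and settings are placeholders.
import time
from llama_cpp import Llama

llm = Llama(model_path="models/nous-capybara-34b.Q4_K_M.gguf",  # hypothetical path
            n_gpu_layers=-1, n_ctx=2048)

start = time.perf_counter()
out = llm("Explain speculative decoding in one paragraph.", max_tokens=128)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```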

[–] FullOf_Bad_Ideas@alien.top 1 points 10 months ago (1 children)

How did you get your 2080 ti to 22gb of VRAM?

[–] 512DuncanL@alien.top 1 points 10 months ago

Modded cards are quite easy to obtain in China.

[–] WaterPecker@alien.top 1 points 10 months ago

Hopefully the proposed S-LoRA approach will let us do more with less.
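
S-LoRA itself is about batching many adapters at serving time, but the underlying "one shared base model, many cheap adapters" idea can be sketched with the peft library. The model and adapter paths below are hypothetical placeholders, and this is only the plain adapter-swapping pattern, not S-LoRA's batched serving.

```python
# Sketch: one base model, multiple LoRA adapters swapped on top of it,
# instead of loading a full fine-tuned model per task.
# Model name and adapter paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"  # placeholder base model
base = AutoModelForCausalLM.from_pretrained(base_id)
tok = AutoTokenizer.from_pretrained(base_id)

model = PeftModel.from_pretrained(base, "path/to/lora-adapter-a")  # placeholder
model.load_adapter("path/to/lora-adapter-b", adapter_name="b")     # placeholder
model.set_adapter("b")  # switch which adapter is active for the next request
```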

[–] candre23@alien.top 1 points 10 months ago

The ONLY Pascal card worth bothering with is the P40. It's not fast, but it's the cheapest way to get a whole bunch of usable VRAM. Nothing else from that generation is worth the effort.

[–] croholdr@alien.top 1 points 10 months ago (1 children)

I tried it. I got something like 1.2 tokens/s inference on Llama 70B with a mix of cards (but four 1080s). The process would crash occasionally. Ideally every card would have the same VRAM.

Going to try it with 1660 Tis. I think they may be the 'sweet spot' for power-to-price-to-performance.
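
If the cards don't all have the same VRAM, one common workaround is to size the layer split to each card's memory. This tiny sketch (with an invented card list) derives the proportions you'd pass as tensor_split in the example further up the thread.

```python
# Sketch: derive layer-split proportions from each card's VRAM so a mixed
# set of GPUs shares the model roughly in proportion to what it can hold.
# The card list is illustrative.
cards = [("GTX 1080 Ti", 11), ("GTX 1080", 8), ("GTX 1080", 8), ("GTX 1070", 8)]

total_vram = sum(vram for _, vram in cards)
tensor_split = [round(vram / total_vram, 3) for _, vram in cards]

print(tensor_split)  # e.g. [0.314, 0.229, 0.229, 0.229]
```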

[–] FullOf_Bad_Ideas@alien.top 1 points 10 months ago

Did you use some Q3 GGUF quant with this?