Prudent-Artichoke-19

joined 10 months ago
[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago

I wrote an upscaler that runs super great on CPU. However, I have not tried ARM. Hit me up and I'll look into it.

[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago

It just sucks because the sweet spot is 48GB, but a single 48GB card is at least $3k USD.

At $1k you'll be stuck at 24GB for a single card.

[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago (1 children)

You need a load balancer of some sort, but an A6000 would be a good start. Expect 15-20 tps as a single user.
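
To make "a load balancer of some sort" concrete, here's a minimal round-robin sketch, assuming two OpenAI-compatible inference backends; the hostnames, model name, and payload are hypothetical placeholders, not anything specified in the thread:

```python
import itertools
import requests

# Hypothetical OpenAI-compatible backends (e.g. two vLLM or TGI instances).
BACKENDS = itertools.cycle([
    "http://10.0.0.11:8000/v1/completions",
    "http://10.0.0.12:8000/v1/completions",
])

def complete(prompt: str, max_tokens: int = 256) -> str:
    # Naive round-robin: each request goes to the next backend in the ring.
    url = next(BACKENDS)
    payload = {"model": "llama-2-70b", "prompt": prompt, "max_tokens": max_tokens}
    resp = requests.post(url, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]
```

A real deployment would use nginx/HAProxy or the serving framework's own scheduler, but the idea is the same: spread concurrent users across GPUs.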

In vanilla form, Llama 2 may do silly stuff; instruct versions, fine-tuning, etc. will decrease the likelihood.

If you are taking something to prod, I'd advise picking up a consultant to work with you.

[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago (2 children)

You can cluster 3 16GB Arc A770 GPUs. That's 48GB and modern.

[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago

One or two A6000s can serve a 70B with decent tps for 20 people. You can run a swarm using Petals and just add a GPU as needed. LLM sharding can be pretty useful.
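
For reference, the Petals client side is only a few lines; a minimal sketch, assuming a reachable swarm (public or your own) and using the model name as an example:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"  # example; any Llama-2-70B variant with Petals servers works

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Only the embeddings load locally; the transformer blocks run on the swarm's GPUs.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

input_ids = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
output_ids = model.generate(input_ids, max_new_tokens=16)
print(tokenizer.decode(output_ids[0]))
```

Growing the swarm is just starting another `python -m petals.cli.run_server <model>` process on whichever box has the new GPU.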

[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago

Well, my issues with the P40 are mainly that it has essentially no usable FP16 throughput, memory bandwidth is super low at only ~346 GB/s, and the card gets longer once you strap on an aftermarket blower fan (probably from eBay). I do agree it's a big-brain move for budget builders though.

My issue with 24GB cards mainly stems from transferring data between two cards over PCIe. We know a 70B on a single 48GB card will consistently outperform the same model split across 2x 24GB. Again, the difference is fairly negligible if you only have the budget for a dual-3090 build or something.

I do work extensively with OpenVINO and ONNX as a software developer, so I'm not too worried about any issues with the platforms working together (I've managed to make them play nice one way or another for most things). This is actually why I was leaning more toward the dual Xeon Platinums or Golds instead of the Epyc/Threadripper deal. PCIe lanes are plentiful either way though.
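
As a rough illustration of that interop (not the author's actual pipeline), OpenVINO can consume an ONNX file directly and compile it for CPU or for an Intel GPU like the A770; the model path and input shape below are placeholders:

```python
import numpy as np
from openvino.runtime import Core

core = Core()
# "encoder.onnx" is a placeholder; any ONNX export from PyTorch/HF loads the same way.
model = core.read_model("encoder.onnx")
compiled = core.compile_model(model, device_name="CPU")  # "GPU" would target an Arc card

# Dummy input with a placeholder shape, just to show the call pattern.
dummy = np.zeros((1, 3, 224, 224), dtype=np.float32)
result = compiled([dummy])
```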

For the P920, the goal would mainly be to run a q4 or q5 70B on the 48GB card, while auxiliary models like embedding fetchers, semantic stuff, QA, etc. would sit on something like an A770 because of its specs-to-price ratio. I don't really need the RAM, and I figured I wouldn't need more than 64GB dedicated to ML functions, since even AVX-512 won't make up for the slowness of running anything larger on CPU, imo.
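
The thread doesn't say which runtime would host the q4/q5 70B; as one common option, llama-cpp-python can offload a quantized GGUF entirely onto the 48GB card. A sketch, with the file path as a placeholder:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b-chat.Q4_K_M.gguf",  # placeholder path to a q4 quant
    n_gpu_layers=-1,  # offload every layer to the GPU; a q4/q5 70B fits in 48GB
    n_ctx=4096,
)

out = llm("Q: What is the capital of France? A:", max_tokens=32, stop=["\n"])
print(out["choices"][0]["text"])
```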

I can only see myself having more than two cards in a machine working together if I could include NVLink or something.

Eventually most of the things I make will be going to prod, so I also need to keep in mind that I'm more likely to get a good deal on cloud Xeons like Sapphire Rapids plus a single big card than on an Epyc with many smaller cards.

[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago (2 children)

You mention the CPU only for the PCIe lanes, right? I don't want to run AVX-512 any more than I have to.

All in all, the best deal under $6k I've found is:

A Lenovo P920 with dual 3rd-gen Xeon Platinums and a 48GB RTX A6000. That should include about 256GB of DDR4.

Sure, I could go RTX 8000 and use the remaining thousand to buy an Arc A770 for 16GB-and-under models to improve the agent, but I do prefer to stay as close to modern as possible with the GPU features.

I'm assuming you were thinking P40s with the recommendations you've made?

[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago (4 children)

What worthwhile multi-card inference server is $6000 USD or less?

[–] Prudent-Artichoke-19@alien.top 1 points 10 months ago

I'd eat my arms for LLM sharding in Node/Bun.

[–] Prudent-Artichoke-19@alien.top 1 points 10 months ago

I feel like I woke up one day and "open" meant "closed".

[–] Prudent-Artichoke-19@alien.top 1 points 10 months ago

Distributed inference IS indeed slower, BUT it's definitely not too slow for production use. I've used it, and with a proper cluster it's still faster than GPT-4.
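
For what it's worth, a quick way to sanity-check whether a distributed setup is "fast enough" is to time tokens/sec around a single generate call; a sketch, assuming a HF/Petals-style `model.generate` interface:

```python
import time

def tokens_per_second(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> float:
    # Wall-clock throughput of one generation; rough, but fine for comparing setups.
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    start = time.time()
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    elapsed = time.time() - start
    new_tokens = output_ids.shape[-1] - input_ids.shape[-1]
    return new_tokens / elapsed
```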

[–] Prudent-Artichoke-19@alien.top 1 points 10 months ago

I've used Petals a ton.
