The MI25 is finally getting the love it deserves. I wish I had bought more when they were $65-$70 a few months ago, but I was hoping they would go lower. Even a month or so ago, I think I saw them at $90. Right now, I just checked before posting, the seller with the most stock is selling them for $160. Crazy.
By the way, the one I got is in really good shape. As in really good. If the seller told me they were new, I would believe it. There's not a speck of dust on it. Like nowhere, and I looked deep into the fins of the heatsink. Even the fingers on the slot looked basically new.
I don't think that's true across the board. It really depends what you do with it. I can think of a couple of uses off the top of my head where 8 GPUs sitting on janky PCIe 1x links would be fine.
Use them as a team. Nothing says you can only use them to infer one large model. You can run 8 7B-13B models, one model per card. The 1x speed wouldn't really matter once the models are loaded. Having a team of small models run instead of 1 large model is a valid way to go.
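Something like this, as a rough sketch: one server process per card. I'm assuming a ROCm box where `HIP_VISIBLE_DEVICES` pins a process to a single GPU; the `llama-server` binary, model paths, and port scheme here are placeholders, not anything specific.

```python
# Sketch: launch one small-model server per GPU.
# Assumptions/placeholders: llama.cpp's llama-server binary, GGUF model
# paths, and HIP_VISIBLE_DEVICES (ROCm) to pin each process to one card.
import os
import subprocess

MODELS = [f"/models/small-{i}.gguf" for i in range(8)]  # placeholder paths

procs = []
for gpu, model in enumerate(MODELS):
    env = os.environ.copy()
    env["HIP_VISIBLE_DEVICES"] = str(gpu)  # this process sees only one card
    procs.append(subprocess.Popen(
        ["llama-server", "-m", model, "--port", str(8080 + gpu)],
        env=env,
    ))

for p in procs:
    p.wait()
```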
Batch process 8 different prompts on a large model spread across the GPUs. Since inference is sequential, only 1 GPU is active at a time when processing a single prompt. The other 7 GPUs are idle. Don't let them idle. Batch it. Process 8 or more prompts at the same time. Once the batch is full, all 8 GPUs will be running. The t/s for any one prompt won't be fast, but the overall throughput in t/s across all the prompts will be. It's best to keep the prompts coming, and thus the batch full, so all GPUs keep running. A good application for this is a server that is inferring multiple prompts from multiple users. Or multiple prompts from the same user. Or the same prompt 8 different times, since you can ask the same model the same question 8 times and get 8 different answers. Let it process the prompt 8 times and pick the best answer.
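Client-side, that can be as simple as keeping 8 requests in flight at once. A minimal sketch, assuming an OpenAI-compatible server that does continuous batching (vLLM-style); the endpoint URL and model name are placeholders:

```python
# Sketch: keep the batch full by firing prompts concurrently at a server
# that does continuous batching. URL and model name are placeholders.
import concurrent.futures
import json
import urllib.request

URL = "http://localhost:8000/v1/completions"  # placeholder endpoint

def complete(prompt: str) -> str:
    body = json.dumps({"model": "my-model", "prompt": prompt,
                       "max_tokens": 256}).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# 8 in-flight requests keep all 8 GPUs busy once the server batches them.
prompts = [f"Question variant {i}" for i in range(8)]
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for answer in pool.map(complete, prompts):
        print(answer)
```

The same loop covers the best-of-8 idea: submit the identical prompt 8 times and keep whichever answer you like best.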
There are techniques that can allow inference itself to be parallelized. Those may run great on a mining rig with 8 GPUs.
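Pipeline parallelism is one of those techniques: split the layers across cards and feed micro-batches through so every stage has work to do. A toy sketch of the idea, not a real implementation; it falls back to CPU if two GPUs aren't present:

```python
# Toy pipeline-parallel sketch: the model's layers are split across two
# devices and micro-batches flow through stage by stage. A real pipeline
# overlaps the stages with async streams/queues; this toy only shows the
# layer split and the micro-batch flow. Illustrative only.
import torch
import torch.nn as nn

two_gpus = torch.cuda.device_count() >= 2  # ROCm builds also report "cuda"
d0 = torch.device("cuda:0" if two_gpus else "cpu")
d1 = torch.device("cuda:1" if two_gpus else "cpu")

stage0 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to(d0)  # first half
stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to(d1)  # second half

batch = torch.randn(32, 512)
outputs = []
with torch.no_grad():
    for mb in batch.chunk(8):  # 8 micro-batches flow through the pipe
        hidden = stage0(mb.to(d0))          # stage 0 on device 0
        outputs.append(stage1(hidden.to(d1)).cpu())  # stage 1 on device 1

print(torch.cat(outputs).shape)  # torch.Size([32, 512])
```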
So it's far from useless to repurpose an old mining rig. You just have to be creative.