[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

Yes. I've done that before on my other machines. Llama.cpp in fact defaults to that. The hope for me was that since the models are sparse, the OS would cache the relevant parts of the models in RAM. So the first run through would be slow, but subsequent runs would be fast since those pages are cached in RAM. How well that works really depends on how much RAM the OS is willing to use to cache the mmapped pages and how smartly it does it. My hope was that if it handled sparse data smartly it would be pretty fast. So far, my hopes haven't been realized.
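For what it's worth, here's a minimal sketch of the idea, assuming the llama-cpp-python binding and a hypothetical model path (plain llama.cpp behaves the same way with its mmap default):

```python
# Sketch: lean on mmap + the OS page cache instead of loading the whole model.
# Assumes llama-cpp-python is installed; the model path is hypothetical.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7b-q4_k_m.gguf",  # hypothetical GGUF file
    use_mmap=True,    # the default: map the file rather than copy it into RAM
    use_mlock=False,  # don't pin pages; let the OS decide what stays cached
)

for run in range(2):
    t0 = time.time()
    llm("Why is the sky blue?", max_tokens=64)
    print(f"run {run}: {time.time() - t0:.1f}s")
```

If the page cache cooperates, the second run should be noticeably faster than the first. If the OS evicts the mapped pages, it won't be, which matches what I've been seeing.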

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

It's a perfectly sensible comparison. If anything, it's far easier to upgrade the RAM on a 3090 than on an M-series Mac.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

Definitely. It's a much better way to do it for a variety of reasons, not least of which is that the kernel patch is kernel dependent, so it will need to be kept up to date. Setting this system variable isn't. Unless Apple removes it, it should keep working in future releases of macOS.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago (1 children)

You replace it with a number of MB, not a percentage. The program that patched the kernel took a percentage; this takes a number of MB. So for example, 30000 would be 30GB. A great place to get the number you need is llama.cpp. It tells you how much RAM it needs.

This is a new development. It wasn't posted until after I started this thread. It's even better since you don't have to patch the kernel.
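For reference, here's a sketch of setting it from a script. I'm assuming the variable in question is `iogpu.wired_limit_mb` (the name on recent macOS releases); it needs root and resets on reboot:

```python
# Sketch: raise the GPU wired-memory cap on Apple Silicon.
# Assumption: the sysctl is iogpu.wired_limit_mb (recent macOS);
# needs sudo and does not persist across reboots.
import subprocess

def set_gpu_wired_limit(mb: int) -> None:
    """Set the wired memory limit to `mb` megabytes (30000 = ~30GB)."""
    subprocess.run(["sudo", "sysctl", f"iogpu.wired_limit_mb={mb}"], check=True)

set_gpu_wired_limit(30000)  # the 30GB example above
```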

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

There are no new 3090s, so comparing the cost to a new 3090 is pointless, as it's basically just scalped, overpriced new 3090s left.

I'm not comparing it to the cost of a new 3090. I clearly said I was comparing it to the price of a used 3090.

"The M1 32GB Studio may be the runt of the Mac Studio lineup but considering that I paid about what a used 3090 costs on ebay for a new one"

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

I can't wait for UltraFastBERT. If that delivers on its promise then it's a game changer that will propel CPU inference to the front of the pack. For 7B models, up to a 78x speedup. The speedup decreases as the number of layers increases, but I'm hoping at 70B it'll still be pretty significant.
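For intuition, here's a toy sketch of the fast feedforward idea the paper rests on: the FFN neurons are arranged as a binary tree, and each token only evaluates one root-to-leaf path, so roughly log2(n) of the n neurons do any work. The shapes and the ReLU routing rule here are my simplification, not the paper's exact formulation:

```python
# Toy fast-feedforward layer: O(log n) neurons per token instead of O(n).
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 11                  # 2**11 - 1 = 2047 tree neurons; only 11 fire
n_nodes = 2**depth - 1
W_in = rng.standard_normal((n_nodes, d)) * 0.05   # per-neuron input weights
W_out = rng.standard_normal((n_nodes, d)) * 0.05  # per-neuron output weights

def fff_forward(x):
    """Walk one root-to-leaf path; the sign of each activation picks the child."""
    y = np.zeros_like(x)
    node = 0
    for _ in range(depth):
        act = W_in[node] @ x                     # this neuron's pre-activation
        y += max(act, 0.0) * W_out[node]         # ReLU here; the paper uses GeLU
        node = 2 * node + (1 if act > 0 else 2)  # route left or right
    return y

print(fff_forward(rng.standard_normal(d))[:4])   # 11 dot products, not 2047
```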

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago (1 children)

I just don't log in using the GUI. There indeed doesn't seem to be a way to turn it off like in Linux, so it still uses up tens of MB waiting for you to log in. But that's a far cry from the hundreds of MB if you do log in. I have thought about killing those login processes, but the Mac is so GUI centric that if something really goes wrong and I can't ssh in, I want that as a backup. I think a few tens of MB is worth it for that instead of trying to fix things in the terminal in recovery mode.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

Although I don't doubt you, the rendering looks as fake as it gets.

Here's the attribution for that image: "Social media is abuzz with a screengrab of a regional webpage of the NVIDIA website purporting a 'GeForce RTX 3090 CEO Edition' graphics card."

So tell nvidia to up their rendering game. They should know a little something about graphics or at least know someone that does.

But is there a way to frankenstein more RAM onto an existing 3090? Are there shops I could send mine to?

Supposedly those exact frankensteins are available in China. A poster here on this sub has reported buying some. If you were in China, you could take your 3090 to any of the endless Chinese tech-center booths with dudes who have the skills and equipment to try it. I would ask if they've done it before, though. You don't want to be the one they learn on.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

I mean Taiwan isn't exactly jumping at the chance to build advanced chips for the rogue provinces of West Taiwan, which constantly warn of their impending violent invasion of Taiwan.

Ah... then why are they trying to convince the US to give them a license to run factories in China? TSMC, you know, those Taiwanese chipmakers, want the US to give them a permanent license to run factories in those "rogue provinces of West Taiwan".

You know a good way to keep someone from invading you? Be so crucial that they don't want to risk damaging you in any way. They would be making chips for them right now if the US allowed it.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago (3 children)

Because 4090s are faster. Companies don't use these things for inference like most people do at home; that's low on compute and basically memory-bandwidth bound. Companies use these for training, which is compute heavy, and for compute a 4090 is much faster than a 3090.
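A back-of-the-envelope sketch of why that is: in single-stream decoding, every weight gets read once per token, so tokens/s is capped at roughly bandwidth divided by model size. The spec numbers and model size below are rough assumptions:

```python
# Rough ceiling: decode t/s ≈ memory bandwidth / bytes read per token.
def max_tokens_per_sec(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

model_gb = 4.1  # ~7B model at 4-bit quantization (assumption)
for name, bw_gb_s in [("RTX 3090", 936.0), ("RTX 4090", 1008.0)]:
    print(f"{name}: ~{max_tokens_per_sec(bw_gb_s, model_gb):.0f} t/s ceiling")
```

The two cards are within about 8% on bandwidth, which is why they infer similarly at home, while the 4090's much higher compute only shows up in training and prompt processing.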

And they are busy putting 48GB on those 3090s.

https://www.techpowerup.com/img/erPhoONBSBprjXvM.jpg

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

It depends on what you do with it. I think they can be very useful. Check my post elsewhere in this thread.

https://www.reddit.com/r/LocalLLaMA/comments/183na9z/china_is_retrofitting_consumer_rtx4090s_with_2/kasawk5/

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

The used market only a year ago was flooded with things like MI25s and above that were being liquidated.

The MI25 is finally getting the love it deserves. I wish I had bought more when they were $65-$70 a few months ago, but I was hoping they would go lower. Even a month or so ago, I think I saw them at $90. Right now (I just checked before posting) the seller with the most stock is selling them for $160. Crazy.

By the way, the one I got is in really good shape. As in really good. If the seller told me it was new, I would believe it. There's not a speck of dust on it, like nowhere, and I looked deep into the fins of the heatsink. Even the fingers on the slot looked basically new.

The only upside compared to the crypto boom, I guess, is that with AI-based use cases PCIe bus speeds matter, and this is stopping people from buying anything and everything and then slapping 8 GPUs in an AI mining rig.

I don't think that's blanket true. It really depends on what you do with it. I can think of a few uses off the top of my head where 8 GPUs sitting on janky PCIe 1x links would be fine.

  1. Use them as a team. Nothing says you can only use them to infer one large model. You can run 8 7B-13B models, one model per card. The 1x speed wouldn't really matter in that case once the models are loaded. Having a team of small models run instead of 1 large model is a valid way to go (see the sketch after this list).

  2. Batch process 8 different prompts on a large model spread across the GPUs. Since inference is sequential, only 1 GPU is active at a time when processing a single prompt; the other 7 GPUs are idle. Don't let them idle. Vectorize it. Process 8 or more prompts at the same time. Once the vector is full, all 8 GPUs will be running. The t/s for any one prompt won't be fast, but the overall throughput t/s across all the prompts will be. It would be best to keep the prompts coming, and thus the vector full, to keep all GPUs running. So a good application for this is a server that is inferring multiple prompts from multiple users. Or multiple prompts from the same user. Or even the same prompt 8 different times, since you can ask the same model the same question 8 times and get 8 different answers. Let it process the prompt 8 times and pick the best answer.

  3. There are techniques that can allow inference itself to be parallelized. Those may run great on a mining rig with 8 GPUs.

So it's far from useless to repurpose an old mining rig. You just have to be creative.
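To make option 1 concrete, here's a minimal sketch, assuming llama-cpp-python built with CUDA and a hypothetical model path; each worker pins itself to one card before loading, so the 1x link is only a load-time cost:

```python
# Sketch: a "team" of small models, one per GPU, on an old mining rig.
import os
from multiprocessing import Process

def worker(gpu_id: int, prompt: str) -> None:
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # pin before importing
    from llama_cpp import Llama   # import after pinning so it sees one GPU
    llm = Llama(model_path="./models/7b-q4.gguf",  # hypothetical path
                n_gpu_layers=-1)                   # offload all layers
    out = llm(prompt, max_tokens=64)
    print(gpu_id, out["choices"][0]["text"])

if __name__ == "__main__":
    procs = [Process(target=worker, args=(i, f"Task {i}: summarize ..."))
             for i in range(8)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```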
