fallingdowndizzyvr

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

That's why Intel is pitching oneAPI. They want it to be the single API that brings everything together. That's why it also supports Nvidia GPUs, AMD GPUs, CPUs, and even FPGAs.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

Yes, that M1 Max should run LLMs really well, including 70B with decent context. An M2 won't be much better. An M3, other than the 400GB/s model, won't be as good, since every M3 except the 400GB/s version has had its memory bandwidth cut compared to the M1/M2 models.

Are you seeing that $2400 at B&H? It was $200 cheaper there a couple of weeks ago. It might be worth waiting to see if the price goes back down.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago (2 children)

There are quite a few Intel projects in AI. There's also the optimized DirectML they made with Microsoft, so anything that supports DirectML should also be well optimized on Intel hardware, both CPUs and GPUs.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

The easiest thing to do is to get a Mac Studio. It also happens to be the best value. 3x4090s at $1600 each is $4800, and that's just for the cards. Adding a machine to put those cards into will cost another few hundred dollars. Just the cost of the 3x4090s puts you into Mac Studio Ultra 128GB range, and adding the machine to put those cards into puts you into Mac Studio Ultra 192GB range. With those 3x4090s you only get 72GB of VRAM; both of those Mac options give you much more RAM.
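As a rough sketch of the arithmetic (the per-card and host-machine figures are the ones assumed above, not quotes):

```python
# Rough arithmetic behind the comparison above. The $1600/card and
# ~$500 host-machine figures are the assumptions stated in the comment.
gpu_price_usd = 1600
gpu_vram_gb = 24
num_gpus = 3
host_machine_usd = 500  # assumed ballpark for a box to hold the cards

build_cost = num_gpus * gpu_price_usd + host_machine_usd  # ~$5300
build_vram = num_gpus * gpu_vram_gb                       # 72 GB

print(f"3x4090 build: ~${build_cost}, {build_vram} GB of VRAM total")
print("vs. a Mac Studio Ultra with 128-192 GB of unified memory at a similar price")
```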

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

There's no point to it, since if the model is too big to fit in RAM, disk I/O becomes the limiter. Then it wouldn't matter whether you had 400GB/s of memory bandwidth or 40GB/s; the disk I/O would be the bottleneck either way.
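A back-of-envelope way to see it: each generated token requires streaming roughly the whole set of weights once, so tokens/sec is bounded by bandwidth divided by model size. The sizes and bandwidths below are illustrative assumptions:

```python
# Upper bound on generation speed if every token has to stream the full
# weights over a given link. All numbers here are illustrative assumptions.
model_gb = 40  # e.g. a quantized model too big to fit in RAM

for name, gb_per_s in [("RAM @ 400 GB/s", 400), ("RAM @ 40 GB/s", 40), ("NVMe SSD @ 5 GB/s", 5)]:
    print(f"{name:>18}: ~{gb_per_s / model_gb:.1f} tokens/s max")
```

If the weights have to come off disk for every token, the SSD line dominates no matter how fast the RAM is.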

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago (2 children)

Because it wouldn't be any faster than doing CPU inference. Both CPUs and GPUs are already waiting around for data to process; that I/O is the limiter, and this changes none of that.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago (1 children)

That's where context shifting comes into play. You don't re-evaluate the entire context. You just process the additions.

"Previously, we had to re-evaluate the context when it becomes full and this could take a lot of time, especially on the CPU. Now, this is avoided by correctly updating the KV cache on-the-fly:"

https://github.com/ggerganov/llama.cpp/pull/3228
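A simplified sketch of the idea (this is not llama.cpp's actual implementation, which shifts the KV cache in place and adjusts positions; `evaluate()` here is just a stand-in):

```python
# Toy illustration of context shifting: when the window is full, drop the
# oldest cached entries (after a small kept prefix) instead of re-evaluating
# everything, then evaluate only the newly added tokens.

def evaluate(token):
    # Stand-in for computing one token's attention K/V entry.
    return {"token": token}

def step(kv_cache, new_tokens, n_ctx=4096, n_keep=4):
    if len(kv_cache) + len(new_tokens) > n_ctx:
        tail = kv_cache[n_keep:]
        # Keep the prefix (e.g. the system prompt) plus the newest half of the rest.
        kv_cache = kv_cache[:n_keep] + tail[len(tail) // 2:]
    kv_cache += [evaluate(t) for t in new_tokens]  # only the additions are processed
    return kv_cache
```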

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago (1 children)

> So I guess the default is: 49152

It is. To be more clear, llama.cpp tells you what the recommendedMaxWorkingSetSize is, which should match that number.
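For reference, that number can also be read straight from Metal. A minimal sketch, assuming the pyobjc Metal bindings (`pyobjc-framework-Metal`) are installed:

```python
# Print Metal's recommended working-set size for the default GPU.
# Requires the pyobjc-framework-Metal package.
import Metal

device = Metal.MTLCreateSystemDefaultDevice()
print(f"recommendedMaxWorkingSetSize: {device.recommendedMaxWorkingSetSize() / 1024**2:.0f} MB")
```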

 

For those looking for a 3090, this is a pretty good way to go. It's from the manufacturer and not some random person on eBay. Regular used 3090s sell for over $800 on eBay now. This one is a 3090 Ti and comes with a two-year warranty from the manufacturer.

https://www.zotacstore.com/us/zt-a30910b-10p-o

If you are looking for a regular 3090 for less money, they also have those for as low as $711, but they are out of stock right now. So launch a bot and have it watch that page like a hawk, since when they do drop stock it's like Black Friday and over in seconds or minutes.

https://www.zotacstore.com/us/zt-a30900c-10p-o
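A very rough sketch of such a watcher; the "Out of Stock" marker is an assumption about the page's markup, so check the real HTML and be polite with the polling interval:

```python
# Naive stock watcher: poll the product page and flag when the
# out-of-stock marker disappears. The marker string is an assumption.
import time
import urllib.request

URL = "https://www.zotacstore.com/us/zt-a30900c-10p-o"

while True:
    req = urllib.request.Request(URL, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req, timeout=30).read().decode("utf-8", "ignore")
    if "Out of Stock" not in html:
        print("Possibly back in stock:", URL)
        break
    time.sleep(60)  # once a minute is plenty; don't hammer the store
```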

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

I'm really interested in having a 51B model. I would love something between 34B and 65/70B.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

Absolutely. That is a much better way to do it, but it's a recent development, more recent than this thread. The GitHub post about doing it that way went up three hours after I posted this thread.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

As per the latest developments in that discussion, "iogpu.wired_limit_mb" only works on Sonoma. So if you are on an older version of macOS, try "debug.iogpu.wired_limit" instead.
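A small sketch of picking the key by macOS version and printing the command to run with sudo (the exact unit each key expects is an assumption; verify it against that discussion before applying):

```python
# Print the wired-limit sysctl command for this macOS version.
# Sonoma (macOS 14+) uses iogpu.wired_limit_mb; older releases use
# debug.iogpu.wired_limit per the linked discussion. Confirm the unit
# (MB vs bytes) for the older key before applying.
import platform

desired_mb = 24576  # example target for a 32GB machine; pick your own
major = int(platform.mac_ver()[0].split(".")[0] or 0)

key = "iogpu.wired_limit_mb" if major >= 14 else "debug.iogpu.wired_limit"
print(f"sudo sysctl {key}={desired_mb}")
```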

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

I guess you've only used a 7B model. IMO, the magic doesn't really start happening until 30B.

 

I recently got a 32GB M1 Mac Studio. I was excited to see how big a model it could run, and it turns out the answer is 70B. It's a Q3_K_S quant, so the second-smallest 70B size in GGUF format, but it's still a 70B model.

As many people know, the Mac shouldn't be able to dedicate that much RAM to the GPU. Apple limits it to about 67%, which is roughly 21GB on a 32GB machine. This model is 28GB, so it shouldn't fit. But there's a solution to that, thanks to these smart people here:

https://github.com/ggerganov/llama.cpp/discussions/2182

They wrote a program to patch that limit in the kernel, and you can set it to anything you want, so I cranked mine up to 92%. I also do a couple of things to save RAM:

  1. I don't use the GUI. Just logging in and doing nothing uses a fair amount of RAM, so I run my Mac headless and ssh in.

  2. I stopped the mds_stores process from running. I saw it using between 500MB and 1GB of RAM. It's the process that indexes the drives for faster search; considering my drive is 97% empty, I don't know what it was doing to use up 1GB of RAM. I normally turn off indexing on all my machines anyway (a small sketch of how is below).
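For step 2, one common way to stop the indexer (assuming Apple's stock mdutil tool) is to turn indexing off for all volumes:

```python
# Disable Spotlight indexing on all volumes, which stops mds_stores from
# doing background work. Needs sudo; re-enable later with "-i on".
import subprocess

subprocess.run(["sudo", "mdutil", "-a", "-i", "off"], check=True)
```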

With all that set, the highest in-use memory I've seen is 31.02GB while running a 70B Q3_K_S model, so there's headroom, and maybe a lot more, since my goal is just to avoid swapping. I noticed that when I log into the GUI while it's running a model, compressed RAM goes up to around 750MB, but it still doesn't swap, so I wonder how far memory compression would let me stretch it. I do notice it's not as snappy: with no GUI login, the model runs right away once it's cached after the first run, while with a GUI login it pauses for a few seconds.
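For the curious, the arithmetic behind why the patch matters on a 32GB machine with a 28GB model:

```python
# Default vs. patched GPU memory limit on a 32GB machine, using the
# percentages quoted above.
total_gb, model_gb = 32, 28

default_limit = 0.67 * total_gb  # ~21.4 GB -> the 28 GB model doesn't fit
patched_limit = 0.92 * total_gb  # ~29.4 GB -> fits with a little headroom

print(f"default: {default_limit:.1f} GB, patched: {patched_limit:.1f} GB, model: {model_gb} GB")
```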

As for performance, it's 14 t/s prompt processing and 4 t/s generation using the GPU, and 2 and 2 using the CPU. Power consumption is remarkably low. Using the GPU, powermetrics reports 39 watts for the entire machine, but my wall monitor says it's pulling 79 watts from the wall. Using the CPU, powermetrics reports 36 watts and the wall monitor says 63 watts. I don't know why the gap is so much bigger at the wall: it's only a 3 watt difference as reported by the machine, but 16 watts at the wall.

All in all, I'm super impressed. The M1 32GB Studio may be the runt of the Mac Studio lineup, but considering I paid about what a used 3090 costs on eBay for a new one, I think it's the best value for performance I've found for running LLMs. Since I plan on running this all out 24/7/365, the power savings alone compared to anything else with a GPU will be several hundred dollars a year.
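A rough sketch of that power math; the electricity rate and the comparison rig's draw are assumptions to swap for your own numbers:

```python
# Approximate yearly electricity cost for 24/7 operation, using the
# wall-measured draw above. The $/kWh rate and the 500W comparison rig
# are assumptions.
hours = 24 * 365
rate_usd_per_kwh = 0.15

for name, watts in [("M1 Studio, GPU inference (79W at the wall)", 79),
                    ("assumed multi-GPU rig (500W)", 500)]:
    print(f"{name}: ~${watts / 1000 * hours * rate_usd_per_kwh:.0f}/year")
```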

https://i.postimg.cc/nMjXLd9K/1.jpg

https://i.postimg.cc/8s2jfhL2/2.jpg

 

Amazon has the Acer A770 on sale for $250. That's a lot of compute with 16GB of VRAM for $250; there is no better value. It does have its challenges. Some things, like MLC Chat, run with no fuss just like on any other card. Other things need some effort, like Oob, FastChat and BigDL. But support for it is getting better every day. At this price, I'm tempted to get another; I have seen some reports of running multi-GPU setups with the A770.

It also comes with Assassin's Creed Mirage for those who still use their GPUs to game.

https://www.amazon.com/dp/B0BHKNK84Y
