this post was submitted on 28 Nov 2023

LocalLLaMA


Community to discuss about Llama, the family of large language models created by Meta AI.


I recently got a 32GB M1 Mac Studio. I was excited to see how big a model it could run. It turns out that's 70B. It's a Q3_K_S model, the second-smallest 70B quant in GGUF format, but it's still a 70B model.

As many people know, the Mac shouldn't be able to dedicate that much RAM to the GPU. Apple limits it to 67%, which is about 21GB on this machine. This model is 28GB, so it shouldn't fit. But there's a solution, thanks to these smart people here:

https://github.com/ggerganov/llama.cpp/discussions/2182

They wrote a program to patch that limit in the kernel. You can set it to anything you want, so I cranked mine up to 92%. I also do a couple of things to save RAM.

  1. I don't use the GUI. Simply logging in and doing nothing uses a fair amount of RAM, so I run my Mac headless and ssh in.

  2. I stopped the mds_stores process from running. I saw that it was using between 500MB and 1GB of RAM. It's the process that indexes the drives for faster search. Considering my drive is 97% empty, I don't know what it was doing to use up 1GB of RAM. I always turn off indexing on all my machines anyway.
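For anyone wanting to do the same, Spotlight indexing can be toggled with the stock mdutil tool; a minimal sketch (macOS only, needs admin rights):

```shell
# Disable Spotlight indexing on all volumes (stops mds_stores churn)
sudo mdutil -a -i off

# Re-enable it later if you want search back
sudo mdutil -a -i on
```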

With all that set, the highest in-use memory I've seen is 31.02GB while running a 70B Q3_K_S model. So there's headroom, and maybe a lot more, since my goal is just to avoid swapping. I noticed that when I log into the GUI while it's running a model, compressed RAM goes up to around 750MB, but it still doesn't swap. So I wonder how far memory compression would let me stretch it. I do notice that it's not as snappy: with no GUI login, the model runs right away once it's been cached by the first run; with a GUI login, it pauses for a few seconds.

As for performance, it's 14 t/s prompt and 4 t/s generation using the GPU, and 2 and 2 using the CPU. Power consumption is remarkably low. Using the GPU, powermetrics reports 39 watts for the entire machine, but my wall monitor says it's drawing 79 watts from the wall. Using the CPU, powermetrics reports 36 watts and the wall monitor says 63 watts. I don't know why the gap between GPU and CPU is so much bigger at the wall: it's only a 3 watt difference in the machine but 16 watts at the wall.
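If you want to reproduce the power numbers, this is roughly the powermetrics invocation (macOS only, needs root; available sampler names can vary between OS versions):

```shell
# Sample CPU and GPU package power once per second, five samples
sudo powermetrics --samplers cpu_power,gpu_power -i 1000 -n 5
```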

All in all, I'm super impressed. The M1 32GB Studio may be the runt of the Mac Studio lineup, but considering that I paid about what a used 3090 costs on eBay for a new one, I think it's the best value for performance I've found for running LLMs. Since I plan on running this all out 24/7/365, the power savings alone compared to anything else with a GPU will be several hundred dollars a year.

https://i.postimg.cc/nMjXLd9K/1.jpg

https://i.postimg.cc/8s2jfhL2/2.jpg

46 comments
[–] vlodia@alien.top 1 points 11 months ago

how about for those mere mortals with an entry level Macbook Pro M1 Pro 2022? Share a link to a friend?

[–] Musenik@alien.top 1 points 11 months ago (2 children)

Ok, this is a great post! I want to use the sudo shortcut - I don't mind executing it after every boot. Question though.

sudo sysctl iogpu.wired_limit_mb=

What do I replace "" with? A percentage, like 85? Or a fraction, like 0.85?

...if I wanted to give the GPU 85 percent of my RAM?

[–] ageorgios@alien.top 1 points 11 months ago

85% of 32GB RAM would be 27200MB
I have no prior knowledge of the command though
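To make the arithmetic explicit, here's the conversion both ways (shell arithmetic; the two figures differ only in whether a GB is counted as 1024 or 1000 MB):

```shell
# 85% of 32 GB with 1 GB = 1024 MB
echo $(( 32 * 1024 * 85 / 100 ))   # 27852
# 85% of 32 GB with 1 GB = 1000 MB (the 27200MB figure above)
echo $(( 32 * 1000 * 85 / 100 ))   # 27200
```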

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago (1 children)

You replace it with a number of MB, not a percentage. The program that patched the kernel took a percentage; this takes a number of MB. So, for example, 30000 would be 30GB. A great place to get the number you need is llama.cpp. It tells you how much RAM it needs.

This is a new development. It wasn't posted until after I started this thread. It's even better since you don't have to patch the kernel.
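As a sketch of the workflow (the 30000 is just an example value; the iogpu.wired_limit_mb key is what current macOS exposes, and the setting resets on reboot):

```shell
# Show the current GPU wired-memory limit (0 means the stock ~2/3-of-RAM default)
sysctl iogpu.wired_limit_mb

# Allow the GPU to wire up to 30GB (30000 MB)
sudo sysctl iogpu.wired_limit_mb=30000
```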

[–] Musenik@alien.top 1 points 11 months ago

That's exactly the info I needed! Thank you.

[–] MannowLawn@alien.top 1 points 11 months ago (2 children)

The last comment on GitHub seems a way safer option. On every reboot just do sudo sysctl iogpu.wired_limit_mb= and you're done. No weird patching and booting and whatnot.

But amazing knowledge again. Especially for the 192GB version, this could open up some extra doors. I assume the M3 next year will be 256GB, so that would bring some awesome models to the table!

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

Definitely. It's a much better way to do it for a variety of reasons, not least of which is that the kernel patch is kernel-dependent, so it will need to be kept up to date. Setting this system variable isn't. Unless Apple removes it, it should keep working in future releases of macOS.

[–] anarchos@alien.top 1 points 11 months ago

Yes, this is great news! Just the other day I was trying to get the yi-34 model running at q3 on my 24GB MacBook Air and was literally like 50 MB away from getting it to load. I had a search for a way to bump up the max allocable RAM but didn't see anything. This method works great, just tested it.

[–] WhereIsYourMind@alien.top 1 points 11 months ago (1 children)

I can run Q4 Falcon-180B on my M3 Max (40 GPU) with 128GB RAM. I get 2.5 t/s, it's crazy for a mobile chip.

[–] uti24@alien.top 1 points 11 months ago

That is pretty good! What quant do you use?

Have you tried Goliath-120B? How fast does it run? It might be even better than Falcon-180B, so it might be worth trying.

[–] astrange@alien.top 1 points 11 months ago

I stopped the mds_stores process from running. I saw that it was using up between 500MB and 1GB of RAM.

You can do this if you want, but it shouldn't matter; think of that memory as counting against swap space (~4x physical RAM). Only "wired memory" counts against physical RAM. Anything going to the GPU, like the LLM weights, is wired memory.

That kernel patching is kind of wacky. If you're going to do that, at least patch it in memory so you can still do OS updates. But like other comments say, the iogpu sysctl should do what you want.

[–] Alrightly@alien.top 1 points 11 months ago

I am new to Mac Studio, primarily a laptop user. How do you run it headless?

[–] DarthNebo@alien.top 1 points 11 months ago (2 children)

There's hardly any case for using the 70B chat model, most LLM tasks are happening just fine with Mistral-7b-instruct at 30tok/s

[–] Zugzwang_CYOA@alien.top 1 points 11 months ago (2 children)

7b models are absolute garbage for RP with large context and world info entries.

[–] DarthNebo@alien.top 1 points 11 months ago

Mine are mostly summarisation & extraction work so Mistral-instruct is way better than llama13b

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

I guess you've only used a 7B model. IMO, the magic doesn't really start happening until 30B.

[–] tristam15@alien.top 1 points 11 months ago

There should be a separate Mac flair for all Mac threads.

That would help us Mac users.

[–] pulse77@alien.top 1 points 11 months ago (1 children)

You may be able to run one of the Q4 models without problems. Because llama.cpp uses mmap to map files into memory, you can go above available RAM, and because many models are sparse it will not touch all the mapped pages; even when it needs one, it will swap it in and out with other pages on demand. I was able to run falcon-180b-chat.Q6_K, which uses about 141GB, on a 128GB Windows PC with less than 1% SSD reads during inference. I could even run falcon-180b-chat.Q8, which uses about 182GB, but in that case the SSD was working heavily during inference and it was unbearably slow (0.01 tokens/second).
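For what it's worth, llama.cpp exposes flags around this loading behavior; a sketch (the model path and prompt are illustrative, not from the thread):

```shell
# mmap is the default loading mode; --mlock pins the mapped weights in RAM,
# while --no-mmap would instead load everything up front
./main -m ./models/falcon-180b-chat.Q4_K_M.gguf -p "Hello" -n 64 --mlock
```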

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

Yes. I've done that before on my other machines; llama.cpp in fact defaults to that. The hope for me was that since the models are sparse, the OS would cache the relevant parts of the models in RAM. So the first run through would be slow, but subsequent runs would be fast since those pages are cached. How well that works depends on how much RAM the OS is willing to use for caching mmapped pages and how smartly it does it. My hope was that if it did it smartly with sparse data it would be pretty fast. So far, my hopes haven't been realized.

[–] farkinga@alien.top 1 points 11 months ago (1 children)

As so often happens, the real LPT is in the comments. Using sysctl to change vram allocation is amazing. Thanks for this post.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

Absolutely. That is a much better way to do it. But that was a recent development. More recent than this thread. That post at github about doing it that way happened 3 hours after I posted this thread.

[–] SomeOddCodeGuy@alien.top 1 points 11 months ago (1 children)

Awesome! I think I remember us talking about this at some point, but I didn't have the courage to try it on my own machine. You're the first person I've seen actually do the deed, and now I want to as well =D The 192GB Mac Studio stops at 147GB... I also run headless, so I can't fathom that this stupid brick really needs 45GB of RAM to do normal stuff lol.

I am inspired. I'll give it a go this weekend! Great work =D

[–] robertotomas@alien.top 1 points 11 months ago

please followup for actual memory limits, if you hit them

[–] leelweenee@alien.top 1 points 11 months ago (2 children)

How do you run headless to save the excess RAM of the GUI? I've looked on Google, but there are only tutorials about screen sharing.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago (1 children)

I just don't login using the GUI. There indeed doesn't seem to be a way to turn it off like in Linux. So it still uses up 10s of MB waiting for you to login. But that's a far cry from the 100's of MB if you do login. I have thought about killing those login processes, but the Mac is so GUI-centric that if something really goes wrong and I can't ssh in, I want that as backup. I think a few 10s of MB is worth it for that instead of trying to fix things in the terminal in recovery mode.

[–] leelweenee@alien.top 1 points 11 months ago

So it still uses up 10s of MB waiting for you to login. But that's a far cry from the 100's of MB if you do login

Thanks that clears things up a lot. I'm going to try it. And then I'll look into maybe killing the login screen once I've ssh'ed in.

[–] ThisGonBHard@alien.top 1 points 11 months ago

I think it means running with no display plugged in.

[–] a_beautiful_rhind@alien.top 1 points 11 months ago (1 children)

Pretty cool hack. Beats CPU inference at those speeds for sure.

[–] Aaaaaaaaaeeeee@alien.top 1 points 11 months ago (2 children)

The bandwidth utilization is not the best yet on GPU; it's only a third of the potential 400GB/s.

The CPU RAM bandwidth utilization in llama.cpp, on the other hand, is nearly 100%. For my 32GB of DDR4, I get 1.5 t/s with the 70B Q3_K_S model.

There will hopefully be more optimizations to speed this up.

[–] DavidSJ@alien.top 1 points 11 months ago

There will hopefully be more optimizations to speed this up.

Speculative, Jacobi, or lookahead decoding could speed things up quite a bit.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

I can't wait for ultrafastbert. If that delivers on the promise then it's a game changer that will propel CPU inference to the front of the pack. For 7B models up to a 78x speedup. The speedup decreases as the number of layers increase, but I'm hoping at 70B it'll still be pretty significant.

[–] smartid@alien.top 1 points 11 months ago (2 children)

can the ram be upgraded, guessing no since it's apple

[–] calcium@alien.top 1 points 11 months ago (1 children)

Yea, I'm pissed I can't upgrade the ram on my 3090 either. /s

[–] smartid@alien.top 1 points 11 months ago (1 children)

what a nonsensical comparison.

have you ever posted in this sub before? also what country are you posting from?

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

It's a perfectly sensical comparison. If anything, it's far easier to upgrade the RAM on a 3090 than on an M-series Mac.

[–] Shap6@alien.top 1 points 11 months ago

You can't, but it's not just Apple being assholes like what they did with SSDs. The "M" chips are completely integrated SoCs that use unified memory shared between the GPU and CPU.

[–] nero10578@alien.top 1 points 11 months ago (1 children)

There are no new 3090s, so comparing the cost to a new 3090 is pointless, as it's basically just scalped, overpriced new 3090s left.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

There are no new 3090s, so comparing the cost to a new 3090 is pointless, as it's basically just scalped, overpriced new 3090s left.

I'm not comparing it to the cost of a new 3090. I clearly said I was comparing it to the price of a used 3090.

"The M1 32GB Studio may be the runt of the Mac Studio lineup but considering that I paid about what a used 3090 costs on ebay for a new one"

[–] chewbie@alien.top 1 points 11 months ago

Does anyone know how many streams of Llama 2 70B an Apple Studio can run in parallel? Does it need the same amount of RAM for each completion, or does llama.cpp manage to share it between different streams?

[–] Zugzwang_CYOA@alien.top 1 points 11 months ago (1 children)

It's impressive, but from what I've heard on this sub, the big weakness of Macs is the time it takes to evaluate the prompt. For example, if you're roleplaying with 8k context in SillyTavern and your prompt is continuously changing as world entries are triggered and character sheets are swapped during a group chat, then prompt evaluation speed matters greatly.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago (1 children)

That's where context shifting comes into play. You don't re-evaluate the entire context. You just process the additions.

"Previously, we had to re-evaluate the context when it becomes full and this could take a lot of time, especially on the CPU. Now, this is avoided by correctly updating the KV cache on-the-fly:"

https://github.com/ggerganov/llama.cpp/pull/3228

[–] Zugzwang_CYOA@alien.top 1 points 11 months ago

Interesting, so even if the middle of the context changes between replies as world entries appear and disappear in the context, it will still not force a re-evaluation of the greater whole?

[–] CheatCodesOfLife@alien.top 0 points 11 months ago (1 children)

Thanks for this, I'm about 3GB short of running Goliath-120b on my 64gb mbp.

[–] sammcj@alien.top 1 points 11 months ago (1 children)

I should probably give that a go on my 96GB M2 Max. I haven't done much reading on that model.

[–] CheatCodesOfLife@alien.top 1 points 11 months ago

Hey mate, no need to do this now. There's a terminal command you can run instead. I did this on my M1 and it works fine.

https://old.reddit.com/r/LocalLLaMA/comments/186phti/m1m2m3_increase_vram_allocation_with_sudo_sysctl/

> sudo sysctl iogpu.wired_limit_mb=57344

I did that for my 64GB, you'd want to change the 57344 to whatever you want for your 96GB