this post was submitted on 29 Nov 2023
1 points (100.0% liked)

LocalLLaMA

3 readers
1 users here now

Community to discuss about Llama, the family of large language models created by Meta AI.

founded 1 year ago
MODERATORS
 

If you're using Metal to run your llms, you may have noticed the amount of VRAM available is around 60%-70% of the total RAM - despite Apple's unique architecture for sharing the same high-speed RAM between CPU and GPU.

It turns out this VRAM allocation can be controlled at runtime using sudo sysctl iogpu.wired_limit_mb=12345

See here: https://github.com/ggerganov/llama.cpp/discussions/2182#discussioncomment-7698315

Previously, it was believed this could only be done with a kernel patch - and that required disabling a macos security feature ... And tbh that wasn't that great.

Will this make your system less stable? Probably. The OS will need some RAM - and if you allocate 100% to VRAM, I predict you'll encounter a hard lockup, spinning Beachball, or just a system reset. So be careful to not get carried away. Even so, many will be able to get a few more gigs this way, enabling a slightly larger quant, longer context, or maybe even the next level up in parameter size. Enjoy!

you are viewing a single comment's thread
view the rest of the comments
[–] CheatCodesOfLife@alien.top 1 points 11 months ago (1 children)

That totally worked. I can run goliath 120b on my m1 max laptop now. Thanks a lot.

[–] Zestyclose_Yak_3174@alien.top 1 points 11 months ago (1 children)

Which quant did you use and how was your experience?

[–] CheatCodesOfLife@alien.top 1 points 11 months ago

46G goliath-120b.Q2_K

So the smallest one I found (I didn't quantize this one myself, found it on HF somewhere)

And it was very slow. about 13t/s prompt_eval and then 2.5t/s generating text, so only really useful for me when I need to run it on my laptop (I get like 15t/s with 120b model on my 2x3090 rig at 3bpw exl2)
As for the models it's self, I like it a lot and use it frequently.

TBH, this ram thing is more helpful for me because it lets me run Q5 70b models instead of just Q4 now.