If you're using Metal to run your llms, you may have noticed the amount of VRAM available is around 60%-70% of the total RAM - despite Apple's unique architecture for sharing the same high-speed RAM between CPU and GPU.
It turns out this VRAM allocation can be controlled at runtime using sudo sysctl iogpu.wired_limit_mb=12345
See here: https://github.com/ggerganov/llama.cpp/discussions/2182#discussioncomment-7698315
Previously, it was believed this could only be done with a kernel patch - and that required disabling a macos security feature ... And tbh that wasn't that great.
Will this make your system less stable? Probably. The OS will need some RAM - and if you allocate 100% to VRAM, I predict you'll encounter a hard lockup, spinning Beachball, or just a system reset. So be careful to not get carried away. Even so, many will be able to get a few more gigs this way, enabling a slightly larger quant, longer context, or maybe even the next level up in parameter size. Enjoy!
My M1 Max Mac Studio has 64GB of RAM. By running
sudo sysctl iogpu.wired_limit_mb=57344
, it did magic!Yay!
Yeah! That's what I'm talking about. Would you happen remember what it was reporting before? If it's like the rest, I'm assuming it said something like 40 or 45gb, right?
≥64GB allows 75% to be used by GPU. ≤32 its ~66%. Not sure about the 36GB machines.