LocalLLaMA

M1/M2/M3: increase VRAM allocation with sudo sysctl iogpu.wired_limit_mb=12345 (i.e., the amount in MB to allocate) (alien.top)

submitted 11 months ago by farkinga@alien.top to c/localllama@poweruser.forum

If you're using Metal to run your LLMs, you may have noticed that the amount of VRAM available is only around 60%-70% of total RAM, despite Apple's unique architecture for sharing the same high-speed RAM between CPU and GPU.

It turns out this VRAM allocation can be controlled at runtime using sudo sysctl iogpu.wired_limit_mb=12345

See here: https://github.com/ggerganov/llama.cpp/discussions/2182#discussioncomment-7698315

Previously, it was believed this could only be done with a kernel patch, which required disabling a macOS security feature... and tbh that wasn't great.

Will this make your system less stable? Probably. The OS still needs some RAM, and if you allocate 100% to VRAM, I predict you'll hit a hard lockup, a spinning beachball, or just a system reset. So be careful not to get carried away. Even so, many will be able to get a few more gigs this way, enabling a slightly larger quant, longer context, or maybe even the next level up in parameter size. Enjoy!
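One way to avoid over-allocating is to compute the limit from total RAM minus a fixed reserve for the OS. The snippet below is only a sketch: the helper name wired_limit_mb and the 8 GB reserve are assumptions for illustration, not figures from Apple or llama.cpp, so pick a margin that suits your own workload.

```shell
#!/bin/sh
# Sketch: pick a value for iogpu.wired_limit_mb that leaves headroom for macOS.
# The reserve is an assumed safety margin, not a documented requirement.

# wired_limit_mb TOTAL_MB RESERVE_MB
# Prints TOTAL_MB - RESERVE_MB; fails if the reserve is not smaller than the total.
wired_limit_mb() {
    total_mb=$1
    reserve_mb=$2
    if [ "$reserve_mb" -ge "$total_mb" ]; then
        echo "reserve must be smaller than total RAM" >&2
        return 1
    fi
    echo $((total_mb - reserve_mb))
}

# Example: a 32 GB machine, keeping 8 GB back for the OS (assumed margin).
wired_limit_mb 32768 8192

# On an Apple Silicon Mac, the result could then be applied like this:
#   total_mb=$(( $(sysctl -n hw.memsize) / 1048576 ))   # hw.memsize is in bytes
#   sudo sysctl iogpu.wired_limit_mb="$(wired_limit_mb "$total_mb" 8192)"
```

Because this is set at runtime with sysctl, a bad value can be corrected by running the command again with a smaller number.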

[–] Zugzwang_CYOA@alien.top 1 points 11 months ago (1 children)

How is the prompt processing time on a Mac? If I were to work with an 8k prompt for RP, with big, frequent changes to the prompt, would it be able to read my ever-changing prompt and respond in a timely manner?

I would like to use SillyTavern as my front end, and that can result in big prompt changes between replies.

[–] bebopkim1372@alien.top 1 points 11 months ago (1 children)

On M1, prompt evaluation uses BLAS operations, and the speed is terrible. I also have a PC with a 4060 Ti 16GB, and cuBLAS is the speed of light compared with BLAS on my M1 Max. BLAS speed is acceptable for models under 30B, but above 30B it is really slow.

[–] Zugzwang_CYOA@alien.top 1 points 11 months ago

Good to know. It sounds like Macs are great for asking simple questions of powerful LLMs, but not so great for roleplaying with large-context stories. I had hoped that an M2 Max would be viable for RP at 70B or 120B, but I guess not.