farkinga

joined 10 months ago
[–] farkinga@alien.top 1 points 9 months ago (5 children)

Yeah! That's what I'm talking about. Would you happen to remember what it was reporting before? If it's like the rest, I'm assuming it said something like 40 or 45GB, right?

 

If you're using Metal to run your LLMs, you may have noticed that the amount of VRAM available is only around 60%-70% of total RAM - despite Apple's unified-memory architecture, which shares the same high-speed RAM between CPU and GPU.

It turns out this VRAM allocation can be controlled at runtime using sudo sysctl iogpu.wired_limit_mb=12345 (substitute the limit you want, in megabytes).

See here: https://github.com/ggerganov/llama.cpp/discussions/2182#discussioncomment-7698315

Previously, it was believed this could only be done with a kernel patch - and that required disabling a macOS security feature ... and tbh that wasn't great.

Will this make your system less stable? Probably. The OS still needs some RAM - and if you allocate 100% to VRAM, I predict you'll encounter a hard lockup, a spinning beachball, or just a system reset. So be careful not to get carried away. Even so, many will be able to get a few more gigs this way, enabling a slightly larger quant, longer context, or maybe even the next level up in parameter size. Enjoy!
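To make that concrete, here's a rough sketch of how I'd apply it. The 64GB machine and the ~8GB of headroom are just example numbers (pick values that fit your own setup), the setting doesn't appear to persist across reboots, and on older macOS versions the key may be named debug.iogpu.wired_limit instead:

```
# Check total RAM (bytes) and the current wired limit (0 usually means "use the default split")
sysctl hw.memsize
sysctl iogpu.wired_limit_mb

# Example for a 64GB machine: leave ~8GB for macOS, give the GPU the rest.
# 56 * 1024 = 57344 MB -- the headroom amount is an assumption; tune it for your workload.
sudo sysctl iogpu.wired_limit_mb=57344

# Set it back to 0 to restore the default behavior.
sudo sysctl iogpu.wired_limit_mb=0
```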

[–] farkinga@alien.top 1 points 9 months ago (1 children)

As so often happens, the real LPT is in the comments. Using sysctl to change vram allocation is amazing. Thanks for this post.

[–] farkinga@alien.top 1 points 10 months ago

The words you see were generated by a neural network based on the words it was trained on. That text is not related to the intentions or capabilities of the model.

Since it is running in gpt4all, we can see from the source code that the model cannot call functions. Therefore, the model cannot "do" anything; it just generates text.

If, for example, the model said it was buying a book from a website, that doesn't mean anything. We know it can't do that because the code running the model doesn't provide that kind of feature. The model lives inside a sandbox, cut off from the outside world.

[–] farkinga@alien.top 1 points 10 months ago (1 children)

Nice post. This got me thinking...

While many commenters are discussing the computation aspect, which leads to Petals and the Horde, I am thinking about BitTorrent (since you mentioned it).

We do need a hub for torrenting LLMs. HF is amazing for its bandwidth (the UI is just okay) - but once that VC money dries up, we'll be on our own. So distributing the models - just the data, not the computation - is also important.