this post was submitted on 26 Nov 2023
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
Potentially dumb but related question:
I know the Mac M* series chips can use up to ~70% of their unified memory for "GPU" (VRAM) purposes. The 20GB needed to load a Yi-34B model uses up just about all of that.
So: given I still have maybe 8GB of RAM left over to work with (assuming I leave 4GB for the system), would I be able to allocate a 128K context buffer and have it live in "normal" RAM?
I'm assuming the heavy computational load is the inference itself, with the model loaded in "VRAM" and handled by the GPU side of the chip. But can the context buffer be loaded into the remaining RAM and still work at a decent speed? Or do both the model and the context buffer have to sit in "VRAM" to work at a decent speed?
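For a rough sense of scale, here's a back-of-the-envelope sketch of how big a 128K context buffer (KV cache) might be. The architecture numbers are my assumptions for Yi-34B (60 layers, 8 KV heads with grouped-query attention, head dimension 128) and should be checked against the model's config; the function name is just something I made up for illustration. It also assumes an unquantized fp16 cache and ignores any other runtime overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Estimate KV cache size: one K and one V vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_ctx


# Assumed Yi-34B shape (verify against the model's config.json).
N_LAYERS = 60
N_KV_HEADS = 8
HEAD_DIM = 128
N_CTX = 128 * 1024  # 128K tokens

size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, N_CTX)
print(f"{size / 2**30:.1f} GiB")  # prints "30.0 GiB" under these assumptions
```

If those numbers are right, a full 128K fp16 cache would be far bigger than the 8GB I have left, regardless of which side of the memory it sits in, so the question probably only makes sense for a shorter context or a quantized cache.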