Potentially dumb but related question:
I know the Mac M-series chips can use up to ~70% of their unified memory for "GPU" (VRAM) purposes. The 20GB needed to load a Yi-34B model uses up just about all of that.
So: given I still have maybe 8GB of RAM left to work with (assuming I leave 4GB for the system), would I be able to use a 128K context buffer and have it located in "normal" RAM?
I'm assuming the heavy computational load is the inference itself, with the model loaded in "VRAM" and the GPU side of the chip handling that - but can the context buffer be loaded in the remaining RAM and still work at a decent speed? Or do the context buffer and the model both have to sit in "VRAM" to work at a decent speed?
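For a rough sense of scale, here is a back-of-the-envelope sketch of how big the context buffer (KV cache) might be. The layer count, KV head count and head size below are assumptions taken from the commonly published Yi-34B config, and the fp16 cache is also an assumption; check the config.json of whatever build you actually load. It only estimates size and says nothing about speed.

```python
# Rough KV-cache size estimate. All architecture numbers are ASSUMED values
# for Yi-34B; verify them against the model's own config.json.
N_LAYERS = 60        # assumed number of transformer layers
N_KV_HEADS = 8       # assumed key/value heads (grouped-query attention)
HEAD_DIM = 128       # assumed size of each attention head
BYTES_PER_VALUE = 2  # assumed fp16 cache; a quantized cache would be smaller

GIB = 2**30


def kv_cache_bytes(n_tokens, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to cache keys and values for n_tokens of context."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value * n_tokens


def max_tokens(budget_bytes, bytes_per_value=BYTES_PER_VALUE):
    """Largest context (in tokens) whose cache fits in budget_bytes."""
    return budget_bytes // kv_cache_bytes(1, bytes_per_value)


if __name__ == "__main__":
    print(kv_cache_bytes(128 * 1024) / GIB)  # 30.0 (GiB for a 128K context)
    print(max_tokens(8 * GIB))               # 34952 (tokens that fit in 8 GiB)
```

If those assumptions hold, a full 128K context at fp16 would need about 30GB on its own, so it would not fit in 8GB wherever it lives; something like 35K tokens is closer to what 8GB could hold.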
Thanks for confirming this. I've seen so much praise for these models, yet I've had no end of problems trying to get decent, consistent output. A couple of Yi finetunes seem better than the rest, but there are still too many problems for me to prefer them over other models (for RP/chat purposes).
I'm still hopeful it's just a matter of time (and a fair amount of trial and error) before app developers, model mixers and I work out how to get fantastic, consistent out-of-the-box results.