this post was submitted on 23 Nov 2023
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
When I load yarn-mistral-64k in Ollama (which uses llama.cpp) on my 32GB Mac, it allocates 16.359 GB for the GPU. I don't remember how much the 128k context version needs, but it was more than the 21.845 GB macOS allows for the GPU's use on a 32GB machine. You aren't going to get very far on a 16GB machine.
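For anyone wondering why the 128k build blows the budget: most of that allocation is KV cache, which grows linearly with context. A rough back-of-the-envelope sketch, assuming Yarn-Mistral keeps Mistral-7B's geometry (32 layers, 8 KV heads via GQA, head dim 128) and an fp16 cache:

```python
# Back-of-the-envelope KV-cache sizing (assumed Mistral-7B geometry, fp16 cache).
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # 2x covers keys and values; fp16 -> 2 bytes per element
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

for n_ctx in (65_536, 131_072):
    print(f"{n_ctx:>7} tokens -> {kv_cache_bytes(n_ctx) / 2**30:.1f} GiB of KV cache")
# 65536 tokens -> 8.0 GiB; 131072 tokens -> 16.0 GiB
```

Add roughly 4 GiB of Q4 weights plus compute buffers on top of the cache and the 64k figure lands in the same ballpark as the 16.359 GB you're seeing, while the 128k version sails past the ~21.8 GB Metal working-set cap (roughly two thirds of 32 GB of unified memory).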
Maybe if you don't send any layers to the GPU and force it to use the CPU you could eke out a little more. On Apple Silicon, CPU-only inference seems to be about a 50% hit over GPU speeds, if I remember right.
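If someone wants to try that, here's a minimal sketch against Ollama's local REST API, setting num_gpu to 0 so no layers are offloaded to Metal. The yarn-mistral:7b-64k tag is an assumption; check `ollama list` for whatever the model is actually called on your machine.

```python
# Minimal sketch: CPU-only inference through a local Ollama server.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "yarn-mistral:7b-64k",   # assumed tag; verify with `ollama list`
        "prompt": "Summarize this thread.",
        "stream": False,
        "options": {
            "num_gpu": 0,      # offload zero layers to the GPU -> pure CPU inference
            "num_ctx": 65536,  # request the full 64k context window
        },
    },
    timeout=600,
)
print(resp.json()["response"])
```

Whether the ~50% figure holds at 64k context is another question; prompt processing tends to hurt more on CPU than token generation does.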