This post was submitted on 23 Nov 2023
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
You don't say what quant you're using, if any. But with Q4_K_M I get this on my M1 Max using pure llama.cpp.
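For anyone wanting to reproduce a run like that, here's a minimal sketch via the llama-cpp-python bindings rather than the llama.cpp CLI; the model path is hypothetical:

```python
# Minimal sketch with llama-cpp-python; the CLI run above uses the same engine.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # hypothetical Q4_K_M GGUF
    n_gpu_layers=-1,  # offload all layers to Metal on Apple Silicon
)
out = llm("Explain memory bandwidth in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```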
Your M3 also has lower memory bandwidth than my M1 Max: it's the 300 GB/s version versus my 400 GB/s.
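As a back-of-envelope check, assuming token generation is memory-bandwidth-bound, the throughput ceiling is roughly bandwidth divided by the bytes read per token (about the model file size). The ~4.1 GB figure for a 7B Q4_K_M file below is an assumption:

```python
# Rough decode-speed ceiling: tokens/sec <= bandwidth / bytes read per token.
model_size_gb = 4.1  # assumed size of a 7B Q4_K_M GGUF file
for name, bw_gbps in [("M1 Max, 400 GB/s", 400), ("M3, 300 GB/s", 300)]:
    print(f"{name}: ~{bw_gbps / model_size_gb:.0f} tok/s ceiling")
```

Real throughput lands well under that ceiling, but the ratio between the two machines should hold.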