I'll be interested to see what responses you get, but I'm gonna come out and say that the Mac's power is NOT its speed. Pound for pound, a CUDA video card is going to absolutely leave our machines in the dust.
So, with that said: I actually think your 20 tokens a second is kind of great. I mean, my M2 Ultra is essentially two M2 Max chips fused together, and I get the following for Mythomax-l2-13b:
- Llama.cpp directly:
  - Prompt eval: 17.79 ms per token, 56.22 tokens per second
  - Eval: 28.27 ms per token, 35.38 tokens per second
  - 565 tokens in 15.86 seconds: 35.6 tokens per second
- llama-cpp-python in Oobabooga:
  - Prompt eval: 44.27 ms per token, 22.59 tokens per second
  - Eval: 27.92 ms per token, 35.82 tokens per second
  - 150 tokens in 5.18 seconds: 28.95 tokens per second
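(Side note, in case anyone wonders how those two numbers relate: the ms-per-token and tokens-per-second figures llama.cpp prints are just reciprocals of each other. A quick sketch, with a helper name of my own choosing:)

```python
# llama.cpp reports both ms per token and tokens per second;
# they're reciprocals, since there are 1000 ms in a second.
def tokens_per_second(ms_per_token: float) -> float:
    return 1000.0 / ms_per_token

# Checking against the numbers above (small differences are rounding):
print(tokens_per_second(17.79))  # prompt eval, ~56.2 tokens/sec
print(tokens_per_second(28.27))  # eval, ~35.4 tokens/sec
```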
So you're actually doing better than I'd expect an M2 Max to do.