M2 Ultra user here. I threw some numbers up for token counts: https://www.reddit.com/r/LocalLLaMA/comments/183bqei/comment/kaqf2j0/?context=3
Does a big memory let you increase the context length with smaller models where the parameters don't fill the memory?
With the 147GB of VRAM I have available, I'm pretty sure I could use all 200k tokens available in a Yi 34b model, but I'd be waiting half an hour for a result. I've done up to 50k in CodeLlama, and it took a solid 10 minutes to get a response.
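For anyone wanting to sanity-check the memory side of that, here's a rough back-of-the-envelope sketch of KV cache size versus context length. The model shape (60 layers, 8 KV heads, head dim 128) is my assumption for a Yi-34B-style model, so check the model's config before trusting it. The q8 weight size assumes a nominal 34B parameters at roughly 8.5 bits per weight, and the fp16 cache is also an assumption. It ignores compute buffers and other overhead, so treat the result as a lower bound, not a measurement.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Bytes needed to hold keys and values for every layer at n_ctx tokens."""
    # The leading 2 counts one K tensor plus one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# ASSUMED shape for a Yi-34B-style model; verify against the model's config.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 60, 8, 128

# 200k context with an fp16 cache (2 bytes per element, also an assumption).
kv_bytes = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_ctx=200_000)

# ASSUMED: nominal 34e9 parameters at ~8.5 bits per weight for a q8 quant.
weight_bytes = 34e9 * 8.5 / 8

total_gb = (kv_bytes + weight_bytes) / 1e9
available_gb = 147  # the figure quoted above

print(f"KV cache: {kv_bytes / 1e9:.1f} GB")
print(f"Weights:  {weight_bytes / 1e9:.1f} GB")
print(f"Total:    {total_gb:.1f} GB of {available_gb} GB")
```

Under those assumptions the cache plus weights stay well under 147GB, so memory isn't the blocker at 200k; the wait for prompt processing is.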
The M2 Ultra's big draw is its big RAM; it's not worth it unless you get the 128GB model or higher. You have to understand that the speed of the M2 Ultra doesn't remotely compare to something like a 4090; CUDA cards are gonna leave us in the dust.
Another thing to consider is that we can only use GGUFs via llama.cpp; there's no support for anything else. In that regard, I've seen people put together builds with 3 or more Tesla P40s that have the exact same limitation (llama.cpp only) but cost half the price or less.
I chose the M2 Ultra because it was easy. Big VRAM, and it took me less than 30 minutes from the moment I got the box to be chatting to a 70b q8 on it. But if speed or price is a major consideration, more so than the level of effort to set up? In that case the M2 Ultra would not be the answer.