I'm looking to get an M3 Max for reasons other than inference, but would like to factor in local LLMs, and would really appreciate some thoughts on the configurations. The ones I'm looking at are:
- a) M3 Max 16" 64GB with 16/CPU, 40/GPU, 16/NE, 400GBs memory bandwidth
- b) M3 Max 16" 96GB with 14/CPU, 40/GPU, 16/NE, 300GBs memory bandwidth
- c) M3 Max 14" 96GB with 14/CPU, 40/GPU, 16/NE, 300GBs memory bandwidth
(moving up to the 14" and 16" 128GB options with 15/40/16 I want to keep off the table, price point wise.)My sense is the main LLM tradeoff comes down to ram and bandwidth, with ram dictating what models can effectively be loaded and bandwidth dictating speed and token/ps. My intuition, very possibly wrong, is that if responsiveness or good interactivity isn't the main factor, I can prefer ram over bandwidth. That said, I would like to use 70B models if possible, but I'm also unclear whether 64GB ram can hoist 70B models, just heavily quantized ones, or none at all.
I did see a few posts suggesting that's possible but not specifically for the M3 line of configs above (apologies if I missed them):
- M2 Max
- M3 Max, 128GB
- M3 vs M2 Max
- LLM Performance on M3 Max
The main non-LLM factor is the larger screen, and the default choice absent LLMs is option a, since 64GB covers my other workloads. But wanting headroom for ~70B LLMs is leaning me toward option b, which trades up on RAM and down on bandwidth, since interactivity is (probably) less important right now than model size. I'm aware I might have built up some bad assumptions, though; rough numbers on the a-vs-b speed tradeoff below.
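To isolate just the a-vs-b speed question, here's the same bandwidth-bound ceiling for one model that fits both configs, a 70B at Q4_K_M, which I'm estimating at roughly 42GB (same caveats and assumptions as the sketch above):

```python
# Bandwidth-bound ceiling for a single model that fits both configs:
# a 70B Q4_K_M at roughly 42 GB (my estimate, same caveats as above).
size_gb = 42
for label, bw in [("a) 400 GB/s", 400), ("b/c) 300 GB/s", 300)]:
    print(f"{label}: ~{bw / size_gb:.1f} tok/s ceiling")
# roughly 9.5 vs 7.1 tok/s: option b is ~25% slower on the same model, but is
# the only one of the two with headroom for larger quants or bigger contexts.
```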
From those, I would choose c.