I don't know if Exllama 2 supports Mac, but if it does, go for a 70B model.
It doesn't
I'm using an M1 Max Mac Studio with 64GB of memory, of which I can use up to 48GB as VRAM. I don't know how much memory your M3 Pro has, so I can only speak to my own case. 7B models are easy. 13B and 20B models are okay. 30B models are probably okay too. Anything larger than 30B is tough.
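To put rough numbers on the sizes above, here is a minimal sketch that estimates how much memory the quantized weights alone would take. The bits-per-weight values are those of the GGUF Q4_0 and Q8_0 block formats (4.5 and 8.5); the 2 GiB `overhead_gib` allowance for context and buffers is purely my assumption, and the 48GB budget is treated as GiB for simplicity.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float,
         budget_gib: float, overhead_gib: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in the budget."""
    return weights_gib(params_billion, bits_per_weight) + overhead_gib <= budget_gib


BUDGET_GIB = 48  # memory usable as VRAM in the case described above

for size in (7, 13, 20, 30, 70):
    for name, bpw in (("Q4_0", 4.5), ("Q8_0", 8.5)):
        gib = weights_gib(size, bpw)
        verdict = "fits" if fits(size, bpw, BUDGET_GIB) else "does not fit"
        print(f"{size}B {name}: ~{gib:.1f} GiB of weights, {verdict}")
```

This is only a lower bound on memory. A model that fits on paper can still be tough to use in practice, because the estimate says nothing about speed.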
One thing is certain: you must use llama.cpp or one of its variants (oobabooga with the llama.cpp loader, or koboldcpp, which is derived from llama.cpp) for Metal acceleration. llama.cpp and GGUF will be your friends. llama.cpp is the only program that properly supports Metal acceleration with quantized models.
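For anyone who has not run llama.cpp before, here is a rough sketch of what an invocation with GPU offload looks like, built as an argument list in Python. The binary name `./main`, the model path, and the helper `llama_cpp_command` are hypothetical placeholders; `-m`, `-ngl`, `-p`, and `-n` are the llama.cpp options for model file, number of GPU layers, prompt, and number of tokens to generate. It assumes your llama.cpp build has Metal enabled.

```python
import shlex


def llama_cpp_command(model_path: str, prompt: str, n_gpu_layers: int = 99,
                      n_predict: int = 128, binary: str = "./main") -> list[str]:
    """Build an argument list for a llama.cpp run (binary name is an assumption)."""
    return [
        binary,
        "-m", model_path,            # GGUF model file
        "-ngl", str(n_gpu_layers),   # layers to offload to the GPU
        "-p", prompt,                # prompt text
        "-n", str(n_predict),        # tokens to generate
    ]


# Hypothetical model path, for illustration only.
cmd = llama_cpp_command("models/example-13b.Q4_0.gguf", "Hello")
print(shlex.join(cmd))
```

You could pass `cmd` to `subprocess.run` to actually launch it. The default of 99 layers is just a value large enough to offload everything for typical models, not a required setting.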
With llama.cpp and its variants, I found that prompt evaluation (the BLAS matrix calculations) is very slow, especially compared to cuBLAS from the NVIDIA CUDA Toolkit. The larger the model, the longer prompt evaluation takes.
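A back-of-the-envelope sketch of why prompt evaluation scales with model size: a dense transformer forward pass costs roughly 2 FLOPs per parameter per token, so the time grows linearly with both parameter count and prompt length. The `effective_tflops` value below is a made-up placeholder, not a measurement of any Mac or NVIDIA GPU.

```python
def prompt_eval_seconds(params_billion: float, prompt_tokens: int,
                        effective_tflops: float) -> float:
    """Rough prompt evaluation time, using ~2 FLOPs per parameter per token."""
    flops = 2 * params_billion * 1e9 * prompt_tokens
    return flops / (effective_tflops * 1e12)


# Placeholder throughput, chosen only to show the scaling.
ASSUMED_TFLOPS = 1.0

for size in (7, 13, 30, 70):
    secs = prompt_eval_seconds(size, 1000, ASSUMED_TFLOPS)
    print(f"{size}B, 1000-token prompt: ~{secs:.0f} s at {ASSUMED_TFLOPS} TFLOPS")
```

Whatever the real throughput is on your machine, the ratio holds under this approximation: a 70B model takes about ten times as long as a 7B model on the same prompt.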
I heard that the M3 GPU design has changed quite a bit, so I guess BLAS may be faster on it, but I'm not sure...