this post was submitted on 26 Nov 2023
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
I'm using an M1 Max Mac Studio with 64GB of memory, of which up to 48GB can be used as VRAM. I don't know how much memory your M3 Pro has, so I can only speak for my own setup. 7B models are easy. 13B and 20B models are okay. 30B models are probably okay too. Anything larger than 30B is tough.
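To get a rough feel for what fits in a given amount of GPU-usable memory, here is a back-of-the-envelope sketch. The bits-per-weight values and the fixed overhead for context and buffers are my own illustrative assumptions, not measurements, and the check says nothing about speed.

```python
# Rough sketch: do a quantized model's weights fit in GPU-usable memory?
# All constants below are illustrative assumptions, not measured values.

def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float,
         budget_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in the budget."""
    return weights_gb(params_billions, bits_per_weight) + overhead_gb <= budget_gb


BUDGET_GB = 48.0  # GPU-usable memory on the 64GB machine described above

for bits in (4.5, 8.0):  # assumed averages for a ~4-bit and an 8-bit quantization
    for size in (7, 13, 20, 30, 70):
        verdict = "fits" if fits(size, bits, BUDGET_GB) else "does not fit"
        print(f"{size}B at {bits} bits/weight: "
              f"~{weights_gb(size, bits):.1f} GB of weights, {verdict}")
```

Swap in your own memory budget for the M3 Pro. Real GGUF files differ somewhat from this estimate because different quantization types mix precisions.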
One thing is certain: you need llama.cpp or one of its variants (oobabooga with the llama.cpp loader, or koboldcpp, which is derived from llama.cpp) for Metal acceleration. llama.cpp and GGUF will be your friends. llama.cpp is the only program that properly supports Metal acceleration with quantized models.
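If you would rather drive llama.cpp from Python, a minimal sketch using the llama-cpp-python bindings could look like the following. It assumes the package was installed with Metal support, and the model path and context size are placeholders I made up; point them at your own GGUF file.

```python
# Minimal sketch, assuming llama-cpp-python is installed with Metal support.
# The model path and n_ctx value are hypothetical placeholders.

def llama_kwargs(model_path: str, n_ctx: int = 4096) -> dict:
    """Keyword arguments for llama_cpp.Llama that offload all layers to the GPU."""
    return {
        "model_path": model_path,
        "n_gpu_layers": -1,  # -1 asks llama.cpp to offload every layer
        "n_ctx": n_ctx,      # context window size in tokens
    }


def run_once(model_path: str, prompt: str) -> str:
    """Load the model and generate a short completion."""
    from llama_cpp import Llama  # imported here so the sketch loads without the package

    llm = Llama(**llama_kwargs(model_path))
    result = llm(prompt, max_tokens=64)
    return result["choices"][0]["text"]


# Example call (not executed here):
# print(run_once("models/your-model.Q4_K_M.gguf", "Hello, my name is"))
```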
Using llama.cpp or its variants, I found that prompt evaluation (the BLAS matrix calculation) is very slow, especially compared with cuBLAS from the NVIDIA CUDA Toolkit. The bigger the model, the longer prompt evaluation takes.
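The scaling with model size follows from a common rule of thumb: a forward pass over a dense transformer costs roughly 2 floating-point operations per parameter per token. The sketch below applies that approximation. The sustained throughput figure is a made-up placeholder, not a benchmark of any Apple or NVIDIA GPU, and the estimate ignores attention cost and memory bandwidth.

```python
# Rough sketch of why prompt evaluation time grows with model size.
# Uses the ~2 FLOPs per parameter per token approximation for a forward pass.
# The throughput value is a hypothetical placeholder, not a measurement.

def prompt_eval_seconds(params_billions: float, prompt_tokens: int,
                        sustained_tflops: float) -> float:
    """Estimated time to process a prompt, ignoring attention and memory limits."""
    flops = 2 * params_billions * 1e9 * prompt_tokens
    return flops / (sustained_tflops * 1e12)


ASSUMED_TFLOPS = 5.0   # hypothetical sustained throughput
PROMPT_TOKENS = 1000

for size in (7, 13, 30, 70):
    seconds = prompt_eval_seconds(size, PROMPT_TOKENS, ASSUMED_TFLOPS)
    print(f"{size}B model, {PROMPT_TOKENS}-token prompt: ~{seconds:.1f} s")
```

Under this model the time is linear in both parameter count and prompt length, so doubling the model size doubles the wait before the first generated token.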
I heard that the M3's GPU design has changed quite a bit, so it may speed up BLAS, but I'm not sure...