this post was submitted on 26 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Hi everyone, I just received my new MacBook Pro with an M3 Pro (36 GB) and I'm looking to have some fun with a local Llama now that I finally have a machine able to run it!

Main question is: which version can I run without burning my chip? 13B? 70B?

Also, if you have any useful resources you can suggest for getting into this game, feel free to share; I'm a real beginner with LLMs. I'm a young Python dev for fun and a JavaScript dev for food, and I mainly want a local Llama to assist me with web dev!

Thanks to guys like Alex Ziskind, I'm aware it's not really reliable as a coding assistant for now, but I'm really curious about what it can do. If you have a better tool I could set up on my machine to assist me, feel free to share; I'm really willing to devour some resources on this subject.

Thank you all for your help and advice!

top 3 comments
[–] ThisGonBHard@alien.top 1 points 11 months ago (1 children)

I don't know if ExLlamaV2 supports Mac, but if it does, 70B.

[–] FlishFlashman@alien.top 1 points 11 months ago
[–] bebopkim1372@alien.top 1 points 11 months ago

I'm using an M1 Max Mac Studio with 64GB of memory, and I can use up to 48GB of it as VRAM. I don't know how much memory your M3 Pro has, so I'm just talking about my case. 7B models are easy. 13B and 20B models are okay. Maybe 30B models are also okay. More than 30B models are tough.
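
For a rough sense of what fits, a quantized model's file size is roughly its parameter count times the bits per weight divided by 8, plus a couple of GB of overhead for the KV cache and context. A back-of-envelope sketch in Python; the ~75% usable-memory default and ~4.8 bits/weight for Q4_K_M quantization are rough assumptions, not exact numbers:

```python
# Rough sizing: which quantized GGUF models fit in Apple Silicon "VRAM"?
# Assumptions (approximate): macOS lets the GPU use roughly 75% of unified
# memory by default, and Q4_K_M quantization costs about 4.8 bits per weight.

def model_size_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Approximate quantized model size in GB."""
    return params_billion * bits_per_weight / 8

def fits(params_billion: float, ram_gb: float, overhead_gb: float = 2.0) -> bool:
    """Very rough check: model + KV cache/overhead vs. usable GPU memory."""
    usable_gb = ram_gb * 0.75  # default GPU working-set limit (approximate)
    return model_size_gb(params_billion) + overhead_gb <= usable_gb

for b in (7, 13, 20, 34, 70):
    print(f"{b:>3}B @ Q4_K_M ≈ {model_size_gb(b):5.1f} GB ->",
          "fits" if fits(b, ram_gb=36) else "too big", "on a 36 GB M3 Pro")
```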

One definite thing is that you must use llama.cpp or one of its variants (oobabooga's text-generation-webui with the llama.cpp loader, or koboldcpp, which is derived from llama.cpp) for Metal acceleration. llama.cpp and GGUF will be your friends. llama.cpp is the only program that supports Metal acceleration properly with quantized models.
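
Since the OP is a Python dev, the llama-cpp-python bindings are probably the easiest way to drive llama.cpp from code. A minimal sketch, assuming you've already downloaded a GGUF file (the model path and parameters below are placeholders to adjust):

```python
# Minimal llama-cpp-python sketch with Metal offload on Apple Silicon.
# Install with: pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b-chat.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the GPU (Metal)
    n_ctx=4096,       # context window size
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a debounce function in JavaScript."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```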

Using llama.cpp or its variants, I found that prompt evaluation (the BLAS matrix calculations) is quite slow, especially compared to cuBLAS from the NVIDIA CUDA Toolkit. The bigger the model, the longer prompt evaluation takes.
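
If you want to see this yourself, llama.cpp prints timing stats at the end of a run, or you can time prompt ingestion and generation separately. A rough sketch reusing the llm object from the snippet above (the prompts here are just filler):

```python
import time

long_prompt = "Summarize this code:\n" + ("def f(x): return x + 1\n" * 400)

t0 = time.time()
llm(long_prompt, max_tokens=1)   # dominated by prompt evaluation
t1 = time.time()
llm("Write a haiku about GPUs.", max_tokens=128)  # dominated by token generation
t2 = time.time()

print(f"prompt eval ~{t1 - t0:.1f}s, generation ~{t2 - t1:.1f}s")
```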

I heard the M3 GPU design has changed quite a bit, so I guess it may speed up BLAS, but I'm not sure...