WhereIsYourMind

joined 10 months ago
[–] WhereIsYourMind@alien.top 1 points 9 months ago

I run a code completion server that works like GitHub Copilot. I'm also working on a Mail labeling system using llama.cpp and AppleScript, but it is very much a work in progress.

[–] WhereIsYourMind@alien.top 1 points 9 months ago (1 children)

I can run Q4 Falcon-180B on my M3 Max (40 GPU cores) with 128GB RAM. I get 2.5 t/s, which is crazy for a mobile chip.

[–] WhereIsYourMind@alien.top 1 points 9 months ago (2 children)

I have the M3 Max with 128GB memory / 40 GPU cores.

You have to load a kernel extension to allocate more than 75% of the total SoC memory (128GB * 0.75 = 96GB) to the GPU. I increased it to 90% (about 115GB) and can run Falcon-180B Q4_K_M at 2.5 tokens/s.
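
For anyone who wants to redo the arithmetic for a different machine, here is a minimal sketch of the limit calculation. The function name `gpu_memory_limit_gb` is made up for illustration; it only multiplies total memory by a fraction and does not change any system setting.

```python
def gpu_memory_limit_gb(total_gb: float, fraction: float) -> float:
    """Return the GPU-usable memory in GB for a given fraction of total SoC memory.

    Hypothetical helper for illustration only; it performs arithmetic and
    does not touch any macOS setting.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return total_gb * fraction


TOTAL_GB = 128  # M3 Max configuration from the post

default_limit = gpu_memory_limit_gb(TOTAL_GB, 0.75)  # default cap described above
raised_limit = gpu_memory_limit_gb(TOTAL_GB, 0.90)   # cap after raising it

print(f"default (75%): {default_limit:.1f} GB")  # default (75%): 96.0 GB
print(f"raised (90%): {raised_limit:.1f} GB")    # raised (90%): 115.2 GB
print(f"extra: {raised_limit - default_limit:.1f} GB")  # extra: 19.2 GB
```

So going from 75% to 90% frees up roughly 19GB more for the GPU on a 128GB machine.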