FlishFlashman

joined 10 months ago
[–] FlishFlashman@alien.top 1 points 9 months ago

With ≥64GB of RAM, macOS allows 75% to be used by the GPU; with ≤32GB it's ~66%. Not sure about the 36GB machines.
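A minimal sketch of that rule, assuming the two tiers above; the cutoff behavior for in-between sizes (36/48GB) is a guess here, as the comment says:

```python
def gpu_usable_gb(total_ram_gb: float) -> float:
    """Approximate macOS GPU working-set limit.

    Assumption: 75% of RAM at >=64GB, ~2/3 at <=32GB; where exactly the
    threshold falls for 36/48GB machines is unknown.
    """
    frac = 0.75 if total_ram_gb >= 64 else 2 / 3
    return total_ram_gb * frac

print(gpu_usable_gb(64))  # 48.0
print(gpu_usable_gb(32))  # ~21.3
```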

[–] FlishFlashman@alien.top 1 points 9 months ago (1 children)

It's not going to help because the model data is much larger than the cache and the access pattern is basically long sequential reads.

[–] FlishFlashman@alien.top 1 points 9 months ago

A chip that won't be available for ~6 months will be better than a chip that came out a year ago? Amazing ;)

[–] FlishFlashman@alien.top 1 points 9 months ago (1 children)

GPT4All is similar to LM Studio, but it also includes the ability to load a document library and generate text against it (retrieval-augmented generation).

[–] FlishFlashman@alien.top 1 points 9 months ago (1 children)

180GB/s isn't really all that fast.
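For context, token generation is typically memory-bandwidth bound: each new token has to stream every model weight once, so bandwidth divided by model size gives a rough ceiling on tokens/s. A sketch, assuming a hypothetical ~7GB model (roughly a 13B at 4-bit) and ballpark bandwidth figures:

```python
def peak_tokens_per_s(mem_bw_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed for a bandwidth-bound workload:
    every generated token requires reading all weights once."""
    return mem_bw_gb_s / model_size_gb

# 180 GB/s vs. M1 Max-class (~400 GB/s) and Ultra-class (~800 GB/s)
for bw in (180, 400, 800):
    print(f"{bw} GB/s -> ~{peak_tokens_per_s(bw, 7):.0f} tok/s ceiling")
```

Real throughput lands well below these ceilings, but the ratios hold: 180GB/s is a fraction of what the bigger Apple Silicon chips or a high-end GPU can stream.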

[–] FlishFlashman@alien.top 1 points 9 months ago (2 children)

It's a lot cheaper than paying humans to spread propaganda.

Consider that the audience isn't you, it's people who lack discernment. It's like those scam emails. People with good judgement delete them.

The other audience is engagement algorithms.

[–] FlishFlashman@alien.top 1 points 9 months ago (1 children)

That's probably the argument for all cloud architecture.

Long-term cost and risk might be persuasive, but that hasn't swayed IT managers thus far for non-LLM-specific infrastructure.

It's 2023. What are you talking about? Where have you been?

[–] FlishFlashman@alien.top 1 points 9 months ago

Apple Silicon Macs are great options for running LLMs, especially if you want to run a large LLM on a laptop. That said, there aren't big performance differences between the M1 Max and M3 Max, at least not for text generation; prompt processing does show generational improvements. Maybe this will change in future versions of macOS if optimizations unlock better Metal Performance Shaders performance on later GPU generations, but for now they are pretty similar.

Apple Silicon Macs aren't currently a great option for training/fine-tuning models. There isn't much software support for GPU acceleration during training on Apple Silicon.

[–] FlishFlashman@alien.top 1 points 10 months ago (1 children)

If model size is a priority, Apple Silicon Macs (particularly used or factory-refurbished Mac Studio Ultras) provide good value in terms of cost, available memory, and performance (e.g., $4,679 for 128GB, of which 96GB is usable by the GPU for the model plus working data). A workstation GPU or multiple high-end consumer GPUs can be faster, but they're also more expensive, draw more power, need a bigger case, are louder...

Software options for doing training or fine-tuning on Macs using the GPU are limited at this point, but will probably improve. This might also be something better done with a short-term rental of a cloud server.

[–] FlishFlashman@alien.top 1 points 10 months ago (1 children)

What are you using to run them?

In any case, larger context models require *a lot* more RAM/VRAM.

[–] FlishFlashman@alien.top 1 points 10 months ago

What quantization are you using? Smaller tends to be faster.

I get 30 tokens/s with a q4_0 quantization of 13B models on a M1 Max on Ollama (which uses llama.cpp). You should be in the same ballpark with the same software. You aren't going to do much/any better than that. The M3's GPU made some significant leaps for graphics, and little to nothing for LLMs.

Allowing more threads isn't going to help generation speed, though it might improve prompt processing. It's probably best to keep the number of threads at or below the number of performance cores.
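Since generation is bandwidth bound, a smaller quantization means fewer bytes to stream per token, which is why smaller tends to be faster. Rough sizes, using approximate bits-per-weight figures (including block scales) that are my ballpark assumptions for common llama.cpp formats:

```python
# Approximate bits per weight, including block-scale overhead (assumed values)
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

def approx_size_gb(n_params: float, quant: str) -> float:
    """Rough on-disk/in-memory size of a model at a given quantization."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in ("f16", "q8_0", "q4_0"):
    print(f"13B {q}: ~{approx_size_gb(13e9, q):.1f} GB")
```

A 13B q4_0 comes out around 7GB, under a third of the f16 size, and per-token read traffic shrinks proportionally.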
