FlishFlashman

joined 10 months ago
[–] FlishFlashman@alien.top 1 points 9 months ago

With ≥64GB of RAM, macOS allows 75% to be used by the GPU; with ≤32GB it's ~66%. Not sure about the 36GB machines.
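A minimal sketch of that rule, assuming the two tiers above; the cutoff behavior for in-between sizes (36/48GB) is a guess here, as the comment says:

```python
def gpu_usable_gb(total_ram_gb: float) -> float:
    """Approximate macOS GPU working-set limit.

    Assumption: 75% of RAM at >=64GB, ~2/3 at <=32GB; where exactly the
    threshold falls for 36/48GB machines is unknown.
    """
    frac = 0.75 if total_ram_gb >= 64 else 2 / 3
    return total_ram_gb * frac

print(gpu_usable_gb(64))  # 48.0
print(gpu_usable_gb(32))  # ~21.3
```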

[–] FlishFlashman@alien.top 1 points 9 months ago (1 children)

It's not going to help because the model data is much larger than the cache and the access pattern is basically long sequential reads.

[–] FlishFlashman@alien.top 1 points 9 months ago

A chip that won't be available for ~6 months will be better than a chip that came out a year ago? Amazing ;)

[–] FlishFlashman@alien.top 1 points 9 months ago (1 children)

GPT4All is similar to LM Studio, but it also includes the ability to load a document library and generate text against it (retrieval-augmented generation).

[–] FlishFlashman@alien.top 1 points 9 months ago (1 children)

180GB/s isn't really all that fast.
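For context, token generation is typically memory-bandwidth bound: each new token has to stream every model weight once, so bandwidth divided by model size gives a rough ceiling on tokens/s. A sketch, assuming a hypothetical ~7GB model (roughly a 13B at 4-bit) and ballpark bandwidth figures:

```python
def peak_tokens_per_s(mem_bw_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed for a bandwidth-bound workload:
    every generated token requires reading all weights once."""
    return mem_bw_gb_s / model_size_gb

# 180 GB/s vs. M1 Max-class (~400 GB/s) and Ultra-class (~800 GB/s)
for bw in (180, 400, 800):
    print(f"{bw} GB/s -> ~{peak_tokens_per_s(bw, 7):.0f} tok/s ceiling")
```

Real throughput lands well below these ceilings, but the ratios hold: 180GB/s is a fraction of what the bigger Apple Silicon chips or a high-end GPU can stream.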

[–] FlishFlashman@alien.top 1 points 9 months ago (2 children)

It's a lot cheaper than paying humans to spread propaganda.

Consider that the audience isn't you, it's people who lack discernment. It's like those scam emails. People with good judgement delete them.

The other audience is engagement algorithms.

[–] FlishFlashman@alien.top 1 points 9 months ago (1 children)

That's probably the argument for all cloud architecture.

Long-term cost and risk might be persuasive, but that hasn't swayed IT managers thus far for non-LLM-specific infrastructure.

It's 2023. What are you talking about? Where have you been?

[–] FlishFlashman@alien.top 1 points 9 months ago

Apple Silicon Macs are great options for running LLMs, especially if you want to run a large LLM on a laptop. That said, there aren't big performance differences between the M1 Max and M3 Max, at least not for text generation; prompt processing does show generational improvements. Maybe this will change in future versions of macOS if optimizations unlock better Metal Performance Shaders performance on later GPU generations, but for now they are pretty similar.

Apple Silicon Macs aren't currently a great option for training/fine-tuning models. There isn't much software support for GPU acceleration during training on Apple Silicon.

[–] FlishFlashman@alien.top 1 points 10 months ago (1 children)

If model size is a priority, Apple Silicon Macs (particularly used or factory-refurbished Mac Studio Ultras) provide good value in terms of cost, available memory, and performance (e.g., $4,679 for 128GB, of which 96GB is usable by the GPU for the model plus working data). A workstation GPU or multiple high-end consumer GPUs can be faster, but they're also more expensive, draw more power, need a bigger case, are louder...

Software options for doing training or fine-tuning on Macs using the GPU are limited at this point, but will probably improve. This might also be something better done with a short-term rental of a cloud server.

[–] FlishFlashman@alien.top 1 points 10 months ago (1 children)

What are you using to run them?

In any case, larger context models require *a lot* more RAM/VRAM.

[–] FlishFlashman@alien.top 1 points 10 months ago

What quantization are you using? Smaller tends to be faster.

I get 30 tokens/s with a q4_0 quantization of 13B models on a M1 Max on Ollama (which uses llama.cpp). You should be in the same ballpark with the same software. You aren't going to do much/any better than that. The M3's GPU made some significant leaps for graphics, and little to nothing for LLMs.

Allowing more threads isn't going to help generation speed, though it might improve prompt processing. It's probably best to keep the number of threads at or below the number of performance cores.
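Since generation is bandwidth bound, a smaller quantization means fewer bytes to stream per token, which is why smaller tends to be faster. Rough sizes, using approximate bits-per-weight figures (including block scales) that are my ballpark assumptions for common llama.cpp formats:

```python
# Approximate bits per weight, including block-scale overhead (assumed values)
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

def approx_size_gb(n_params: float, quant: str) -> float:
    """Rough on-disk/in-memory size of a model at a given quantization."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in ("f16", "q8_0", "q4_0"):
    print(f"13B {q}: ~{approx_size_gb(13e9, q):.1f} GB")
```

A 13B q4_0 comes out around 7GB, under a third of the f16 size, and per-token read traffic shrinks proportionally.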
