irregardless

joined 1 year ago
[–] irregardless@alien.top 1 points 11 months ago

Totally feasible to run LLMs at useful speeds. I'm running a 64 GB M1 Max (10-core CPU, 32-core GPU). With LM Studio, I typically get:

  • 3-4 T/s using q5_k_m quants of ~70B models
  • 6-9 T/s from q5_* and q6_k quants of ~30B models
  • 25-30 T/s from q6_k and q8 quants of 7B models
  • around 20 T/s from unquantized fp16 7B models

And this is my daily work and play machine, so I usually have all sorts of browser tabs and applications open simultaneously while running the models. From a fresh boot, it's cool to be able to load an entire model into memory and still be able to do "normal" work without having to use any swap space at all.
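
For anyone wondering why these sizes fit in 64 GB, here is a rough back-of-the-envelope sketch of the arithmetic. The bits-per-weight figures are approximate assumptions (real quant files vary by model and quant recipe), and `reserved_gb` is a made-up allowance for the OS, apps, and context cache, not something I measured.

```python
# Rough weight size: parameters * bits per weight / 8.
# ASSUMPTION: these bits-per-weight values are approximations, not measured figures.
APPROX_BITS_PER_WEIGHT = {
    "q5_k_m": 5.5,
    "q6_k": 6.6,
    "q8_0": 8.5,
    "fp16": 16.0,
}


def weights_gb(params_billions: float, quant: str) -> float:
    """Estimate weight size in GB (1 GB = 1e9 bytes).

    Ignores the KV cache and runtime overhead, so treat it as a lower bound.
    """
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # billions of parameters * bytes per parameter = gigabytes
    return params_billions * bits / 8


def fits(params_billions: float, quant: str,
         ram_gb: float = 64.0, reserved_gb: float = 12.0) -> bool:
    """True if the weights fit in RAM after a hypothetical reserve for everything else."""
    return weights_gb(params_billions, quant) <= ram_gb - reserved_gb


if __name__ == "__main__":
    for size, quant in [(70, "q5_k_m"), (30, "q6_k"), (7, "q8_0"), (7, "fp16")]:
        print(f"{size}B {quant}: ~{weights_gb(size, quant):.1f} GB, "
              f"fits in 64 GB: {fits(size, quant)}")
```

By this estimate a ~70B q5_k_m model is the largest of the four that still leaves headroom, which lines up with it also being the slowest one I run.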

[–] irregardless@alien.top 1 points 11 months ago

I'll go out on a limb and say that no one has compiled a glossary or encyclopedia of the various fine-tunes that seem to get published every day (if I'm wrong, I'm sure someone will correct me). If you're not connected to "the scene", or working with these models academically/professionally, it can be hard to become and stay initiated into the "secret" jargon that's developed around local LLMs. You can pick up a lot just by hanging out here, but you'll still run into quite a few things that make you ask "wtf does that mean?".

[–] irregardless@alien.top 1 points 11 months ago

I must be incredibly lucky, or I'm unknowingly some kind of prompting savant, because Claude et al. usually just do what I ask them to.

The only time Claude outright refused a request was when I was looking for some criticism of a public figure of recent history as a place to begin some research. But even that had a straightforward workaround using the "I'm writing a novel based on this person" stratagem.