Sea_Particular_4014

joined 10 months ago
[–] Sea_Particular_4014@alien.top 1 points 9 months ago

Well... none at all if you're happy with ~1 token per second or less using GGUF CPU inference.

I have 1 x 3090 24GB and get about 2 tokens per second with partial offload. I find it usable for most stuff, but many people find that too slow.

You'd need 2 x 3090 or an A6000 or something to do it quickly.
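
Rough back-of-envelope for why a single 24GB card only manages partial offload, if it helps; the file size, layer count, and overhead below are approximate assumptions, not measured figures:

```python
# Back-of-envelope partial-offload estimate (all figures are rough assumptions).
model_size_gb = 41        # ~70B model at Q4_K_M
n_layers = 80             # Llama-2 70B layer count
vram_gb = 24              # single 3090/4090
overhead_gb = 3           # KV cache + CUDA buffers; grows with context size

gb_per_layer = model_size_gb / n_layers
layers_on_gpu = int((vram_gb - overhead_gb) / gb_per_layer)

print(f"~{gb_per_layer:.2f} GB per layer")
print(f"roughly {layers_on_gpu} of {n_layers} layers fit on the GPU")
# About half the model ends up on the GPU, the rest on CPU, hence ~2 tokens/sec.
```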

[–] Sea_Particular_4014@alien.top 1 points 9 months ago

If you're on Windows, I'd download KoboldCPP and TheBloke's GGUF models from HuggingFace.

Then you just launch KoboldCPP, select the .gguf file, select your GPU, enter the number of layers to offload, set the context size (4096 for those models), etc., and start it up.

Then you're good to start messing around. You can use the Kobold interface that'll pop up, or use it through the API with something like SillyTavern.
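
If you'd rather script against it than use SillyTavern, here's a minimal sketch of hitting the KoboldAI-compatible API that KoboldCPP exposes; it assumes the default port (5001) and a couple of arbitrary sampler settings:

```python
# Minimal sketch of calling KoboldCPP's KoboldAI-compatible API once it's running.
# Assumes the default port (5001); adjust the URL if you launched it differently.
import requests

payload = {
    "prompt": "Once upon a time,",
    "max_length": 120,      # number of tokens to generate
    "temperature": 0.7,
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload)
resp.raise_for_status()
print(resp.json()["results"][0]["text"])
```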

[–] Sea_Particular_4014@alien.top 1 points 9 months ago (11 children)

Adding to Automata's theoretical info, I can say that anecdotally I find 4-bit 70B substantially better than 8-bit 34B or below, but it'll depend on your task.

It seems like right now the 70B models are really good for story writing, RP, logic, etc., while if you're doing programming, data classification, or similar, you might be better off with a higher-precision smaller model that's been fine-tuned for the task at hand.

I noticed in my 70B circlejerk rant thread from a couple of days ago that most of the people saying they didn't find the 70B that much better (or better at all) were doing programming or data classification type stuff.

[–] Sea_Particular_4014@alien.top 1 points 9 months ago (1 children)

Your 512GB of RAM is overkill. Those Xeons are probably pretty mediocre for this sort of thing due to the slow memory, unfortunately.

With a 4090 or 3090, you should get about 2 tokens per second with GGUF Q4_K_M inference. That's what I do, and I find it tolerable, but it depends on your use case.

You'd need a 48GB GPU or fast DDR5 RAM to get faster generation than that.
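
The reason slow memory hurts so much: token generation is roughly memory-bandwidth bound, since the weights have to be streamed for every token. A rough sketch, where the bandwidth and size figures are approximate assumptions:

```python
# Rule of thumb: each generated token streams (most of) the weights from memory,
# so tokens/sec is capped by bandwidth / model size. Figures are approximations.
model_size_gb = 41  # ~70B at Q4_K_M

bandwidth_gb_s = {
    "quad-channel DDR4 Xeon": 85,
    "dual-channel DDR5-6000": 96,
    "3090 VRAM": 936,
}

for name, bw in bandwidth_gb_s.items():
    print(f"{name}: ~{bw / model_size_gb:.1f} tokens/sec upper bound")
# Real-world speeds land below these ceilings; partial offload sits somewhere in between.
```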

[–] Sea_Particular_4014@alien.top 1 points 9 months ago

Q4_0 and Q4_1 are both legacy quant formats.

The _K_M suffix is the newer "k-quant" format (I guess it's not that new anymore; it's been around for months now).

The idea is that the more important layers are done at a higher precision, while the less important layers are done at a lower precision.

It seems to work well, which is why it has become the new standard for the most part.

Q4_K_M does the most important layers at 5-bit and the less important ones at 4-bit.

It is closer in quality/perplexity to Q5_0, while being closer in size to Q4_0.
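
To put rough numbers on that size comparison, here's a quick estimate from approximate bits-per-weight figures (the bpw values are rough assumptions; real GGUF files vary a bit):

```python
# Quick file-size estimate from approximate bits-per-weight values.
n_params = 70e9  # 70B model

quants = {
    "Q4_0":   4.5,    # legacy 4-bit
    "Q4_K_M": 4.85,   # mixed 4/5-bit k-quant
    "Q5_0":   5.5,    # legacy 5-bit
}

for name, bpw in quants.items():
    size_gb = n_params * bpw / 8 / 1e9
    print(f"{name}: ~{size_gb:.0f} GB")
# Q4_K_M ends up only a few GB above Q4_0 while tracking Q5_0's quality more closely.
```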

[–] Sea_Particular_4014@alien.top 1 points 9 months ago (2 children)

I'd try Goliath 120B and lzlv 70B. Those are the absolute best I've used, assuming you're doing story writing / RP and stuff.

lzlv should be as speedy as can be and fit easily in VRAM.

Goliath won't quite fit at 4-bit, but you could use a lower precision, or sacrifice some speed and run a Q4_K_M GGUF with most of the layers offloaded. That'd be my choice, but I have a high tolerance for slow generation.