LocalLLaMA

4 readers

4 users here now

Community to discuss about Llama, the family of large language models created by Meta AI.

founded 2 years ago

MODERATORS

communick@poweruser.forum

Quantizing 70b models to 4-bit, how much does performance degrade? (alien.top)

submitted 2 years ago by ae_dataviz@alien.top to c/localllama@poweruser.forum

22 comments fedilink hide all child comments

The title, pretty much.

I'm wondering whether a 70b model quantized to 4bit would perform better than a 7b/13b/34b model at fp16. Would be great to get some insights from the community.

you are viewing a single comment's thread
view the rest of the comments

[–] Sea_Particular_4014@alien.top 1 points 2 years ago (11 children)

Adding into Automata's theoretical info, I can say that anecdotally I find 4bit 70B substantially better than 8bit 34B or below, but it'll depend on your task.

It seems like right now the 70b are really good for storywriting, RP, logic, etc, while if you're doing programming or data classification or similar you might be better off with a high precision smaller model that's been fine-tuned towards the task at hand.

I noticed in my 70b circle jerk rant thread I posted a couple days ago, most of the people saying they didn't find the 70b that much better (or better at all) were doing programming or data classification type stuff.

[–] AnOnlineHandle@alien.top 1 points 2 years ago (10 children)

What sort of vram is needed to run a 4bit 70B model?

[–] Sea_Particular_4014@alien.top 1 points 2 years ago

Well... none at all if you're happy with ~1 token per second or less using GGUF CPU inference.

I have 1 x 3090 24GB and get about 2 tokens per second with partial offload. I find it usable for most stuff but many people find that too slow.

You'd need 2 x 3090 or an A6000 or something to do it quickly.

load more comments (9 replies)