LocalLLaMA

3 readers

1 users here now

Community to discuss about Llama, the family of large language models created by Meta AI.

founded 1 year ago

MODERATORS

communick@poweruser.forum

Quantizing 70b models to 4-bit, how much does performance degrade? (alien.top)

submitted 11 months ago by ae_dataviz@alien.top to c/localllama@poweruser.forum

22 comments fedilink hide all child comments

The title, pretty much.

I'm wondering whether a 70b model quantized to 4bit would perform better than a 7b/13b/34b model at fp16. Would be great to get some insights from the community.

you are viewing a single comment's thread
view the rest of the comments

[–] AutomataManifold@alien.top 1 points 11 months ago

Early research suggested that there was an inflection point below 4-bits, where things got markedly worse. In my personal use, I find that accuracy definitely suffers below there, though maybe modern quants are a bit better at it.

34B Yi does seem like a sweet spot, though I'm starting to suspect that we need some fine-tunes that use longer stories as part of the training data, because it doesn't seem to be able to maintain the quality for the entire length of the context. Still, being able to include callbacks to events from thousands of tokens earlier is impressively practical. I've been alternating between a fast 13B (for specific scenes), 34B Yi (for general writing), and 70B (for when you need it to be smart and varied). And, of course, just switching models can help with the repetition sometimes.