LocalLLaMA


A community for discussing Llama, the family of large language models created by Meta AI.


Quantizing 70b models to 4-bit, how much does performance degrade? (alien.top)

submitted 11 months ago by ae_dataviz@alien.top to c/localllama@poweruser.forum


The title, pretty much.

I'm wondering whether a 70B model quantized to 4-bit would perform better than a 7B/13B/34B model at fp16. It would be great to get some insights from the community.


[–] Sea_Particular_4014@alien.top 1 points 11 months ago (11 children)

Adding to Automata's theoretical info, I can say that anecdotally I find 4-bit 70B substantially better than 8-bit 34B or below, but it'll depend on your task.
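For a rough sense of the memory side of that comparison, here is a back-of-the-envelope sketch. It assumes weight storage is simply parameter count times nominal bits per weight; real quantized files carry extra metadata (scales and similar), so actual sizes run somewhat higher. The function name `weight_gb` is just illustrative.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: nominal bits per weight only; quantization metadata,
    KV cache, and activations are ignored.
    """
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


configs = [
    ("70B @ 4-bit", 70, 4),
    ("34B @ 8-bit", 34, 8),
    ("34B @ fp16", 34, 16),
    ("13B @ fp16", 13, 16),
    ("7B @ fp16", 7, 16),
]

for name, params, bits in configs:
    print(f"{name}: ~{weight_gb(params, bits):.0f} GB of weights")
```

Under these assumptions a 4-bit 70B and an 8-bit 34B land at nearly the same weight footprint (about 35 GB vs. 34 GB), so the comparison is roughly "same memory, more parameters at lower precision."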

It seems like right now the 70B models are really good for story writing, RP, logic, etc., while if you're doing programming, data classification, or similar, you might be better off with a higher-precision smaller model that's been fine-tuned for the task at hand.

I noticed that in the 70B circle-jerk rant thread I posted a couple of days ago, most of the people who said they didn't find the 70B that much better (or better at all) were doing programming or data-classification type stuff.

[–] AnOnlineHandle@alien.top 1 points 11 months ago (10 children)

How much VRAM is needed to run a 4-bit 70B model?

[–] Dusty_da_Cat@alien.top 1 points 11 months ago

The gold standard is 2 x 3090/4090 cards, which is 48 GB of VRAM total. You can get by with 2 P40s (they need a cooling solution) and run onboard video if you want to save some money. The speeds will be slower, but still better than running from system RAM on typical setups.
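To see why 48 GB is a comfortable target, here is a minimal sketch of the arithmetic. It assumes the Llama 2 70B architecture (80 layers, 8 key/value heads via grouped-query attention, head dimension 128), an fp16 KV cache, a 4096-token context, and a nominal 4 bits per weight. It ignores quantization metadata, activations, and framework overhead, so treat the result as a lower bound. The helper names are illustrative, not from any library.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size in GB (1 GB = 1e9 bytes).

    The leading factor of 2 accounts for storing both keys and values.
    """
    n_values = 2 * n_layers * n_kv_heads * head_dim * context_len
    return n_values * bytes_per_value / 1e9


def fits(total_gb, vram_gb):
    """True if the estimated footprint is within the available VRAM."""
    return total_gb <= vram_gb


# Assumed values: Llama 2 70B shape, nominal 4-bit weights, 4096-token context.
weights_gb = 70e9 * 4 / 8 / 1e9
cache_gb = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, context_len=4096)
total_gb = weights_gb + cache_gb

print(f"weights ~{weights_gb:.1f} GB, KV cache ~{cache_gb:.2f} GB, total ~{total_gb:.1f} GB")
print("fits in 2 x 24 GB:", fits(total_gb, 48))
print("fits in 1 x 24 GB:", fits(total_gb, 24))
```

With these assumptions the estimate comes to roughly 36 GB, which is too much for a single 24 GB card but leaves headroom on a 48 GB pair for the overheads the sketch leaves out.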
