this post was submitted on 27 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

founded 1 year ago

The title, pretty much.

I'm wondering whether a 70B model quantized to 4-bit would perform better than a 7B/13B/34B model at fp16. Would be great to get some insights from the community.

[–] AnOnlineHandle@alien.top 1 points 11 months ago (10 children)

What sort of VRAM is needed to run a 4-bit 70B model?

[–] Dry-Vermicelli-682@alien.top 1 points 11 months ago (3 children)

44GB of GPU VRAM? WTH GPU has 44GB other than stupid expensive ones? Are average folks running $25K GPUs at home? Or are those running these working for companies with lots of money, building small GPU servers to run them?
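Rough math behind a ~44 GB figure, as a sketch (the effective bits per parameter and the overhead allowance are my assumptions, not numbers from the thread):

```python
# Back-of-envelope VRAM estimate for a 4-bit quantized 70B model.
params = 70e9          # 70 billion parameters
bits_per_param = 4.5   # ~4-bit weights plus quantization metadata (scales/zero points)
weights_gb = params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB
overhead_gb = 5        # rough allowance for KV cache and activations at modest context
print(f"weights: ~{weights_gb:.0f} GB, total: ~{weights_gb + overhead_gb:.0f} GB")
# prints: weights: ~39 GB, total: ~44 GB
```

The weights alone land just under 40 GB, which is why a single 24 GB card doesn't cut it but two of them do.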

[–] MiniEval_@alien.top 1 points 11 months ago (1 children)

Dual 3090/4090s. Still pricey as hell, but not out of reach for some folks.

[–] Dry-Vermicelli-682@alien.top 1 points 11 months ago (1 children)

So anyone wanting to play around with this at home has to expect to drop about $4K or so for GPUs and a setup?

[–] drifter_VR@alien.top 1 points 11 months ago

I can get two 3090s for €1200 here on the second-hand market.
