this post was submitted on 27 Nov 2023

Quantizing 70b models to 4-bit, how much does performance degrade? (alien.top)

submitted 2 years ago by ae_dataviz@alien.top to c/localllama@poweruser.forum

The title, pretty much.

I'm wondering whether a 70b model quantized to 4bit would perform better than a 7b/13b/34b model at fp16. Would be great to get some insights from the community.

[–] Herr_Drosselmeyer@alien.top 1 points 2 years ago

As a rule of thumb, yes: a higher-parameter model at a low quant beats a lower-parameter model at a high quant (or no quant). Take it with a grain of salt, though, as you may still prefer a smaller model that's better tuned for your task.

[–] daHaus@alien.top 1 points 2 years ago

This seems like something that would be difficult to predict, considering how fundamental what you're changing is. The method you use to quantize it, and how refined that method is, also matters a great deal.

[–] semicausal@alien.top 1 points 2 years ago (1 children)

In my experience, the lower you go in bit width, the model:

- hallucinates more (one time I asked Llama2 what made the sky blue and it freaked out and generated thousands of similar questions line by line)

- is more likely to give you an inaccurate response when it doesn't hallucinate

- is significantly more unreliable and non-deterministic (seriously, providing the same prompt can cause different answers!)

At the bottom of this post, I compare the two extremes (2-bit and 8-bit) of the Code Llama Instruct model with the same prompt, and you can see how it played out: https://about.xethub.com/blog/comparing-code-llama-models-locally-macbook

[–] NachosforDachos@alien.top 1 points 2 years ago

That was useful and interesting.

Speaking of hypothetical situations, how much money do you think an individual would need to buy the computing power for a GPT-4 Turbo-like experience locally?

[–] Sea_Particular_4014@alien.top 1 points 2 years ago (1 children)

Adding to Automata's theoretical info, I can say that anecdotally I find 4-bit 70B substantially better than 8-bit 34B or below, but it'll depend on your task.

It seems like right now the 70B models are really good for story writing, RP, logic, etc., while if you're doing programming, data classification, or similar, you might be better off with a higher-precision smaller model that's been fine-tuned for the task at hand.

I noticed in my 70b circle jerk rant thread I posted a couple days ago, most of the people saying they didn't find the 70b that much better (or better at all) were doing programming or data classification type stuff.

[–] AnOnlineHandle@alien.top 1 points 2 years ago (3 children)

What sort of vram is needed to run a 4bit 70B model?

[–] Sea_Particular_4014@alien.top 1 points 2 years ago

Well... none at all if you're happy with ~1 token per second or less using GGUF CPU inference.

I have 1 x 3090 24GB and get about 2 tokens per second with partial offload. I find it usable for most stuff but many people find that too slow.

You'd need 2 x 3090 or an A6000 or something to do it quickly.
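The partial offload described above can be reasoned about with a rough back-of-the-envelope sketch. It assumes the model's size is spread evenly over its transformer layers (Llama 2 70B has 80) and ignores the embedding and output tensors; `reserve_gb` is a hypothetical allowance for the KV cache and scratch buffers, not a measured figure. The result is the kind of number you would pass to llama.cpp's `-ngl` / `--n-gpu-layers` option.

```python
import math


def layers_that_fit(model_gb: float, n_layers: int, vram_gb: float,
                    reserve_gb: float = 3.0) -> int:
    """Estimate how many layers fit on one GPU.

    Assumes every layer is the same size (a simplification) and keeps
    reserve_gb of VRAM free for KV cache and buffers (a guess).
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Assumed: a 70B model at roughly 4.5 bits per weight is about 39.4 GB.
MODEL_GB = 70 * 4.5 / 8

print("24 GB card:", layers_that_fit(MODEL_GB, 80, 24), "of 80 layers")
print("48 GB total:", layers_that_fit(MODEL_GB, 80, 48), "of 80 layers")
```

Whatever doesn't fit stays in system RAM and runs on the CPU, which is why a single 24GB card lands at only a couple of tokens per second.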

[–] Dusty_da_Cat@alien.top 1 points 2 years ago

The gold standard is 2 x 3090/4090 cards, which is 48 GB of VRAM total. You can get by with 2 P40s (they need a cooling solution) and run onboard video if you want to save some money. The speeds will be slower, but still better than running from system RAM on typical setups.

[–] yeawhatever@alien.top 1 points 2 years ago (2 children)

about 44 GB
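A figure like this can be sanity-checked with simple arithmetic: weight memory is roughly parameters times bits per weight, divided by 8. The sketch below is only an estimate under stated assumptions: 4.5 bits per weight is used as an effective rate for a 4-bit quant (4-bit values plus per-block scales), and `overhead_gb` is a hypothetical allowance for KV cache and buffers that grows with context length.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params_billion * bits_per_weight / 8


def estimate_vram_gb(n_params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 4.0) -> float:
    """Weights plus an assumed fixed overhead (KV cache, buffers)."""
    return weight_memory_gb(n_params_billion, bits_per_weight) + overhead_gb


# (label, parameters in billions, bits per weight)
configs = [
    ("70B @ ~4-bit", 70, 4.5),
    ("70B @ ~3-bit", 70, 3.5),
    ("34B @ fp16", 34, 16),
    ("13B @ fp16", 13, 16),
    ("7B @ fp16", 7, 16),
]

for label, params, bpw in configs:
    print(f"{label:14s} weights {weight_memory_gb(params, bpw):6.1f} GB, "
          f"with overhead {estimate_vram_gb(params, bpw):6.1f} GB")
```

Under these assumptions a 4-bit 70B needs less memory than a 34B at fp16, which is part of why the comparison in the original question comes up so often.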

[–] harrro@alien.top 1 points 2 years ago (1 children)

Using Q3, you can fit it in 36GB (I have a weird combo of RTX 3060 with 12GB and P40 with 24GB and I can run a 70B at 3bit fully on GPU).

[–] Dry-Vermicelli-682@alien.top 1 points 2 years ago (1 children)

So you have 2 GPUs on a single motherboard, and llama.cpp knows to use both? Does this work with AMD GPUs too?

[–] harrro@alien.top 1 points 2 years ago

Yes, llama.cpp will automatically split the model to work across GPUs. You can also specify how much of the full model should be on each GPU.

Not sure about AMD support, but for NVIDIA it's pretty easy to do.
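As a rough illustration of specifying the per-GPU share: llama.cpp accepts a comma-separated list of proportions through its `--tensor-split` option. The helper names below are hypothetical, and splitting in proportion to each card's VRAM is just one reasonable starting point, not something the thread prescribes. The value `99` for `--n-gpu-layers` is a common way of asking for all layers to be offloaded and is likewise an assumption here.

```python
def tensor_split(vram_gb: list[float]) -> list[float]:
    """Proportion of the model per GPU, weighted by each card's VRAM."""
    total = sum(vram_gb)
    return [round(v / total, 2) for v in vram_gb]


def split_args(vram_gb: list[float]) -> list[str]:
    """Build llama.cpp command-line arguments for a multi-GPU split.

    Only the arguments are built; the binary name and model path are
    left to the caller because they vary between builds.
    """
    ratios = ",".join(str(r) for r in tensor_split(vram_gb))
    return ["--n-gpu-layers", "99", "--tensor-split", ratios]


# Example: an RTX 3060 (12 GB) alongside a P40 (24 GB).
print(split_args([12, 24]))
```

In practice the split may need nudging by hand, since the first GPU often carries extra buffers beyond its share of the layers.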

[–] Dry-Vermicelli-682@alien.top 1 points 2 years ago (1 children)

44GB of GPU VRAM? What GPU has 44GB other than the stupidly expensive ones? Are average folks running $25K GPUs at home? Or are the people running these working for companies with lots of money and building small GPU servers for them?

[–] MiniEval_@alien.top 1 points 2 years ago (1 children)

Dual 3090/4090s. Still pricey as hell, but not out of reach for some folks.

[–] Dry-Vermicelli-682@alien.top 1 points 2 years ago (1 children)

So anyone wanting to play around with this at home has to expect to drop about 4K or so on GPUs and a setup?

[–] drifter_VR@alien.top 1 points 2 years ago

I can get two 3090s for 1200€ here on the second-hand market.

[–] Secret_Joke_2262@alien.top 1 points 2 years ago (1 children)

A friend told me that for 70B at Q4, performance drops by about 10%. The larger the model, the less it suffers from weight quantization.

[–] Nkingsy@alien.top 1 points 2 years ago

Or the more undertrained it is, the more fat can be trimmed.

[–] a_beautiful_rhind@alien.top 1 points 2 years ago

A 4-bit 70B will eat those small models for breakfast.

[–] tu9jn@alien.top 1 points 2 years ago (1 children)

Usually the number of parameters matters more than bits per weight, but I had some problems with really low-bpw models like a 70B at 2.55bpw in exllamav2.

34b Yi could be a good compromise, I am impressed with it, and it has a long context length as well.

[–] AutomataManifold@alien.top 1 points 2 years ago

Early research suggested that there was an inflection point below 4 bits, where things got markedly worse. In my personal use, I find that accuracy definitely suffers below that point, though maybe modern quants handle it a bit better.

34B Yi does seem like a sweet spot, though I'm starting to suspect that we need some fine-tunes that use longer stories as part of the training data, because it doesn't seem to be able to maintain the quality for the entire length of the context. Still, being able to include callbacks to events from thousands of tokens earlier is impressively practical. I've been alternating between a fast 13B (for specific scenes), 34B Yi (for general writing), and 70B (for when you need it to be smart and varied). And, of course, just switching models can help with the repetition sometimes.
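To give some intuition for why things degrade quickly at low bit widths, here is a toy sketch of naive round-to-nearest quantization on random weights. This is not how GGUF k-quants, GPTQ, or exllamav2 work (they use block-wise scales and calibration data), and the Gaussian weights with standard deviation 0.02 are an assumption for illustration. It only shows that reconstruction error grows rapidly as bits are removed; it says nothing directly about where model quality falls off.

```python
import random


def quantize_roundtrip(weights: list[float], bits: int) -> list[float]:
    """Symmetric absmax round-to-nearest: quantize, then dequantize."""
    levels = 2 ** (bits - 1) - 1          # e.g. 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) * scale for w in weights]


def rms_error(weights: list[float], bits: int) -> float:
    """Root-mean-square difference between original and dequantized weights."""
    restored = quantize_roundtrip(weights, bits)
    squared = sum((w - r) ** 2 for w, r in zip(weights, restored))
    return (squared / len(weights)) ** 0.5


random.seed(0)
# Assumed toy weight distribution, one scale for the whole tensor.
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

for bits in (8, 4, 3, 2):
    print(f"{bits}-bit RMS error: {rms_error(weights, bits):.5f}")
```

Each bit removed roughly doubles the step size between representable values, so by 2 bits most small weights collapse to zero in this naive scheme.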

[–] Ion_GPT@alien.top 1 points 2 years ago

It depends on the task. For anything multilingual, like translation, quantization will destroy the model. I suspect this is because the calibration data sampled during the quantization process is all English.