this post was submitted on 27 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


The title, pretty much.

I'm wondering whether a 70b model quantized to 4bit would perform better than a 7b/13b/34b model at fp16. Would be great to get some insights from the community.
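For a rough sense of what's at stake, here's a weights-only back-of-the-envelope comparison (a minimal sketch using round parameter counts; it ignores KV cache, activations, and quantization overhead, so real footprints run higher):

```python
def vram_gib(n_params, bits_per_weight):
    """Rough weights-only footprint in GiB; ignores KV cache and activations."""
    return n_params * bits_per_weight / 8 / 2**30

print(f"70B @ 4-bit : {vram_gib(70e9, 4):.1f} GiB")   # ~32.6
print(f"34B @ fp16  : {vram_gib(34e9, 16):.1f} GiB")  # ~63.3
print(f"13B @ fp16  : {vram_gib(13e9, 16):.1f} GiB")  # ~24.2
print(f"7B  @ fp16  : {vram_gib(7e9, 16):.1f} GiB")   # ~13.0
```

So a 4-bit 70B actually needs less memory than a fp16 34B, which is part of why the comparison is interesting in the first place.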

[–] Sea_Particular_4014@alien.top 1 points 11 months ago (1 children)

Adding to Automata's theoretical info, I can say that anecdotally I find 4bit 70B substantially better than 8bit 34B or below, but it'll depend on your task.

It seems like right now the 70b models are really good for storywriting, RP, logic, etc., while if you're doing programming or data classification or similar, you might be better off with a higher-precision smaller model that's been fine-tuned for the task at hand.

I noticed in my 70b circle jerk rant thread I posted a couple days ago, most of the people saying they didn't find the 70b that much better (or better at all) were doing programming or data classification type stuff.

[–] AnOnlineHandle@alien.top 1 points 11 months ago (3 children)

What sort of vram is needed to run a 4bit 70B model?

[–] yeawhatever@alien.top 1 points 11 months ago (2 children)
[–] harrro@alien.top 1 points 11 months ago (1 children)

Using Q3, you can fit it in 36GB (I have a weird combo of RTX 3060 with 12GB and P40 with 24GB and I can run a 70B at 3bit fully on GPU).
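That 36GB figure checks out roughly. A quick sketch, assuming ~3.5 bits per weight as a stand-in for Q3 (actual Q3_K variants vary a bit, and this ignores the KV cache and compute buffers):

```python
# Back-of-the-envelope size of 70B weights at a rough Q3-ish bitrate.
bpw = 3.5  # assumed bits/weight; real Q3_K quants land in this neighborhood
weights_gib = 70e9 * bpw / 8 / 2**30
print(f"~{weights_gib:.1f} GiB for weights alone")  # ~28.5 GiB
```

That leaves several GiB of headroom in 36GB for context and buffers, which matches the experience above.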

[–] Dry-Vermicelli-682@alien.top 1 points 11 months ago (1 children)

So you have 2 GPUs on a single motherboard... and llama.cpp knows to use both? Does this work with AMD GPUs too?

[–] harrro@alien.top 1 points 11 months ago

Yes llama.cpp will automatically split the model to work across GPUs. You can also specify how much of the full model should be on each GPU.

Not sure on AMD support but for nvidia it's pretty easy to do.
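To illustrate how the split described above works out in practice, here's a small sketch using the 12GB + 24GB setup from the comment (llama.cpp's `--tensor-split` option takes proportions, so values like "12,24" and "1,2" mean the same thing):

```python
# Proportional layer split for a 12 GiB + 24 GiB dual-GPU setup.
vram = {"RTX 3060": 12, "P40": 24}  # GiB per card, from the comment above
total = sum(vram.values())
for gpu, gib in vram.items():
    print(f"{gpu}: {gib / total:.0%} of the model")  # 33% / 67%
```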

[–] Dry-Vermicelli-682@alien.top 1 points 11 months ago (1 children)

44GB of GPU VRAM? WTH GPU has 44GB other than stupidly expensive ones? Are average folks running $25K GPUs at home? Or are those running these working for companies with lots of money and building small GPU servers to run them?

[–] MiniEval_@alien.top 1 points 11 months ago (1 children)

Dual 3090/4090s. Still pricey as hell, but not out of reach for some folks.

[–] Dry-Vermicelli-682@alien.top 1 points 11 months ago (1 children)

So anyone wanting to play around with this at home should expect to drop about $4K or so on GPUs and a setup?

[–] drifter_VR@alien.top 1 points 11 months ago

I can get 2 3090 for 1200€ here on the second-hand market

[–] Dusty_da_Cat@alien.top 1 points 11 months ago

The gold standard is 2 x 3090/4090 cards, which is 48 GB of VRAM total. You can get by with 2 P40s (they need a cooling solution) and run onboard video, if you want to save some money. The speeds will be slower, but still better than running in system RAM on typical setups.

[–] Sea_Particular_4014@alien.top 1 points 11 months ago

Well... none at all if you're happy with ~1 token per second or less using GGUF CPU inference.

I have 1 x 3090 24GB and get about 2 tokens per second with partial offload. I find it usable for most stuff but many people find that too slow.

You'd need 2 x 3090 or an A6000 or something to do it quickly.