this post was submitted on 04 Dec 2023
1 points (100.0% liked)

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

founded 1 year ago

Right now it seems we are once again on the cusp of another round of LLM size upgrades. It appears to me that 24GB of VRAM gets you access to a lot of really great models, but 48GB really opens the door to the impressive 70B models and lets you run the 30B models comfortably. However, I'm seeing more and more 100B+ models being created that push 48GB setups down into lower quants, if they can run the model at all.

This, in my opinion, is big, because 48GB is currently the magic number for consumer-level cards: 2x 3090s or 2x 4090s. Adding an extra 24GB to a build via consumer GPUs turns into a monumental task due to either space in the tower or the capabilities of the hardware, AND it would only put you at 72GB of VRAM, right at the very edge of the recommended VRAM for the 120B 4KM models.
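
For rough numbers, here is a back-of-the-envelope sketch in Python: weights at a given bits-per-weight, plus an assumed ~20% headroom for KV cache and activations. The bits-per-weight values and the overhead factor are approximations, not exact figures for any particular quant:

```python
# Back-of-the-envelope VRAM estimate: weights only, plus an assumed ~20% headroom
# for KV cache and activations. The bpw values roughly approximate common quants.
def est_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    return params_billion * bits_per_weight / 8 * overhead

for params in (34, 70, 120):
    for bpw in (4.5, 5.5, 8.0):  # roughly Q4_K_M, Q5_K_M, 8-bit
        print(f"{params:>3}B @ {bpw} bpw ~ {est_vram_gb(params, bpw):4.0f} GB")
```

Under those assumptions a 70B model at ~4.5 bpw lands around 47GB (hence the 48GB magic number), while a 120B model at the same quant is already pushing past 72GB.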

I genuinely don't know what I'm talking about and I'm just rambling, because I'm trying to wrap my head around HOW to upgrade my VRAM to load the larger models without buying a massively overpriced workstation card. Should I stuff four 3090s into a large tower? Set up three 4090s in a rig?

How can the average hobbyist make the jump from 48GB to 72GB+?

Is taking a wait-and-see approach toward Nvidia dropping new scalper-priced high-VRAM cards feasible? Hope and pray for some kind of technical magic that drops the required VRAM while keeping quality?

The reason I'm stressing about this and asking for advice is that the quality difference between smaller models and 70B models is astronomical, and the difference between the 70B models and the 100B+ models is a huge jump too. From my testing, it seems the 100B+ models really turn the "humanization" of the LLM up to the next level, leaving the 70B models sounding like... well... AI.

I am very curious to see where this gets to by the end of 2024, but one thing is for sure: I won't be seeing it on a 48GB VRAM setup.

top 20 comments
[–] fediverser@alien.top 1 points 11 months ago

This post is an automated archive from a submission made on /r/LocalLLaMA, powered by Fediverser software running on alien.top. Responses to this submission will not be seen by the original author until they claim ownership of their alien.top account. Please consider reaching out to let them know about this post and help them migrate to Lemmy.

Lemmy users: you are still very much encouraged to participate in the discussion. There are still many other subscribers on !localllama@poweruser.forum that can benefit from your contribution and join in the conversation.

Reddit users: you can also join the fediverse right away by visiting https://portal.alien.top. If you are looking for a Reddit alternative made for and by an independent community, check out Fediverser.

[–] unculturedperl@alien.top 1 points 11 months ago

Speed costs money, how fast can you afford to go?

Why 72GB? 80 or 96 seems like a more reasonable number. H100s have 80GB models if you can afford one ($29k?). Two A6000 Adas would be $15k (plus a system to put them in).

The higher-end compute cards seem more limited by funds and production than anything; x090 cards are where you find more scalpers and their ilk.

[–] bick_nyers@alien.top 1 points 11 months ago

ITT people are discussing making the jump to Threadripper etc. to get enough PCIe lanes.

Alternatively, pick up a Zen 2 EPYC on eBay for cheap. A 16-core CPU + motherboard could run you around $500, and you can get six PCIe 4.0 x16 slots. Check motherboard specs and learn more about using server hardware (loud fans!) via ServeTheHome and Art of Server.

Saw something a while back that GDDR7 will have something like 33% more memory per chip, so if the bus width stays the same we are looking at a 32GB 5090. Keep in mind this will be PCIE 5.0.
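
The arithmetic behind that guess, as a quick sketch (the 384-bit bus width is my assumption, carried over from the 4090; the +33% per-chip capacity is the rumor above):

```python
# Rough math behind the "32GB 5090" guess: one GDDR package per 32-bit channel,
# assumed 384-bit bus, ~33% more capacity per chip with GDDR7.
bus_width_bits = 384
channel_bits = 32
chips = bus_width_bits // channel_bits        # 12 packages
gb_per_chip_now = 2.0                         # 2GB packages -> 24GB today
gb_per_chip_gddr7 = gb_per_chip_now * 4 / 3   # ~33% more per chip
print(chips * gb_per_chip_now)    # 24.0 GB
print(chips * gb_per_chip_gddr7)  # 32.0 GB
```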

[–] synn89@alien.top 1 points 11 months ago

Building a system that supports two 24GB cards doesn't have to cost a lot. Boards that can do dual x8 PCIe, and cases/power that can handle two GPUs, aren't hard to find. The problem I see past that is that you're running into much more exotic/expensive hardware; AMD Threadripper comes to mind, which is a big price jump.

Given that the market of people who can afford that is much smaller than for dual-card setups, I don't feel like we'll see the lion's share of open-source work happening at that level. People tend to tinker on things that are likely to get used by a lot of people.

I don't really see this changing much until AMD/Intel come out with graphics cards that bust the consumer-card 24GB barrier to compete with Nvidia head-on in the AI market. Right now Nvidia won't do that, so as not to compete with their premium-priced server cards.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

The easiest thing to do is to get a Mac Studio. It also happens to be the best value. 3x 4090s at $1,600 each is $4,800, and that's just for the cards; adding a machine to put them in will cost another few hundred dollars. Just the cost of the 3x 4090s puts you into Mac Ultra 128GB range, and adding the machine to put them in puts you into Mac Ultra 192GB range. With those 3x 4090s you only have 72GB of VRAM, while both of those Mac options give you much more memory.

[–] Bod9001@alien.top 1 points 11 months ago

If you want to run a general-purpose model that can do everything, fair enough, throw resources at it. But I feel like there's a lot of optimisation that can be done; e.g., a coding model doesn't need to know how to fill out tax returns or who won the European Cup in 1995–96, and it may even be possible to shrink models without any loss.

[–] kingp1ng@alien.top 1 points 11 months ago

I keep hearing *unsubstantiated* rumors about model optimization breakthroughs. Everyone knows that the cost of compute is too damn high.

So I'm just waiting until the next performance improvements arrive. Three years ago, a 1B-param model was state of the art. Hopefully by next year there'll be a model and framework that cuts the compute cost in half.

[–] DominicanGreg@alien.top 1 points 11 months ago

Parts-wise, a Threadripper + ASUS Pro WS WRX80E-SAGE SE WiFi II is already a $2k price floor.

Each 4090 is $2-2.3k.

Each 3090 is $1-1.5k.

So building a machine from scratch will easily run you $8-10k with 4090s, or $6-8k with 3090s. If you already have some GPUs or parts, you would still probably need two or more extra GPUs, plus the space and power to run them.

In my specific situation, I would have to grab the Threadripper, mobo, a case, RAM, and two more cards; I'm looking at potentially $5-7k worth of damage. OR... pay $8.6k for a Mac Pro M2 and get an entire extra machine to play with.

There's definitely an entire Mac Pro M3 series on the way, considering they just released the laptops; it's only a matter of time before the announcements. So I would definitely feel a bit peeved if I bought the M2 tower only for Apple to release the M3 versions a month or two later.

[–] ModeradorDoFariaLima@alien.top 1 points 11 months ago

Wish we could just solder more VRAM to the cards. Such a silly thing to keep holding us back.

[–] corecursion0@alien.top 1 points 11 months ago

The next gen of models is in the 110B range and beyond. I would say: estimate what it takes to run 250B at FP8 and FP16 (roughly 250GB and 500GB just for the weights, respectively), then structure your purchases accordingly. Favour high-bandwidth memory.

[–] tylerbeefish@alien.top 1 points 11 months ago

Your wait-and-see approach is probably wise. The newly released GH200 chip leapfrogs the H100, which was already smoking the A100, by a considerable margin.

On the consumer side, there does not seem to be high demand for running local LLMs. However, I ran a 7B model with GPT4All on my ultrabook from 2014, which has a low-tier 6th-gen Intel CPU and 16GB of RAM, and was getting about 2.5 tokens/second. It was super slow, but it shows what would be possible with some optimizations on consumer hardware.

If you're willing to spend $10k to run an esoteric 110B model, it might be worthwhile to go for the capability to train one in the first place (even if perhaps very slowly). Or consider a Mac with a large amount of memory built into the SoC (unified memory), which would likely run models at an acceptable rate with some optimizations, provided blistering performance isn't necessary.

Otherwise, patience will likely pay off in the form of a solid model that works on consumer-grade components. The space seems keen on enabling general users and offering alternatives to sending data to some random server elsewhere. Just my opinion.

[–] Flying_Madlad@alien.top 1 points 11 months ago

I think the future is modular. Many small machines contributing to hosting a bigger model.

That way, if you need to upgrade the capacity of your system, you can just add another compute node.

[–] MindOrbits@alien.top 1 points 11 months ago

Yes.

Workstations are the way to go. There are a few motherboards out there that give you four double-wide slots.

Pro tip: think in PCIe 3.0 terms; x16 (PCIe 3.0) is a sought-after baseline. x8 often performs at about 80% of x16, and often other system limitations are the bottleneck, not the PCIe bus.

Depending on the CPU, motherboard chipset, and internal lane connections, you will struggle to find four x16 slots.

PCIe 4.0 adds to the mess, but always in your favor, just not as much as you might think, depending on the above.

Older cards: PCIe 3.0. Most cards you'd consider modern and good or better: PCIe 4.0. New cards: PCIe 5.0.

PCIe 4.0 lanes can be split by chipsets for things like NVMe drives and USB, and PCIe 4.0 has 2x the bandwidth of PCIe 3.0 with supported 4.0 devices (x8 PCIe 4.0 ≈ x16 PCIe 3.0). A nice motherboard feature is when x16 PCIe 4.0 lanes are split into two x16 PCIe 3.0 slots. Chipsets and NVMe drives benefit greatly from PCIe 4.0 and often free up more PCIe 3.0 lanes for the slots.

So... if you find four double-wide PCIe slots with at least x8 lanes per slot, you're leaving some performance on the table, but you're really not that handicapped by the loss for what you're buying, especially when shopping used.

Really new cards would suffer more from lane saturation, and may not have a favorable cost-to-benefit ratio given their prices.
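
For reference, a quick sketch of the approximate per-direction bandwidth behind the x8/x16 comparison above. The ~0.985 GB/s per PCIe 3.0 lane figure is the usual post-encoding approximation, and each generation doubles it:

```python
# Approximate per-direction PCIe bandwidth (GB/s) by generation and lane count.
# Note how PCIe 4.0 x8 roughly matches PCIe 3.0 x16.
GEN3_PER_LANE_GBPS = 0.985
for gen, mult in (("3.0", 1), ("4.0", 2), ("5.0", 4)):
    for lanes in (4, 8, 16):
        print(f"PCIe {gen} x{lanes:<2}: ~{GEN3_PER_LANE_GBPS * mult * lanes:5.1f} GB/s")
```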

[–] -Automaticity@alien.top 1 points 11 months ago

If Nvidia doesn't push GPUs past 24GB for the RTX 50 series, that will probably factor into the open-source community keeping models below ~40B parameters. I don't know the exact cutoff point. A lot of people with 12GB of VRAM can run 13B models, but you could also run a 7B at 8-bit with a 16k context. It will get increasingly difficult to run larger contexts with larger models.
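
As a rough sketch of why larger contexts get expensive, here is an fp16 KV-cache estimate using assumed Llama-2-7B-like shapes (32 layers, 32 KV heads, head dim 128, no GQA); the shapes are illustrative, not exact for any particular model:

```python
# Rough fp16 KV-cache size: 2 tensors (K and V) per layer, per token.
# Shapes below are assumed Llama-2-7B-like values.
def kv_cache_gb(layers=32, kv_heads=32, head_dim=128, context=16_384, bytes_per_el=2):
    return 2 * layers * kv_heads * head_dim * context * bytes_per_el / 1e9

print(f"{kv_cache_gb():.1f} GB")               # ~8.6 GB at 16k context
print(f"{kv_cache_gb(context=4096):.1f} GB")   # ~2.1 GB at 4k context
```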

Some larger open models are being released, but there won't be much of a community there to train the huge models on a bunch of datasets and nail the ideal finetune.

[–] Rutabaga-Agitated@alien.top 1 points 11 months ago

We built a 4x RTX 4090 setup out of a mining rig. That's 96GB of VRAM for roughly $10k... it doesn't get cheaper than that. Best compute per cost right now, I think.

https://preview.redd.it/nfq4olntq54c1.png?width=1812&format=pjpg&auto=webp&s=a5308bb5eec778072f8d6a394b5243ca33c7fd87


[–] AutomataManifold@alien.top 1 points 11 months ago

I think it's worth remembering that while the really big models take a lot of VRAM, they also quantize down to smaller sizes, so the raw numbers are slightly misleading.

[–] wind_dude@alien.top 1 points 11 months ago

Or 2 A6000s. But yeah, $$$ matters.

[–] a_beautiful_rhind@alien.top 1 points 11 months ago

I'm not getting a super huge jump with the bigger models yet, just a mild bump. I got a P100 so I can load the low-100B models and have exllama work. That's 64GB of FP16-capable VRAM.

For bigger models I can use FP32 and put the two extra P40s back in. That's 120GB of VRAM. Also, six vidya cards :P

It required building for this type of system from the start. I'm not made of money either; I just upgrade it over time.

[–] nero10578@alien.top 1 points 11 months ago

You don't NEED 3090s/4090s. A 3x Tesla P40 setup still streams at reading speed running 120B models.

[–] nostriluu@alien.top 1 points 11 months ago

I have two questions:

What's this going to look like in six months, with new Intel, AMD, and ARM/RISC UMA hybrid designs well supported and 7200MT/s+ DDR5 common?

Are the high-memory models that much better? My impression is that you get a lot of reliable utility out of good smaller models, and from there it's diminishing returns.

I had a honking system with two 3090s, but it felt a bit boondoggle-ish, so I sold it. My current plan is to get something like a 4060 Ti 16GB and also use OpenAI's API, so I can wait and see what develops rather than spending it all now while it's still early days. I can see how someone who is really developing LLMs would want more, but as a "consumer" this seems reasonable.

Even for the "just get a Mac Studio" option, it seems like the M3 can use more of its memory as VRAM and is more optimized, so it's worth waiting until the M3 Ultra comes out, unless you can get a bargain-bin previous model.