this post was submitted on 26 Nov 2023

LocalLLaMA


Community to discuss about Llama, the family of large language models created by Meta AI.


I understand that a bigger memory means you can run a model with more parameters or less compression, but how does context size factor in? I believe it's possible to increase the context size, and that this will increase the initial processing before the model starts outputting tokens, but does someone have numbers?

Is memory for context independent of the model size, or does a bigger model mean that each bit of extra context 'costs' more memory?

I'm considering an M2 Ultra for the large memory and low energy per token, although its speed is behind RTX cards. Is this the best option for tasks like writing novels, where quality and comprehension of lots of text beats speed?

[–] a_beautiful_rhind@alien.top 1 points 9 months ago (2 children)

I see it being ~2GB for every 4k of context from what llama.cpp spits out. Load a model and read what it puts in the log.
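
A rough way to sanity-check that figure is the usual KV-cache arithmetic: two tensors (K and V) per layer, each holding one vector per KV head per token. The sketch below assumes an fp16 cache (2 bytes per element) and the published Llama 2 shapes; the function name `kv_cache_bytes` is made up for illustration, and real runtimes add some overhead on top, so treat the results as estimates rather than what llama.cpp will print exactly.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Bytes needed for the K and V caches across all layers for n_ctx tokens.

    bytes_per_elem=2 assumes an fp16 cache; a quantized cache would be smaller.
    """
    # Leading factor of 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


GIB = 1024 ** 3

# (layers, KV heads, head dim) taken from the published Llama 2 configs.
LLAMA2_SHAPES = {
    "7B": (32, 32, 128),
    "13B": (40, 40, 128),
    "70B": (80, 8, 128),  # grouped-query attention: only 8 KV heads
}

for name, (layers, kv_heads, head_dim) in LLAMA2_SHAPES.items():
    size = kv_cache_bytes(layers, kv_heads, head_dim, n_ctx=4096)
    print(f"Llama 2 {name}: {size / GIB:.2f} GiB of KV cache at 4k context")
```

Under these assumptions the 7B model comes out at 2 GiB per 4k tokens, which lines up with the log figure. The cost grows linearly with context length, and the per-token cost depends on layers x KV heads x head dim, so a bigger model usually pays more per token of context. The 70B is the exception because grouped-query attention cuts its KV heads to 8, which brings it down to 1.25 GiB per 4k.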

As to Mac vs RTX: you can build a system with the same or similar amount of VRAM as the Mac for a lower price, but it depends on your skill level and electricity/space requirements.

If you live in a studio apartment, I don't recommend buying an 8-card inference server, regardless of a couple thousand dollars in either direction and the faster speed.

[–] EvokerTCG@alien.top 1 points 9 months ago

Thanks. Yes, a 2kW heater pc would only be welcome in the winter, and could get pricy to run.

[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago (1 children)

What worthwhile multicard inference server is 6000 USD or less?

[–] a_beautiful_rhind@alien.top 1 points 9 months ago (1 children)

How much multicard? You can get away with single CPU EPYC boards for 4 and below. For more you need those supermicro 4028/4029 big guns. The older one is still $1000 and under.

[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago (1 children)

You mention the CPU only for the PCIe lanes, right? I don't want to run AVX-512 any more than I have to.

All-in-all the best deal for under 6k I've found is:

A Lenovo P920 with dual 3rd gen platinums and an rtx a6000 48GB. That should include about 256GB of DDR4.

Sure I could go rtx 8000 and use the remaining thousand to buy some arc a770 for 16GB and below models to improve on the agent. But I do prefer to stay as close to modern as possible with the GPU features.

I'm assuming you were thinking p40s with the recommendations you've made?

[–] a_beautiful_rhind@alien.top 1 points 9 months ago (1 children)

The P40 is getting long in the tooth but there is nothing that beats the price. I keep looking at what else I could buy that gives decent performance and realize it's either that or a 3090. I really, really wish it was faster or had an exllama that could just push it to 12-13 t/s.

Intel + Nvidia won't cooperate, so you'll have to have different environments. While that's great for running encapsulated things like TTS, it sorta sucks for training or trying bigger models. The same thing has kept me from buying the cheap MI25. Otherwise they will mainly sit and eat idle watts for nothing, as I found out with my extra P40. 3x24 covers most LLMs. The 4th card gets used for SD/TTS/whatever and the 5th card stays fallow.

With a workstation like a P920, you are really only gaining the RAM capacity. The point of those big Supermicros is so you can fit more than 2 cards at full speed.

If you are just going to spend on an A6000 or RTX 8000, then almost anything that can take at least 128GB of RAM will be enough. With 6k I would be more inclined to cobble together an EPYC board with a mining case, as then I have a single CPU, all the RAM I want and at least 4 x16 slots.

[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago

Well, my issues with the P40 are mainly that its fp16 throughput is too low to be useful, bandwidth is super low at only 346GB/s, and the card ends up longer because of the blower fan you have to add, probably from eBay. I do agree it's a big brain move for budget builders though.
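
To put the bandwidth number in context, a common rule of thumb is that single-stream token generation is memory-bound: every weight gets read once per token, so bandwidth divided by model size gives a ceiling on tokens per second. The sketch below uses the 346GB/s figure from this comment; the 40 GB model size is a hypothetical stand-in for a quantized 70b, and the function name is made up. It also assumes that when layers are split across identical cards, the cards work one after another, so the ceiling does not change with card count.

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound on decode speed if all weights are read once per token.

    Ignores compute, KV-cache reads and PCIe transfers, so real speeds
    will land below this number.
    """
    return bandwidth_gb_s / model_size_gb


P40_BANDWIDTH_GB_S = 346  # figure quoted in the comment above
ASSUMED_MODEL_GB = 40     # hypothetical size of a quantized 70b

ceiling = max_tokens_per_second(P40_BANDWIDTH_GB_S, ASSUMED_MODEL_GB)
print(f"Bandwidth-bound ceiling: {ceiling:.1f} t/s")
```

This is only a ceiling, not a prediction, but it shows why memory bandwidth rather than card count sets the speed for models that size.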

My issue with 24GB cards mainly stems from transferring data between two cards over PCIe. We know that a 70b on one 48GB card will consistently perform better than on 2x 24GB. Again, it's really negligible if you only have the budget for a dual 3090 build or something.

I do work extensively with OpenVINO and ONNX as a software developer so I'm not too worried about any issues with the platforms working together (I've managed to make them play nice one way or another for most things). This is actually why I was leaning more into the dual Xeon Platinums or Golds instead of the EPYC/Threadripper deal. PCIe lanes are plentiful either way though.

For the P920, the goal would mainly be to just have a q4 or q5 70b run on the 48GB but auxiliary models like embedding fetchers, semantic stuff, QA, etc. would be on something like an a770 due to the specs-to-price ratio. I don't really need the RAM and I logically figured that I wouldn't need more than 64GB dedicated to ML functions since even AVX-512 won't make up for the slowness of running something larger, imo.
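
As a back-of-the-envelope check on fitting a 70b into 48GB, the weight footprint is roughly parameters x bits per weight / 8. The bits-per-weight values below are illustrative assumptions, not exact figures for any particular quantization format, and `weight_bytes` is a made-up helper name; check the actual file size of the quant you plan to run.

```python
def weight_bytes(n_params, bits_per_weight):
    """Approximate size of the model weights in bytes."""
    return n_params * bits_per_weight / 8


GB = 1000 ** 3
N_PARAMS = 70e9   # nominal parameter count for a "70b"
VRAM_GB = 48      # the single 48GB card discussed above

# Assumed effective bits per weight, for illustration only.
ASSUMED_BPW = {"q4-ish": 4.5, "q5-ish": 5.5}

for label, bpw in ASSUMED_BPW.items():
    size_gb = weight_bytes(N_PARAMS, bpw) / GB
    headroom_gb = VRAM_GB - size_gb
    print(f"{label}: ~{size_gb:.1f} GB of weights, {headroom_gb:+.1f} GB headroom")
```

Whatever headroom is left also has to hold the context cache and runtime buffers, so under these assumed numbers a q4-class quant leaves some room on 48GB while a q5-class quant is borderline at best.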

I can only see myself having more than two cards in a machine working together if I could include Nvlink or something.

Eventually most of the things I make will be going to prod, so I also need to keep in mind that I'm more likely to get a good deal on cloud Xeons like Sapphire Rapids and a single big card vs an EPYC with many smaller cards.