this post was submitted on 26 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


I understand that a bigger memory means you can run a model with more parameters or less compression, but how does context size factor in? I believe it's possible to increase the context size, and that this will increase the initial processing before the model starts outputting tokens, but does someone have numbers?

Is memory for context independent of the model size, or does a bigger model mean that each bit of extra context 'costs' more memory?

I'm considering an M2 ultra for the large memory and low energy/token, although the speed is behind RTX cards. Is this the best option for tasks like writing novels, where quality and comprehension of lots of text beats speed?

top 11 comments
[–] a_beautiful_rhind@alien.top 1 points 9 months ago (2 children)

I see it being ~2GB per 4k of context from what llama.cpp spits out. Load a model and read what it puts in the log.

As to Mac vs RTX: you can build a system with the same or similar amount of VRAM as the Mac for a lower price, but it depends on your skill level and electricity/space requirements.

If you live in a studio apartment, I don't recommend buying an 8 card inference server, regardless of the couple of thousand dollars in either direction and the faster speed.

[–] EvokerTCG@alien.top 1 points 9 months ago

Thanks. Yes, a 2kW heater pc would only be welcome in the winter, and could get pricy to run.

[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago (1 children)

What worthwhile multicard inference server is 6000 usd or less?

[–] a_beautiful_rhind@alien.top 1 points 9 months ago (1 children)

How much multicard? You can get away with single CPU EPYC boards for 4 and below. For more you need those supermicro 4028/4029 big guns. The older one is still $1000 and under.

[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago (1 children)

You mention the CPU only for the PCIe lanes, right? I don't want to run AVX-512 any more than I have to.

All-in-all the best deal for under 6k I've found is:

A Lenovo P920 with dual 3rd gen platinums and an rtx a6000 48GB. That should include about 256GB of DDR4.

Sure I could go rtx 8000 and use the remaining thousand to buy some arc a770 for 16GB and below models to improve on the agent. But I do prefer to stay as close to modern as possible with the GPU features.

I'm assuming you were thinking p40s with the recommendations you've made?

[–] a_beautiful_rhind@alien.top 1 points 9 months ago (1 children)

P40 is getting long in the tooth but there is nothing that beats the price. I keep looking at what else I could buy that gives decent performance and realize it's it or 3090. I really really wish it was faster or had an exllama that could just push it to 12-13 t/s

Intel + Nvidia won't cooperate so you'll have to have different environments and while that's great for running encapsulated things like TTS, it sorta sucks for training or trying bigger models. Has kept me from buying the cheap Mi25. Otherwise they will mainly sit and eat idle watts for nothing as I found out with my extra P40. 3x24 covers most LLM. 4th card gets used for SD/TTS/whatever and 5th card stays fallow.

With a workstation like a P920, you are really only gaining the ram capacity. The point of those big supermicros is so you can fit more than 2 cards at full speed.

If you are just going to spend on an A6000 or RTX 8000, then almost anything that can do at least 128GB of RAM will be enough. With 6k I would be more inclined to cobble together an EPYC board with a mining case, as then I have a single CPU, all the RAM I want and at least 4 x16 slots.

[–] Prudent-Artichoke-19@alien.top 1 points 9 months ago

Well my issues with the p40 are mainly that it's not fp16-capable, bandwidth is super low at only 346GB/s, and it requires elongation due to the added blower fan from probably ebay. I do agree it's a big brain move for budget builders though.

My issue with 24GB cards mainly stems from transferring data between two cards over PCI-e. We know that a 70b on one 48GB vs 2x 24GB will perform consistently better. Again, it's really negligible if you only have the budget for a dual 3090 build or something.

I do work extensively with OpenVINO and ONNX as a software developer so I'm not too worried about any issues with the platforms working together (I've managed to make them play nice one way or another for most things). This is actually why I was leaning more into the dual Xeon platinums or golds instead of the Epyc/Threadripper deal. PCI lanes are plentiful either way though.

For the P920, the goal would mainly be to just have a q4 or q5 70b run on the 48GB but auxiliary models like embedding fetchers, semantic stuff, QA, etc. would be on something like an a770 due to the specs-to-price ratio. I don't really need the RAM and I logically figured that I wouldn't need more than 64GB dedicated to ML functions since even AVX-512 won't make up for the slowness of running something larger, imo.

I can only see myself having more than two cards in a machine working together if I could include Nvlink or something.

Eventually most of the things I make will be going to prod, so I also need to keep in mind that I'm more likely to get a good deal on cloud Xeons like Sapphire Rapids and a single big card vs an EPYC with many smaller cards.

[–] FullOf_Bad_Ideas@alien.top 1 points 9 months ago (1 children)

Formula to calculate kv cache, as in space used by context

batch_size * seqlen * (d_model/n_heads) * n_layers * 2 (K and V) * 2 (bytes per Float16) * n_kv_heads

https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices

This blog post is really good, I recommend reading it.

Usually bigger models have more layers, heads and dimensions, but I am not sure whether heads or dimensions grow faster. It's something you can look up though.
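To put numbers on the formula above, here is a minimal Python sketch. The function name `kv_cache_bytes` is made up for illustration, and the layer/head counts are the commonly published Llama 2 7B and 70B configurations, which I'm treating as assumptions; check your model's config file before relying on them. It assumes an fp16 cache with no cache quantization.

```python
def kv_cache_bytes(batch_size, seqlen, d_model, n_heads, n_layers,
                   n_kv_heads, bytes_per_value=2):
    """KV cache size in bytes, following the formula quoted above.

    bytes_per_value=2 assumes an fp16 cache; a quantized cache would be smaller.
    """
    head_dim = d_model // n_heads  # dimension of a single attention head
    k_and_v = 2                    # one K tensor and one V tensor per layer
    return (batch_size * seqlen * head_dim * n_layers
            * k_and_v * bytes_per_value * n_kv_heads)


GIB = 1024 ** 3

# Assumed configs (not from this thread): Llama 2 7B uses full multi-head
# attention (32 KV heads), Llama 2 70B uses grouped-query attention (8 KV heads).
llama2_7b = kv_cache_bytes(batch_size=1, seqlen=4096, d_model=4096,
                           n_heads=32, n_layers=32, n_kv_heads=32)
llama2_70b = kv_cache_bytes(batch_size=1, seqlen=4096, d_model=8192,
                            n_heads=64, n_layers=80, n_kv_heads=8)

print(f"7B  @ 4k context: {llama2_7b / GIB:.2f} GiB")   # 2.00 GiB
print(f"70B @ 4k context: {llama2_70b / GIB:.2f} GiB")  # 1.25 GiB
```

Under these assumptions the 7B figure lines up with the "~2GB per 4k" reported from the llama.cpp log earlier in the thread, and the 70B model actually needs less cache per token because it has far fewer KV heads. So the cost per token depends on layers, head size and KV heads, not directly on parameter count.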

[–] EvokerTCG@alien.top 1 points 9 months ago (1 children)

Thanks. I would guess the seqlen is the sum of the input and output length as it feeds back on itself.

[–] FullOf_Bad_Ideas@alien.top 1 points 9 months ago

yeah, it will be a sum of tokens that the next token is generated on. I don't know how often KV cache is updated.

[–] BoshiAI@alien.top 1 points 9 months ago

Potentially dumb but related question:

I know the Mac M* Series chips can use up to ~70% of their unified memory for "GPU" (VRAM) purposes. The 20GB used to load up a Yi-34B model just about uses all of that up.

So: given I still have maybe 8GB of remainder RAM to work with (assuming I leave 4GB for the system), would I be able to apply a 128K context buffer and have that located in "normal" RAM?

I'm assuming the heavy computational load is the inferencing itself, with the model loaded in "VRAM" and handled by the GPU side of the chip - but can the context buffer be loaded and work at a decent speed in the remaining RAM? Or do both the context buffer and the model have to sit in "VRAM" to work at a decent speed?
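As a rough sanity check on the size of that 128K buffer, here is a small Python sketch using the KV cache formula posted earlier in the thread. The Yi-34B numbers (60 layers, 8 KV heads, head dimension 128) are my assumption from the published model config, the helper name `kv_cache_gib` is hypothetical, and the cache is assumed to be fp16. It says nothing about where the runtime places the cache, only how big it would be.

```python
def kv_cache_gib(seqlen, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in GiB for a single sequence (batch size 1).

    The factor of 2 covers K and V; bytes_per_value=2 assumes fp16.
    """
    total_bytes = seqlen * n_layers * n_kv_heads * head_dim * 2 * bytes_per_value
    return total_bytes / 1024 ** 3


# Assumed Yi-34B configuration; verify against the model's config.json.
YI_34B = dict(n_layers=60, n_kv_heads=8, head_dim=128)

for context in (4096, 32768, 131072):
    size = kv_cache_gib(context, **YI_34B)
    print(f"{context:>6} tokens: {size:.2f} GiB")
# Prints 0.94, 7.50 and 30.00 GiB respectively
```

If those assumptions hold, a full 128K fp16 cache would be around 30 GiB on its own, well past the 8GB left over, whichever pool of memory it sits in. Something closer to 32K would be the ceiling for that budget, unless the runtime quantizes the cache.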