EvokerTCG

joined 11 months ago
 

I'm pricing out a large-RAM CPU build and want to share my ideas and get advice. I've seen others here say they've done something similar, but without details.

So starting with the motherboard: I want a single-CPU board with 12 memory channels and room for a GPU. The only one I can find is this Supermicro for around $900. https://www.supermicro.com/en/products/motherboard/h13ssl-n

Gigabyte does have a 12-channel board, but it has 24 DIMM slots, which would get in the way of a GPU. Are there other good options?

For RAM, 32GB DDR5-4800 server memory is about $140 per stick, so $1,680 for 12, giving 384GB.

For the CPU, the EPYC 9354 should be more than sufficient, at around $3,000.
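As a sanity check, the cost and theoretical peak bandwidth for this build work out like this (the per-stick, board, and CPU prices are the rough estimates above, not quotes):

```python
# Rough build-cost and bandwidth math for the 12-channel DDR5-4800 setup.
dimm_price, dimm_count, dimm_gb = 140, 12, 32
board_price, cpu_price = 900, 3000

total_ram_gb = dimm_count * dimm_gb        # 384 GB
ram_cost = dimm_count * dimm_price         # $1,680
build_cost = ram_cost + board_price + cpu_price

# DDR5-4800 moves 4800 MT/s * 8 bytes per channel = 38.4 GB/s per channel.
channels = 12
bandwidth_gbs = channels * 4800 * 8 / 1000  # ~460.8 GB/s theoretical peak

print(f"{total_ram_gb} GB for ${build_cost}, ~{bandwidth_gbs:.0f} GB/s peak")
```

Real sustained bandwidth will land somewhere below that theoretical peak, but it gives a ballpark for comparing against GPU and Mac options.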

[–] EvokerTCG@alien.top 1 points 11 months ago

Thanks. I can't find 'qualification samples' on ebay in the UK, unless you just find them through a serial number or something.

The DDR5 RAM is more expensive, but it should hold its value fairly well. I'll look for a 12-channel board.

[–] EvokerTCG@alien.top 1 points 11 months ago (2 children)

I want to keep my options open, and potentially have a large context, which can add up to 100GB to memory requirements.

I'm considering a single Genoa CPU with 12 channels. Something like the 9354 would be more than enough cores. I might start with a cheaper DDR4 machine first, though.

How was it getting the EPYC machine set up? Are you using Windows? What about a GPU?

[–] EvokerTCG@alien.top 1 points 11 months ago

Not true from what I've read here.

 

Continuing my quest to choose a rig with lots of memory, one possibility is dual-socket motherboards. Gen 1 to 3 EPYC chips have 8 channels of DDR4 each, giving 16 memory channels in total. That's good bandwidth, even if it doesn't beat GPUs, and it allows far more memory (up to 1024GB). Builds with 64+ threads can be pretty cheap.
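The bandwidth math for that dual-socket DDR4 setup, and what it implies for token speed, sketches out roughly like this (the 40GB model size is an assumed example, e.g. a ~70B model at 4-bit):

```python
# Back-of-envelope: dual-socket DDR4-3200, 8 channels per CPU.
# Token generation is roughly memory-bandwidth-bound: each token reads
# every model weight once, so peak tok/s <= bandwidth / model size.
channels = 2 * 8
bw_gbs = channels * 3200 * 8 / 1000   # 409.6 GB/s aggregate peak

model_gb = 40                         # assumed: ~70B model at 4-bit
ceiling_tps = bw_gbs / model_gb       # ~10 tok/s upper bound
print(f"{bw_gbs:.1f} GB/s -> at most ~{ceiling_tps:.1f} tok/s")
# Real numbers land well below this: NUMA cross-socket traffic means the
# two sockets' bandwidth doesn't simply add for a single model.
```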

My questions are

  • Does the dual CPU setup cause trouble with running LLM software?
  • Is it reasonably possible to get Windows, drivers, etc. working on 'server' architecture?
  • Is there anything else I should consider vs going for a single EPYC or Threadripper Pro?
 

So I'm looking into Threadripper Pro systems, which offer pretty good memory bandwidth (8 channels) and can take a huge amount of RAM. (I could put a 3090 or two in there too.)

I'm wondering how much the core count will affect performance. For example, the 5955WX has 16 cores while the 5995WX has 64, but both use the same memory. There's little point spending extra if the limiting factor is somewhere else.
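A very rough way to check whether cores or memory bandwidth is the bottleneck (the model size and per-core throughput figures here are illustrative guesses, not benchmarks):

```python
# Rough check on whether 16 vs 64 cores matters when both share the same
# 8-channel DDR4-3200 memory. All figures are illustrative assumptions.
bw_gbs = 8 * 3200 * 8 / 1000          # 204.8 GB/s peak for 8 channels
model_gb = 40                         # assumed: ~70B model at 4-bit
bandwidth_tps = bw_gbs / model_gb     # ~5 tok/s memory-side ceiling

# Compute side: ~2 FLOPs per weight per token, and say ~50 GFLOPS of
# sustained throughput per core for this workload (rough guess).
params = 70e9
flops_per_token = 2 * params
per_core_flops = 50e9
cores_needed = flops_per_token * bandwidth_tps / per_core_flops
print(f"~{cores_needed:.0f} cores saturate the memory bus")
```

Under these assumptions, somewhere in the mid-teens of cores already saturates the memory bus, which would suggest the 16-core part is the better value for pure inference. The per-core number is the shakiest input, so treat this as a way to frame the question rather than an answer.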

[–] EvokerTCG@alien.top 1 points 11 months ago

Aside from repetition, isn't this effectively a new sampling method? You could call it Fuzzed Greedy Sampling.

[–] EvokerTCG@alien.top 1 points 11 months ago (1 children)

I meant in total, but there do seem to be models with up to 100GB for context, like 01-ai/Yi-34B-200K.

[–] EvokerTCG@alien.top 1 points 11 months ago

A valid option. I haven't looked into rental prices, but it could make sense unless I end up using it a lot.

 

So I'm interested in applications that need memory more than speed, with high quality and a big context. I'm talking 100GB or more. Speed is still a consideration: I don't need snappy conversations, but getting through more work 'overnight' is still valuable.

3090s are affordable, but it would take 4 to 8 of them to reach the big-memory category, and the primary issue is energy use. For batch use the PC could shut down after finishing, so idle power wouldn't be an issue. Are there motherboards that can completely cut power to extra cards when they aren't needed?
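The card count and load power for the 3090 route are easy to pin down (the 350W figure is a rough nameplate number for a 3090 under load, not a measurement):

```python
import math

# How many 24 GB cards are needed for ~100 GB of weights + context,
# and roughly what that draws at load.
target_gb, card_gb = 100, 24
cards = math.ceil(target_gb / card_gb)   # 5 cards minimum
power_w = cards * 350                    # ~350 W per 3090 under load (assumed)
print(f"{cards} cards, ~{power_w} W at full load")
```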

A Mac Studio M2 Ultra can have 192GB of unified memory, with about 140GB usable. It isn't as fast, obviously, but is said to be acceptable for many applications.

What about PCs/servers with lots of mainboard RAM? Is this much slower than the Macs due to the different architecture? If not, it's probably a lot cheaper. The CPU would need to do all the work, and I don't know how the energy efficiency would compare.

I would be grateful if anyone has data comparing speeds or joules per token for these broad options.
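For anyone comparing, joules per token is just sustained watts divided by tokens per second. A quick sketch of the calculation with placeholder numbers (none of these watt or tok/s figures are measured; they're stand-ins to show the shape of the comparison):

```python
# Joules per token = sustained power (W) / throughput (tok/s).
# All numbers below are placeholder assumptions, not benchmarks.
rigs = {
    "multi-3090": (1500, 15.0),   # (watts, tok/s) - assumed
    "M2 Ultra":   (150,  5.0),    # assumed
    "EPYC CPU":   (400,  4.0),    # assumed
}
for name, (watts, tps) in rigs.items():
    print(f"{name}: {watts / tps:.0f} J/token")
```

With numbers like these the GPU rig and the CPU rig can end up with similar energy per token despite very different wall power, because the GPUs finish faster. Real measurements would obviously be better.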

[–] EvokerTCG@alien.top 1 points 11 months ago (1 children)

Thanks. I would guess the seqlen is the sum of the input and output length as it feeds back on itself.

[–] EvokerTCG@alien.top 1 points 11 months ago

Thanks. Yes, a 2kW heater pc would only be welcome in the winter, and could get pricy to run.

 

I understand that more memory means you can run a model with more parameters or less compression, but how does context size factor in? I believe it's possible to increase the context size, and that this will increase the initial processing time before the model starts outputting tokens, but does anyone have numbers?

Is the memory needed for context independent of the model size, or does a bigger model mean that each bit of extra context 'costs' more memory?
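For what it's worth, the context ('KV cache') memory grows linearly with context length, and the per-token cost depends on the model's layer count and width, so a bigger model does make each token of context cost more. A sketch using Llama-2-70B's published shape (80 layers, 8 KV heads of dimension 128, thanks to grouped-query attention) with an fp16 cache:

```python
# KV-cache size: keys and values (2x) for every layer, at every position.
# Defaults are Llama-2-70B's shape: 80 layers, 8 KV heads, head dim 128.
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

per_token_kb = kv_cache_bytes(1) / 1024
gb_at_32k = kv_cache_bytes(32768) / 1024**3
print(f"{per_token_kb:.0f} KB/token, {gb_at_32k:.1f} GB at 32k context")
```

So for this model each token of context costs about 320 KB, or roughly 10GB at a 32k context. Models without grouped-query attention cache all heads, which multiplies the cost several times over.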

I'm considering an M2 Ultra for the large memory and low energy per token, although the speed is behind RTX cards. Is this the best option for tasks like writing novels, where quality and comprehension of lots of text beat speed?

 

So I'm considering getting a good LLM rig, and the M2 Ultra seems to be a good option for large memory, with much lower power usage and heat than 2 to 8 3090s or 4090s, albeit at lower speeds.

I want to know if anyone is using one, and what it's like. I've read that it's less well supported by software, which could be an issue. Also, is it good for Stable Diffusion?

Another question is about memory and context length. Does a big memory let you increase the context length with smaller models, where the parameters don't fill the memory? I feel a big context would be useful for writing books and the like.

Is there anything else to consider? Thanks.

[–] EvokerTCG@alien.top 1 points 11 months ago

I haven't tried Mac and don't know what the software ecosystem is like. Have you tried it or seen it working?

It looks like it doesn't have dedicated VRAM, just memory shared with the CPU. I would guess this is slower than dedicated GPU memory but faster than RAM sticks in a normal PC?

 

So for background, I've had some interest in LLMs and other AI for a year or so. I've used online LLMs like ChatGPT but haven't tried running my own due to 10-year-old hardware. I'm considering getting a new PC and want to know whether to splash out on one that can do high-end LLM stuff.

I've read up a fair bit but have some questions that hopefully aren't too stupid.

1.) It looks like VRAM is the biggest hardware limit for model size. What are some good hardware options at different price points? Are there really expensive options that blow consumer stuff out of the water? Is now a good time to buy, or is there something worth waiting for?
2.) Open-source models seem to depend on the trainers giving away their expensively acquired work. Are you anticipating model releases to replace Llama 2, and when?
3.) Is retraining or fine-tuning possible for ordinary users? Is this meaningfully different from having a 'mission' or instruction set prepended to each prompt/context?
4.) I think I understand parameter size and compression, but what determines the token context size a model can handle? GPT-4's new massive context size is very handy.
5.) I'm interested in 'AutoGPT'-type systems (or response + validation, etc.). Can this work in series mode, where only one model runs at a time? It seems like having specialised models could be useful. Would loading the model best suited to each particular 'subroutine' slow things down a lot? Are these systems difficult to set up, or is it just a matter of feeding the output of one query into the input of the next (while adding relevant previous context)?
6.) Is the same type of hardware setup good for both LLMs and Stable Diffusion, or do they have separate setups for good bang/buck?
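On the series-mode 'AutoGPT' question above, the chaining itself really is just feeding one step's output (plus relevant context) into the next. A minimal sketch, where `run_model` is a hypothetical stand-in for whatever local inference call you end up using (here it just echoes for demonstration):

```python
# Series-mode pipeline: one model at a time, each step's output feeding
# the next. `run_model` is a hypothetical placeholder, not a real API.
def run_model(name, prompt):
    return f"[{name} output for: {prompt[:40]}]"

def pipeline(task, steps):
    context = task
    for model_name, instruction in steps:
        # Prepend the step's instruction, carry forward the prior output.
        context = run_model(model_name, f"{instruction}\n\n{context}")
    return context

result = pipeline("Write an outline for a mystery novel.",
                  [("planner", "Break this task into steps:"),
                   ("writer", "Draft the result:"),
                   ("critic", "Validate and fix issues:")])
print(result)
```

The setup cost isn't in the chaining logic but in swapping models: loading a different large model per 'subroutine' means reloading tens of GB from disk each step unless everything fits in memory at once.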

Many thanks to anyone who can help!