A big issue for CPU-only setups is prompt processing. They're kind of OK for short chats, but if you give them a full context, the processing time is miserable. Nowhere close to 5 tok/sec.
There is one exception: the Xeon Max with HBM. It is not cheap.
So if you get a server, at least get a small GPU with it to offload prompt processing.
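For example, with llama-cpp-python you can offload just a handful of layers; on GPU-enabled builds even a small card speeds up the prompt batch noticeably. A minimal sketch, assuming a CUDA/Metal build and a placeholder model path:

```python
# Minimal sketch with llama-cpp-python, assuming a GPU-enabled build
# (CUDA/Metal). Model path and parameter values are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7b.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=8,   # offload a few layers; even a small GPU helps prompt processing
    n_ctx=4096,       # full context window
    n_threads=8,      # CPU threads handle the remaining layers
)

out = llm("Summarize this document: ...", max_tokens=128)
print(out["choices"][0]["text"])
```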
That's where context shifting comes into play: instead of reprocessing the entire context over and over again, only the changes get evaluated.
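To make that concrete, here's a rough sketch of the KV-cache prefix reuse that context shifting builds on. This is not any particular library's implementation; eval_tokens() is a hypothetical stand-in for the expensive forward pass:

```python
# Rough sketch of KV-cache prefix reuse, the idea behind context shifting.
# eval_tokens() is a hypothetical stand-in for the real forward pass.

def common_prefix_len(a: list[int], b: list[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class KVCachedModel:
    def __init__(self) -> None:
        self.cached: list[int] = []  # tokens already in the KV cache

    def eval_tokens(self, tokens: list[int]) -> None:
        print(f"evaluating {len(tokens)} new tokens")  # placeholder for real work

    def process_prompt(self, tokens: list[int]) -> None:
        keep = common_prefix_len(self.cached, tokens)
        self.cached = self.cached[:keep]   # drop cache entries past the shared prefix
        self.eval_tokens(tokens[keep:])    # only the changed tail is processed
        self.cached.extend(tokens[keep:])

m = KVCachedModel()
m.process_prompt(list(range(1000)))           # first turn: evaluates 1000 tokens
m.process_prompt(list(range(1000)) + [7, 8])  # next turn: evaluates only 2
```

So across a long chat, each new turn only pays for the newly appended tokens rather than the whole history, which is what makes CPU-only setups bearable in practice.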