A big issue for CPU-only setups is prompt processing. They're kind of OK for short chats, but if you give them a full context, the processing time is miserable. Nowhere close to 5 tok/sec.
There is one exception: the Xeon Max with HBM. It is not cheap.
So if you get a server, at least get a small GPU with it to offload prompt processing.
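For example, with llama-cpp-python you can offload just a handful of layers; on GPU-enabled builds even a small card speeds up the prompt batch noticeably. A minimal sketch, assuming a CUDA/Metal build and a placeholder model path:

```python
# Minimal sketch with llama-cpp-python, assuming a GPU-enabled build
# (CUDA/Metal). Model path and parameter values are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/7b.Q4_K_M.gguf",  # hypothetical path
    n_gpu_layers=8,   # offload a few layers; even a small GPU helps prompt processing
    n_ctx=4096,       # full context window
    n_threads=8,      # CPU threads handle the remaining layers
)

out = llm("Summarize this document: ...", max_tokens=128)
print(out["choices"][0]["text"])
```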
That's where context shifting comes into play: instead of reprocessing the entire context over and over again, only the changes get evaluated.
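To make that concrete, here's a rough sketch of the KV-cache prefix reuse that context shifting builds on. This is not any particular library's implementation; eval_tokens() is a hypothetical stand-in for the expensive forward pass:

```python
# Rough sketch of KV-cache prefix reuse, the idea behind context shifting.
# eval_tokens() is a hypothetical stand-in for the real forward pass.

def common_prefix_len(a: list[int], b: list[int]) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class KVCachedModel:
    def __init__(self) -> None:
        self.cached: list[int] = []  # tokens already in the KV cache

    def eval_tokens(self, tokens: list[int]) -> None:
        print(f"evaluating {len(tokens)} new tokens")  # placeholder for real work

    def process_prompt(self, tokens: list[int]) -> None:
        keep = common_prefix_len(self.cached, tokens)
        self.cached = self.cached[:keep]   # drop cache entries past the shared prefix
        self.eval_tokens(tokens[keep:])    # only the changed tail is processed
        self.cached.extend(tokens[keep:])

m = KVCachedModel()
m.process_prompt(list(range(1000)))           # first turn: evaluates 1000 tokens
m.process_prompt(list(range(1000)) + [7, 8])  # next turn: evaluates only 2
```

So across a long chat, each new turn only pays for the newly appended tokens rather than the whole history, which is what makes CPU-only setups bearable in practice.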