this post was submitted on 29 Nov 2023
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
you are viewing a single comment's thread
Dual CPUs would have terrible performance. During token generation the processor re-reads the entire model for every token, so if you spread half the model into a second CPU's memory, the first CPU's cores would have to read that half through the slow inter-CPU link, and vice versa for the second CPU's cores. For this to work well, llama.cpp would need a scheme to split the workload across multiple CPUs, the way it already does across multiple GPUs.
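Since token generation is memory-bandwidth-bound (the weights are streamed from RAM once per token), a rough upper bound on speed is bandwidth divided by model size. A minimal sketch; the bandwidth and model-size figures are illustrative assumptions, not measurements:

```python
# Rough upper bound on generation speed for a memory-bandwidth-bound
# workload: each token requires reading all model weights from RAM once.
# Numbers below are assumed for illustration.

def tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Best-case tokens/s if weight reads fully saturate memory bandwidth."""
    return bandwidth_gb_s / model_size_gb

# e.g. ~100 GB/s of usable bandwidth and a 40 GB quantized model:
print(tokens_per_second(100.0, 40.0))  # → 2.5
```

Real throughput lands below this bound, but it shows why halving effective bandwidth with a slow inter-CPU link hurts so much.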
That's why llama.cpp has the `--numa` option.
In my experience the number of memory channels matters a lot, so all memory slots (one per channel, at minimum) had better be populated.
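Peak bandwidth scales linearly with populated channels: each DDR channel has a 64-bit (8-byte) bus, so peak GB/s ≈ channels × transfer rate (MT/s) × 8 bytes. A sketch with assumed DDR5-4800 figures:

```python
def peak_bandwidth_gb_s(channels: int, mt_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak = channels * transfers/s * bus width (64-bit = 8 bytes)."""
    return channels * mt_per_s * bus_bytes / 1e3  # MT/s * bytes -> GB/s

# Assumed examples with DDR5-4800:
print(peak_bandwidth_gb_s(8, 4800))  # 307.2 GB/s (8-channel server CPU)
print(peak_bandwidth_gb_s(2, 4800))  # 76.8 GB/s (typical dual-channel desktop)
```

This is why a server with every channel populated can be ~4x faster than a desktop at CPU inference, and why leaving channels empty wastes bandwidth the model reads depend on.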
There is a NUMA-aware option in llama.cpp.
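For reference, recent llama.cpp builds let `--numa` take a strategy argument (`distribute`, `isolate`, or `numactl`), while older builds used it as a bare switch. A hedged sketch; the binary name, model path, and prompt are assumptions you should adjust to your setup:

```shell
# Hypothetical paths -- adjust to your build and model.
# Spread threads and their memory allocations across NUMA nodes:
./llama-cli -m ./models/llama-2-70b.Q4_K_M.gguf --numa distribute -p "Hello"

# Or interleave allocations yourself with numactl and tell llama.cpp so:
numactl --interleave=all \
  ./llama-cli -m ./models/llama-2-70b.Q4_K_M.gguf --numa numactl -p "Hello"
```

The llama.cpp docs also suggest dropping the page cache (as root: `echo 3 > /proc/sys/vm/drop_caches`) before the first NUMA run, so stale cached model pages don't sit on the wrong node.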