this post was submitted on 08 Nov 2023
LocalLLaMA
Community to discuss Llama, the family of large language models created by Meta AI.
Here's a 7B llama.cpp bench on a 3090 and 7800X3D, with CL28 DDR5 6000 RAM.
All layers offloaded to GPU:
And here is the same bench with just 2 of the 35 layers offloaded to CPU:
As you can see, the moment you offload even a little to the CPU, performance takes a hard hit. More than a few layers and the hit becomes severe.
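For anyone wanting to reproduce this kind of comparison, here is a rough sketch using the llama-cpp-python bindings. The model path, prompt, and exact layer counts are my own placeholders, not the setup used for the numbers above, and the timing includes prompt processing.

```python
# Sketch: compare token throughput with full vs. partial GPU offload.
# Assumes llama-cpp-python built with CUDA support and a local GGUF model.
import time
from llama_cpp import Llama

MODEL_PATH = "models/llama-2-7b.Q4_K_M.gguf"  # hypothetical path

def bench(n_gpu_layers: int, n_tokens: int = 128) -> float:
    """Generate n_tokens and return rough tokens per second for a given offload."""
    llm = Llama(model_path=MODEL_PATH, n_gpu_layers=n_gpu_layers, verbose=False)
    start = time.perf_counter()
    llm("Once upon a time", max_tokens=n_tokens)
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# 35 = every layer on the 3090; 33 leaves two layers on the CPU.
for layers in (35, 33):
    print(f"{layers} GPU layers: {bench(layers):.1f} tok/s")
```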
Here is exllamav2 for reference, though the time also includes prompt processing, so it's actually faster than indicated:
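A timed exllamav2 run could look roughly like the sketch below, using the exllamav2 Python API with an EXL2-quantised model. The model directory is a placeholder, and as noted above the elapsed time covers prompt processing as well as generation.

```python
# Sketch: time simple generation with exllamav2 (model directory is hypothetical).
import time
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "models/llama2-7b-exl2"  # hypothetical path
config.prepare()

model = ExLlamaV2(config)
model.load()
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
num_tokens = 128

start = time.perf_counter()
generator.generate_simple("Once upon a time", settings, num_tokens)
elapsed = time.perf_counter() - start
print(f"{num_tokens / elapsed:.1f} tok/s (includes prompt processing)")
```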
Thanks for your data! Could you run the test again with the Phind CodeLlama 34B model?