this post was submitted on 20 Nov 2023
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
If you're only getting 0.1 t/s, then you've probably overshot your layer offloading, i.e. offloaded more layers than your VRAM can hold.
I can get up to 1.5 t/s with a 3090 at Q5_K_M.
Try running llama.cpp from the command line with 30 layers offloaded to the GPU, and make sure your thread count is set to match your physical (not logical) CPU core count.
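For anyone who wants to script that, here's a rough sketch in Python that builds the command line. The `-m`, `-ngl` and `-t` flags are llama.cpp's own; everything else is an assumption on my part: the `./main` binary name (newer builds call it `llama-cli`), the `model.gguf` path, and the helper names. The core-count fallback also just assumes 2 hardware threads per core, so check your actual CPU.

```python
import os


def physical_core_guess():
    """Best-effort physical core count (hypothetical helper)."""
    try:
        import psutil  # optional third-party package; reports physical cores
        cores = psutil.cpu_count(logical=False)
        if cores:
            return cores
    except ImportError:
        pass
    # Fallback ASSUMPTION: 2 hardware threads per physical core (SMT).
    return max(1, (os.cpu_count() or 2) // 2)


def build_llama_cmd(model_path, gpu_layers=30, threads=None, binary="./main"):
    """Build an argument list for llama.cpp with GPU offload and a thread count.

    binary and model_path are placeholders; adjust them to your own build.
    """
    if threads is None:
        threads = physical_core_guess()
    return [binary, "-m", model_path, "-ngl", str(gpu_layers), "-t", str(threads)]


if __name__ == "__main__":
    # Print the command instead of running it, so you can inspect it first.
    print(" ".join(build_llama_cmd("model.gguf")))
```

If 30 layers still crawls, lower `gpu_layers` a few at a time until the speed recovers.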
The other problem you're likely running into is that 64GB of RAM is cutting it pretty close. Keep your base OS usage below 8GB if possible, and try memory-locking the model on load. With that amount of system RAM, it's possible that other running applications are causing the OS to page the model data out to disk, which kills performance.
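To get a feel for how tight that is, here's a back-of-the-envelope sketch in Python. It assumes every layer is the same size and ignores the context/KV cache, so treat the result as a rough guide only. The function name and the example numbers (a 48GB model file with 80 layers) are made up for illustration; plug in your own. Memory locking itself is llama.cpp's `--mlock` flag.

```python
def ram_headroom_gb(model_file_gb, total_layers, gpu_layers,
                    system_ram_gb=64.0, os_usage_gb=8.0):
    """Rough estimate of free system RAM after loading the CPU-side layers.

    ASSUMPTIONS: all layers are equally sized, and context/KV cache and
    other buffers are ignored, so the real headroom will be smaller.
    """
    cpu_fraction = (total_layers - gpu_layers) / total_layers
    cpu_resident_gb = model_file_gb * cpu_fraction
    return system_ram_gb - os_usage_gb - cpu_resident_gb


if __name__ == "__main__":
    # Hypothetical example: 48GB file, 80 layers, 30 of them on the GPU.
    headroom = ram_headroom_gb(48.0, total_layers=80, gpu_layers=30)
    print(f"Estimated headroom: {headroom:.1f} GB")
    if headroom < 0:
        print("The model will not fit in RAM; expect paging to disk.")
```

A small or negative number here means anything else you have open can push the model out to disk.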
Holy crap...
Yeah… I thought I'd at least be “in the room” when I bought my setup last year, but it turns out I'm outside in the gutter 🫣😢