this post was submitted on 20 Nov 2023
LocalLLaMA
Community to discuss Llama, the family of large language models created by Meta AI.
I would suggest using Koboldcpp and running a GGUF. A 70B Q5 model, with around 40 layers offloaded to the GPU, should get more than 1 t/s. At least for me, I get 1.5 t/s with a 4090 and 64GB RAM using Q5_K_M.
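For reference, a launch command along these lines should work (a minimal sketch: the model filename is a placeholder, and the flags follow koboldcpp's documented CLI, so double-check them against --help):

```
# Minimal sketch: launch koboldcpp with CUDA offload.
# The model path is a placeholder; tune --gpulayers to what fits in your VRAM.
python koboldcpp.py \
  --model models/llama-2-70b.Q5_K_M.gguf \
  --usecublas \
  --gpulayers 40 \
  --contextsize 4096
```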
I could never get it up and running on Linux with Nvidia. I used Kobold on Windows, but boy is it painful on Linux.
Well, I have never used Linux before, since the main purpose of my PC is gaming. But I heard running LLMs on Linux is faster overall.
It is… but koboldcpp doesn't have an executable for me to run :/
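On Linux the usual route is building from source rather than downloading a binary. A minimal sketch (LLAMA_CUBLAS=1 was the documented CUDA build flag around this time, but check the current README):

```
# Minimal sketch: build koboldcpp from source on Linux with CUDA support.
git clone https://github.com/LostRuins/koboldcpp
cd koboldcpp
make LLAMA_CUBLAS=1
# Then launch via the python script, as in the command above.
```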