this post was submitted on 27 Nov 2023
LocalLLaMA
Community to discuss Llama, the family of large language models created by Meta AI.
Well, if you use llama.cpp with the https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGUF model and its Q5_K_M quantisation file, it uses 51.25 GB of memory. Your 24 GB card can therefore hold less than half of that file's layers, so if you're offloading fewer than half the layers to the graphics card, it will be less than twice as fast as CPU only. Have you tried a quantised model like that with CPU only?
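For reference, here's a minimal sketch of that partial offloading, using the llama-cpp-python bindings rather than the llama.cpp CLI. The local file name and the n_gpu_layers value are assumptions: Llama-2-70B has 80 layers, and at Q5_K_M something in the region of 35 layers is roughly what fits in 24 GB of VRAM, so you'd tune that number up or down until you stop running out of memory.

```python
from llama_cpp import Llama

# Assumed local path to the Q5_K_M file from TheBloke/Llama-2-70B-Chat-GGUF
MODEL_PATH = "./llama-2-70b-chat.Q5_K_M.gguf"

# ~51 GB spread over 80 layers is roughly 0.6 GB per layer,
# so ~35 layers is a guess at what a 24 GB card can take.
llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=35,  # layers offloaded to the GPU; the rest run on the CPU
    n_ctx=2048,       # context window; larger values need more memory
)

# Simple completion to time against a CPU-only baseline (n_gpu_layers=0)
output = llm("Explain GGUF quantisation in one sentence:", max_tokens=64)
print(output["choices"][0]["text"])
```

Running the same prompt with n_gpu_layers=0 gives you the CPU-only number the comment above is asking about, so you can see directly how much the partial offload actually buys you.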