this post was submitted on 27 Nov 2023
1 points (100.0% liked)
LocalLLaMA
3 readers
1 users here now
Community to discuss about Llama, the family of large language models created by Meta AI.
founded 1 year ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
Well... none at all if you're happy with ~1 token per second or less using GGUF CPU inference.
I have 1 x 3090 24GB and get about 2 tokens per second with partial offload. I find it usable for most stuff but many people find that too slow.
You'd need 2 x 3090 or an A6000 or something to do it quickly.