this post was submitted on 30 Nov 2023
LocalLLaMA
Community for discussing Llama, the family of large language models created by Meta AI.
Because it wouldn't be any faster than doing CPU inference, since both CPUs and GPUs are already waiting around for data to process. It's the I/O that's the limiter, and this changes none of that.
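To put rough numbers on the I/O point, here is a back-of-the-envelope sketch. It assumes that generating one token has to read every weight once, so the ceiling on speed is just bandwidth divided by model size. The bandwidth figures and the 4.5 bits per weight are illustrative assumptions, not measurements of any particular machine.

```python
def model_size_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in bytes."""
    return params * bits_per_weight / 8


def tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on generation speed if every token reads all weights once."""
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    # Assumed: a 70B model at roughly 4.5 bits per weight (a Q4-style quant).
    size = model_size_bytes(70e9, 4.5)

    # Assumed, illustrative bandwidths in bytes per second.
    sources = {
        "NVMe SSD (assumed 3 GB/s)": 3e9,
        "system RAM (assumed 50 GB/s)": 50e9,
        "GPU VRAM (assumed 400 GB/s)": 400e9,
    }
    for name, bandwidth in sources.items():
        print(f"{name}: at most {tokens_per_second(size, bandwidth):.2f} tokens/s")
```

Whatever does the math, the slowest place the weights have to be read from sets the ceiling, which is why streaming from disk ends up so slow.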
Is there a better way to use models bigger than can fit in RAM/VRAM? I'd want to try 70b or maybe even 120b, but I only have 32/8 GB.
70b? Q4, llama.cpp, some layers on gpu.
Might need to run Linux to get the system RAM usage low enough.
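A rough sketch of why that split is tight on 32/8 GB. It assumes about 4.5 bits per weight for Q4, 80 layers of equal size for a 70b Llama model, and 1.5 GB of VRAM held back for context and buffers. The function name and the reserve figure are made up for illustration, and real quant files and per-layer sizes will differ somewhat.

```python
def split_layers(params, bits_per_weight, n_layers, vram_bytes, vram_reserve_bytes):
    """Estimate how many layers fit on the GPU and how much is left for system RAM.

    Assumes all layers are the same size and ignores embeddings and the KV cache
    beyond the fixed reserve.
    """
    model_bytes = params * bits_per_weight / 8
    per_layer = model_bytes / n_layers
    usable_vram = max(0.0, vram_bytes - vram_reserve_bytes)
    gpu_layers = min(n_layers, int(usable_vram // per_layer))
    ram_bytes = (n_layers - gpu_layers) * per_layer
    return gpu_layers, ram_bytes


if __name__ == "__main__":
    # Assumed: 70B params, 4.5 bits/weight, 80 layers, 8 GB VRAM, 1.5 GB reserve.
    gpu_layers, ram_bytes = split_layers(70e9, 4.5, 80, 8e9, 1.5e9)
    print(f"layers on GPU: {gpu_layers}")
    print(f"left for system RAM: {ram_bytes / 1e9:.1f} GB")
```

The layer count is what you would pass to llama.cpp's -ngl (--n-gpu-layers) option. Under these assumptions the part left in system RAM comes out close to the full 32 GB, which is why keeping the OS footprint small matters.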