I believe GPU offloading in llama.cpp lets you effectively combine your VRAM and RAM. I'd suggest trying an Airoboros Llama 2 70B Q3_K_M quant, and Tess-M-1.3 at Q5_K_M once TheBloke makes quants. There will be some leftover RAM after loading Tess, but it's a 200k-context model, so you'll need that space for the context. Max out your VRAM, and maybe use a batch size of -1 to trade prompt-processing speed for more free VRAM. Try offloading with both cuBLAS and CLBlast; last time I checked, CLBlast seemed to let me offload more layers to the GPU within the same memory footprint.
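
Roughly, the workflow looks like this (a rough sketch, assuming a recent llama.cpp checkout; the model filename, layer count, and context/batch values are placeholders you'd adjust for your hardware):

```bash
# Build llama.cpp with cuBLAS (NVIDIA) or with CLBlast (OpenCL);
# try both, since how many layers fit in VRAM can differ between backends.
make clean && make LLAMA_CUBLAS=1     # or: make LLAMA_CLBLAST=1

# Run a quantized model and offload as many layers as fit in VRAM.
# -ngl = number of layers pushed to the GPU (raise until VRAM is nearly full,
#        the remaining layers stay in system RAM)
# -c   = context size (a long-context model needs much more RAM here)
# -b   = batch size (smaller trades prompt-processing speed for memory)
# The GGUF filename below is just a placeholder.
./main -m ./models/airoboros-l2-70b.Q3_K_M.gguf -ngl 40 -c 4096 -b 256 \
       -p "Hello"
```

Watch the memory usage while you raise `-ngl`; the sweet spot is the highest layer count that doesn't spill out of VRAM.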