this post was submitted on 21 Nov 2023
LocalLLaMA

Is this accurate?

[–] randomfoo2@alien.top 1 points 10 months ago (1 children)

I think ExLlama (and ExLlamaV2) is great. EXL2's ability to quantize to arbitrary bpw and its incredibly fast prefill processing generally make it the best real-world choice for modern consumer GPUs. However, in testing on my workstations (5950X CPU and 3090/4090 GPUs), llama.cpp actually edges out ExLlamaV2 for inference speed (with a q4_0 even beating out a 3.0bpw), so I don't think it's quite so cut and dried.
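For context on why the q4_0 vs 3.0bpw result stands out: llama.cpp's q4_0 stores each block of 32 weights as 4-bit values plus one 16-bit scale, which works out to 4.5 bits per weight, so it has to read roughly 1.5x as many weight bytes per token as a 3.0bpw EXL2 quant. A rough sketch of that arithmetic follows. The 7B parameter count is just an illustrative assumption, and the estimate ignores the KV cache and any tensors kept at higher precision.

```python
def weight_gib(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only)."""
    return n_params * bpw / 8 / 2**30


# q4_0: 32 weights x 4 bits, plus one 16-bit scale per block of 32.
Q4_0_BPW = (32 * 4 + 16) / 32  # 4.5 bits per weight

# Hypothetical parameter count, for illustration only.
N_PARAMS = 7e9

q4_0_size = weight_gib(N_PARAMS, Q4_0_BPW)
exl2_3bpw_size = weight_gib(N_PARAMS, 3.0)

print(f"q4_0    : {q4_0_size:.2f} GiB")
print(f"3.0 bpw : {exl2_3bpw_size:.2f} GiB")
print(f"ratio   : {q4_0_size / exl2_3bpw_size:.2f}x")
```

Since batch=1 token generation is usually limited by memory bandwidth, you would naively expect the smaller quant to be faster, which is why it's worth measuring rather than assuming.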

For those looking for max batch=1 perf, I'd highly recommend running your own benchmarks at home on your own system to see what works (and pay attention to prefill speeds if you often have long context)!

My benchmarks from a month or two ago: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=1788227831
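If you want to run your own comparison, a minimal timing harness might look like the sketch below. It reports prefill and generation speed separately, since the two can rank backends differently. The `prefill` and `decode_one` callables are hypothetical placeholders that you would wire up to whichever loader you are testing; they are not part of the llama.cpp or ExLlamaV2 APIs.

```python
import time
from typing import Callable, Dict


def benchmark(
    prefill: Callable[[int], None],   # hypothetical: processes a prompt of N tokens
    decode_one: Callable[[], None],   # hypothetical: generates exactly one token
    prompt_tokens: int,
    gen_tokens: int,
) -> Dict[str, float]:
    """Time prompt processing and token generation separately (batch=1)."""
    t0 = time.perf_counter()
    prefill(prompt_tokens)
    t1 = time.perf_counter()
    for _ in range(gen_tokens):
        decode_one()
    t2 = time.perf_counter()
    # Guard against a zero interval if a callable returns instantly.
    return {
        "prefill_tok_s": prompt_tokens / max(t1 - t0, 1e-9),
        "decode_tok_s": gen_tokens / max(t2 - t1, 1e-9),
    }


if __name__ == "__main__":
    # Stand-in backend that just sleeps, so the script runs on its own.
    result = benchmark(
        prefill=lambda n: time.sleep(0.02),
        decode_one=lambda: time.sleep(0.001),
        prompt_tokens=100,
        gen_tokens=20,
    )
    print(result)
```

Run each backend with the same model, prompt length, and generation length, and do a warm-up pass first so one-time load costs don't end up in the numbers.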

[–] tgredditfc@alien.top 1 points 10 months ago

Thanks for sharing! I have been struggling with the llama.cpp loader and GGUF (using oobabooga and the same model). No matter how I set the parameters or how many layers I offload to the GPUs, llama.cpp is way slower than ExLlama (v1 and v2): not just a bit slower, but an order of magnitude slower. I really don't know why.