this post was submitted on 21 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Is this accurate?

[–] tgredditfc@alien.top 1 points 11 months ago (2 children)

In my experience ExLlama is the fastest and llama.cpp is the slowest.

[–] randomfoo2@alien.top 1 points 11 months ago (1 children)

I think ExLlama (and ExLlamaV2) is great. EXL2's ability to quantize to arbitrary bpw and its incredibly fast prefill processing generally make it the best real-world choice for modern consumer GPUs. However, from testing on my workstations (5950X CPU, 3090/4090 GPUs), llama.cpp actually edges out ExLlamaV2 for inference speed (a q4_0 even beats a 3.0bpw EXL2 quant), so I don't think it's quite so cut and dried.
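
For context on that q4_0 vs. 3.0bpw comparison, here is a rough back-of-the-envelope sketch (my own illustration, not taken from the benchmarks below): GGUF q4_0 packs 32 four-bit weights plus an fp16 scale per block, so it effectively spends about 4.5 bits per weight, noticeably more than a 3.0bpw EXL2 quant. The 70B parameter count is just a placeholder.

```python
# Back-of-the-envelope weight-storage comparison (illustrative only).
# Assumes the GGUF q4_0 layout: blocks of 32 weights, each block storing
# 16 bytes of 4-bit quants + 2 bytes fp16 scale = 18 bytes -> 4.5 bits/weight.

def q4_0_bits_per_weight() -> float:
    block_weights = 32
    block_bytes = block_weights // 2 + 2   # 16 bytes of nibbles + fp16 scale
    return block_bytes * 8 / block_weights

def weights_gib(n_params: float, bpw: float) -> float:
    """Approximate weight storage in GiB for n_params at bpw bits per weight."""
    return n_params * bpw / 8 / 2**30

n_params = 70e9                            # placeholder 70B-parameter model
print(f"q4_0 ~{q4_0_bits_per_weight():.2f} bpw -> {weights_gib(n_params, 4.5):.1f} GiB")
print(f"EXL2  3.00 bpw -> {weights_gib(n_params, 3.0):.1f} GiB")
```

So a q4_0 quant simply carries more bits per weight than a 3.0bpw EXL2 quant, which is worth keeping in mind when comparing their speed and quality head to head.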

For those looking for max batch=1 performance, I'd highly recommend running your own benchmarks on your own system to see what works best (and pay attention to prefill speed if you often work with long contexts)!

My benchmarks from a month or two ago: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=1788227831
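
If it helps anyone get started, here is a minimal sketch of that kind of batch=1 measurement using the llama-cpp-python bindings (the model path, prompt, and parameters are placeholders; an equivalent script would be needed on the ExLlamaV2 side for a fair comparison). With verbose enabled, llama.cpp prints its own prompt-eval vs. eval timings to stderr, which is the easiest way to separate prefill speed from generation speed; use a long prompt if prefill is what you care about.

```python
# Rough batch=1 throughput check with the llama-cpp-python bindings
# (model path and prompt are placeholders; adjust n_gpu_layers to your VRAM).
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-q4_0.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload all layers; lower this if you run out of VRAM
    n_ctx=4096,
    verbose=True,      # llama.cpp prints prompt-eval vs. eval timings to stderr
)

prompt = "Write a short story about a llama that learns to code."
start = time.perf_counter()
out = llm(prompt, max_tokens=256, temperature=0.7)
elapsed = time.perf_counter() - start

usage = out["usage"]
print(f"prompt tokens:    {usage['prompt_tokens']}")
print(f"generated tokens: {usage['completion_tokens']}")
print(f"overall speed:    {usage['completion_tokens'] / elapsed:.1f} tok/s (incl. prefill)")
```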

[–] tgredditfc@alien.top 1 points 11 months ago

Thanks for sharing! I have been struggling with the llama.cpp loader and GGUF (using oobabooga and the same model): no matter how I set the parameters or how many layers I offload to the GPUs, llama.cpp is way slower than ExLlama (v1 and v2), not just a bit slower but an order of magnitude slower. I really don't know why.
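
In case it helps with debugging, here is a minimal sketch (assuming the llama-cpp-python bindings directly rather than the oobabooga UI; the model path and layer counts are placeholders) that sweeps n_gpu_layers and reports tokens/sec, which makes it easy to see whether GPU offload is actually taking effect:

```python
# Sweep n_gpu_layers and compare generation speed to verify that GPU offload
# is actually kicking in (path, prompt, and layer counts are placeholders).
import time
from llama_cpp import Llama

MODEL_PATH = "./models/model-q4_0.gguf"   # placeholder path
PROMPT = "Explain the difference between GGUF and EXL2 quantization."

for n_gpu_layers in (0, 20, 40, -1):      # -1 = offload every layer
    llm = Llama(model_path=MODEL_PATH, n_gpu_layers=n_gpu_layers,
                n_ctx=2048, verbose=False)
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=128)
    elapsed = time.perf_counter() - start
    tok_s = out["usage"]["completion_tokens"] / elapsed
    print(f"n_gpu_layers={n_gpu_layers:>3}: {tok_s:.1f} tok/s")
    del llm                                # release the model before the next run
```

If the numbers barely change as layers are offloaded, the GPU build or the offload setting is probably not being applied at all, which would explain an order-of-magnitude gap.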