this post was submitted on 21 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

Is this accurate?

top 22 comments
[–] ModeradorDoFariaLima@alien.top 1 points 11 months ago (1 children)

Too bad Windows support for it was lacking, at least last time I checked. It needs a separate component to work properly, and that component was Linux-only.

[–] ViennaFox@alien.top 1 points 11 months ago

It works fine for me. I am also using a 3090 and text-gen-webui like Liquiddandruff.

[–] tgredditfc@alien.top 1 points 11 months ago (1 children)

In my experience it’s the fastest and llama.cpp is the slowest.

[–] randomfoo2@alien.top 1 points 11 months ago (1 children)

I think ExLlama (and ExLlamaV2) is great. EXL2's ability to quantize to arbitrary bpw and its incredibly fast prefill processing generally make it the best real-world choice for modern consumer GPUs. However, from testing on my workstations (5950X CPU and 3090/4090 GPUs), llama.cpp actually edges out ExLlamaV2 for inference speed (with a q4_0 beating out a 3.0bpw, even), so I don't think it's quite so cut and dried.

For those looking for max batch=1 performance, I'd highly recommend running your own benchmarks at home on your own system to see what works best (and pay attention to prefill speeds if you often use long contexts)!

My benchmarks from a month or two ago: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=1788227831
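
If you want to collect your own batch=1 numbers, a minimal timing sketch with the exllamav2 Python API looks roughly like this. The model path and settings are placeholders, and the API may have shifted since these releases, so treat it as a starting point rather than a canonical benchmark:

```python
# Rough batch=1 throughput check with exllamav2 (model path is a placeholder).
import time
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/my-7b-exl2-4.0bpw"   # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load()                                      # optionally pass a gpu_split list here
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

prompt = "Once upon a time"
new_tokens = 256

generator.warmup()
start = time.perf_counter()
output = generator.generate_simple(prompt, settings, new_tokens)
elapsed = time.perf_counter() - start

print(output)
# Note: this lumps prompt processing and generation into one number.
print(f"{new_tokens / elapsed:.1f} tokens/s (prompt + generation)")
```

Timing prompt ingestion separately would give a cleaner picture of prefill speed, which matters a lot at long context.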

[–] tgredditfc@alien.top 1 points 11 months ago

Thanks for sharing! I have been struggling with the llama.cpp loader and GGUF (using oobabooga and the same LLM model). No matter how I set the parameters and how many layers I offload to the GPUs, llama.cpp is way slower than ExLlama (v1 and v2): not just a bit slower, but an order of magnitude slower. I really don't know why.

[–] CardAnarchist@alien.top 1 points 11 months ago

Can you offload layers with this like GGUF?

I don't have much VRAM / RAM so even when running a 7B I have to partially offload layers.
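
For context, this is the GGUF-style partial offload I mean, e.g. with llama-cpp-python (the path, layer count, and context size below are just illustrative):

```python
# GGUF partial offload: keep some layers in VRAM, leave the rest on CPU RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/my-7b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # number of layers offloaded to the GPU
    n_ctx=4096,        # context window
)

out = llm("Q: What is the capital of France?\nA:", max_tokens=32)
print(out["choices"][0]["text"])
```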

[–] llama_in_sunglasses@alien.top 1 points 11 months ago

I've tested pretty much all of the available quantization methods and I prefer exllamav2 for everything I run on GPU, it's fast and gives high quality results. If anyone wants to experiment with some different calibration parquets, I've taken a portion of the PIPPA data and converted it into various prompt formats, along with a portion of the synthia instruction/response pairs that I've also converted into different prompt formats. I've only tested them on OpenHermes, but they did make coherent models that all produce different generation output from the same prompt.

https://desync.xyz/calsets.html
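
To give an idea of the kind of conversion involved, here's a rough sketch (not my actual script): pull instruction/response pairs out of a parquet, rewrite them into a prompt format such as ChatML, and write them back out as a single text column to use as a calibration set. Column and file names here are made up.

```python
# Sketch: rewrite instruction/response pairs into a chosen prompt format and save
# the result as a parquet calibration set. Column names are hypothetical.
import pandas as pd

df = pd.read_parquet("synthia_subset.parquet")   # assumed columns: "instruction", "response"

def to_chatml(row):
    return (
        "<|im_start|>user\n" + row["instruction"] + "<|im_end|>\n"
        "<|im_start|>assistant\n" + row["response"] + "<|im_end|>\n"
    )

calib = pd.DataFrame({"text": df.apply(to_chatml, axis=1)})
calib.to_parquet("calibration_chatml.parquet", index=False)
```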

[–] JoseConseco_@alien.top 1 points 11 months ago

So how much VRAM would be required for a 34B model, or a 14B model? I assume no CPU offloading, right? With my 12 GB of VRAM, I guess I could only fit a 14-billion-parameter model, maybe not even that.
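
My rough mental math, assuming weight memory is about parameter count times bits per weight divided by 8, ignoring KV cache and runtime overhead:

```python
# Back-of-envelope for weight memory at a given EXL2 bitrate
# (ignores KV cache and overhead, so treat it as a lower bound).
def weight_gb(n_params_billion: float, bpw: float) -> float:
    return n_params_billion * 1e9 * bpw / 8 / 1e9  # bits -> bytes -> GB

for size in (7, 13, 34):
    for bpw in (3.0, 4.0, 5.0):
        print(f"{size}B @ {bpw} bpw ≈ {weight_gb(size, bpw):.1f} GB for weights")
```

By that estimate a 13-14B model at ~4 bpw needs roughly 7 GB just for weights, while a 34B needs around 13 GB even at 3 bpw, so 12 GB of VRAM does look like ~13B territory once cache and overhead are added.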

[–] CasimirsBlake@alien.top 1 points 11 months ago

No chance of running this on P40s any time soon?

[–] kpodkanowicz@alien.top 1 points 11 months ago

It's not just great. It's a piece of art.

[–] beezbos_trip@alien.top 1 points 11 months ago (1 children)

Does it run on Apple Silicon?

[–] intellidumb@alien.top 1 points 11 months ago

Based on the releases, doesn’t look like it. https://github.com/turboderp/exllamav2/releases

[–] SomeOddCodeGuy@alien.top 1 points 11 months ago

I wish there was support for Metal with ExLlamaV2. :(

[–] mlabonne@alien.top 1 points 11 months ago (1 children)

I'm the author of this article, thank you for posting it! If you don't want to use Medium, here's the link to the article on my blog: https://mlabonne.github.io/blog/posts/ExLlamaV2_The_Fastest_Library_to_Run%C2%A0LLMs.html

[–] ReturningTarzan@alien.top 1 points 11 months ago (1 children)

I'm a little surprised by the mention of chatcode.py, which was merged into chat.py almost two months ago. Also, it doesn't really require flash-attn-2 to run "properly"; it just runs a little better that way. It's perfectly usable without it.

Great article, though. Thanks. :)

[–] mlabonne@alien.top 1 points 11 months ago

Thanks for your excellent library! That makes sense, because I started writing this article about two months ago (chatcode.py is still mentioned in the README.md, by the way). I had very low throughput using ExLlamaV2 without flash-attn-2. Do you know if that's still the case? I've updated these two points, thanks for your feedback.

[–] a_beautiful_rhind@alien.top 1 points 11 months ago

Hey he finally gets some recognition.

[–] MonkeyMaster64@alien.top 1 points 11 months ago

Is this able to use CPU (similar to llama.cpp)?

[–] Darius510@alien.top 1 points 11 months ago (1 children)

God, I can't wait until we're past the command-line era of this stuff.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

I'm the opposite. I shun everything LLM that isn't command line when I can. Everything has its place. When dealing with media, a GUI is the way to go, but when dealing with text, the command line is fine. I don't need animated pop-up bubbles.

[–] lxe@alien.top 1 points 11 months ago (1 children)

Agreed. Best performance running GPTQs. Missing the HF samplers, but that's OK.

[–] ReturningTarzan@alien.top 1 points 11 months ago

I recently added Mirostat, min-P (the new one), tail-free sampling, and temperature-last as an option. I don't personally put much stock in having an overabundance of sampling parameters, but they are there now for better or worse. So for the exllamav2 (non-HF) loader in TGW, it can't be long before there's an update to expose those parameters in the UI.
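
For anyone unfamiliar with min-P, the rule is simple: keep only the tokens whose probability is at least some fraction of the most likely token's probability, then renormalize. A toy illustration of the rule (not the actual exllamav2 implementation):

```python
# Min-P filtering: keep tokens whose probability is >= min_p * (top token's probability).
import numpy as np

def min_p_filter(logits: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    probs = np.exp(logits - logits.max())      # softmax, numerically stable
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()        # threshold relative to the top token
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()           # renormalize over surviving tokens

rng = np.random.default_rng(0)
logits = rng.normal(size=32000)                # stand-in for a model's output logits
probs = min_p_filter(logits, min_p=0.1)
next_token = rng.choice(len(probs), p=probs)
```

Temperature-last just means the temperature scaling is applied after truncation filters like this one rather than before them.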