this post was submitted on 21 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

Is this accurate?

top 22 comments
[–] ModeradorDoFariaLima@alien.top 1 points 11 months ago (1 children)

Too bad Windows support for it was lacking, at least last time I checked. It needs a separate component to work properly, and that component was Linux-only.

[–] ViennaFox@alien.top 1 points 11 months ago

It works fine for me. I am also using a 3090 and text-gen-webui like Liquiddandruff.

[–] tgredditfc@alien.top 1 points 11 months ago (1 children)

In my experience it’s the fastest and llama.cpp is the slowest.

[–] randomfoo2@alien.top 1 points 11 months ago (1 children)

I think ExLlama (and ExLlamaV2) is great. EXL2's ability to quantize to arbitrary bpw and its incredibly fast prefill processing generally make it the best real-world choice for modern consumer GPUs. However, from testing on my workstations (5950X CPU and 3090/4090 GPUs), llama.cpp actually edges out ExLlamaV2 for inference speed (with a q4_0 beating out a 3.0bpw, even), so I don't think it's quite so cut and dried.

For those looking for max batch=1 performance, I'd highly recommend running your own benchmarks at home on your own system to see what works best (and pay attention to prefill speeds if you often use long contexts)!

My benchmarks from a month or two ago: https://docs.google.com/spreadsheets/d/1kT4or6b0Fedd-W_jMwYpb63e1ZR3aePczz3zlbJW-Y4/edit#gid=1788227831
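
If you want to collect your own batch=1 numbers, a minimal timing sketch with the exllamav2 Python API looks roughly like this. The model path and settings are placeholders, and the API may have shifted since these releases, so treat it as a starting point rather than a canonical benchmark:

```python
# Rough batch=1 throughput check with exllamav2 (model path is a placeholder).
import time
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/my-7b-exl2-4.0bpw"   # placeholder path
config.prepare()

model = ExLlamaV2(config)
model.load()                                      # optionally pass a gpu_split list here
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

prompt = "Once upon a time"
new_tokens = 256

generator.warmup()
start = time.perf_counter()
output = generator.generate_simple(prompt, settings, new_tokens)
elapsed = time.perf_counter() - start

print(output)
# Note: this lumps prompt processing and generation into one number.
print(f"{new_tokens / elapsed:.1f} tokens/s (prompt + generation)")
```

Timing prompt ingestion separately would give a cleaner picture of prefill speed, which matters a lot at long context.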

[–] tgredditfc@alien.top 1 points 11 months ago

Thanks for sharing! I have been struggling with the llama.cpp loader and GGUF (using oobabooga and the same LLM model). No matter how I set the parameters and how many layers I offload to the GPUs, llama.cpp is way slower than ExLlama (v1 and v2): not just a bit slower, but an order of magnitude slower. I really don't know why.

[–] CardAnarchist@alien.top 1 points 11 months ago

Can you offload layers with this like GGUF?

I don't have much VRAM / RAM so even when running a 7B I have to partially offload layers.
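
For context, this is the GGUF-style partial offload I mean, e.g. with llama-cpp-python (the path, layer count, and context size below are just illustrative):

```python
# GGUF partial offload: keep some layers in VRAM, leave the rest on CPU RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/my-7b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=20,   # number of layers offloaded to the GPU
    n_ctx=4096,        # context window
)

out = llm("Q: What is the capital of France?\nA:", max_tokens=32)
print(out["choices"][0]["text"])
```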

[–] llama_in_sunglasses@alien.top 1 points 11 months ago

I've tested pretty much all of the available quantization methods and I prefer exllamav2 for everything I run on GPU, it's fast and gives high quality results. If anyone wants to experiment with some different calibration parquets, I've taken a portion of the PIPPA data and converted it into various prompt formats, along with a portion of the synthia instruction/response pairs that I've also converted into different prompt formats. I've only tested them on OpenHermes, but they did make coherent models that all produce different generation output from the same prompt.

https://desync.xyz/calsets.html
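
To give an idea of the kind of conversion involved, here's a rough sketch (not my actual script): pull instruction/response pairs out of a parquet, rewrite them into a prompt format such as ChatML, and write them back out as a single text column to use as a calibration set. Column and file names here are made up.

```python
# Sketch: rewrite instruction/response pairs into a chosen prompt format and save
# the result as a parquet calibration set. Column names are hypothetical.
import pandas as pd

df = pd.read_parquet("synthia_subset.parquet")   # assumed columns: "instruction", "response"

def to_chatml(row):
    return (
        "<|im_start|>user\n" + row["instruction"] + "<|im_end|>\n"
        "<|im_start|>assistant\n" + row["response"] + "<|im_end|>\n"
    )

calib = pd.DataFrame({"text": df.apply(to_chatml, axis=1)})
calib.to_parquet("calibration_chatml.parquet", index=False)
```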

[–] JoseConseco_@alien.top 1 points 11 months ago

So how much VRAM would be required for a 34B model, or a 14B model? I assume no CPU offloading, right? With my 12 GB of VRAM, I guess I could only fit a 14-billion-parameter model, maybe not even that.
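
My rough mental math, assuming weight memory is about parameter count times bits per weight divided by 8, ignoring KV cache and runtime overhead:

```python
# Back-of-envelope for weight memory at a given EXL2 bitrate
# (ignores KV cache and overhead, so treat it as a lower bound).
def weight_gb(n_params_billion: float, bpw: float) -> float:
    return n_params_billion * 1e9 * bpw / 8 / 1e9  # bits -> bytes -> GB

for size in (7, 13, 34):
    for bpw in (3.0, 4.0, 5.0):
        print(f"{size}B @ {bpw} bpw ≈ {weight_gb(size, bpw):.1f} GB for weights")
```

By that estimate a 13-14B model at ~4 bpw needs roughly 7 GB just for weights, while a 34B needs around 13 GB even at 3 bpw, so 12 GB of VRAM does look like ~13B territory once cache and overhead are added.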

[–] CasimirsBlake@alien.top 1 points 11 months ago

No chance of running this on P40s any time soon?

[–] kpodkanowicz@alien.top 1 points 11 months ago

It's not just great. It's a piece of art.

[–] beezbos_trip@alien.top 1 points 11 months ago (1 children)

Does it run on Apple Silicon?

[–] intellidumb@alien.top 1 points 11 months ago

Based on the releases, doesn’t look like it. https://github.com/turboderp/exllamav2/releases

[–] SomeOddCodeGuy@alien.top 1 points 11 months ago

I wish there was support for Metal with ExLlamaV2. :(

[–] mlabonne@alien.top 1 points 11 months ago (1 children)

I'm the author of this article, thank you for posting it! If you don't want to use Medium, here's the link to the article on my blog: https://mlabonne.github.io/blog/posts/ExLlamaV2_The_Fastest_Library_to_Run%C2%A0LLMs.html

[–] ReturningTarzan@alien.top 1 points 11 months ago (1 children)

I'm a little surprised by the mention of chatcode.py, which was merged into chat.py almost two months ago. Also, it doesn't really require flash-attn-2 to run "properly"; it just runs a little better that way. It's perfectly usable without it.

Great article, though. Thanks. :)

[–] mlabonne@alien.top 1 points 11 months ago

Thanks for your excellent library! That makes sense, because I started writing this article about two months ago (chatcode.py is still mentioned in the README.md, by the way). I had very low throughput using ExLlamaV2 without flash-attn-2. Do you know if that's still the case? I've updated these two points, thanks for your feedback.

[–] a_beautiful_rhind@alien.top 1 points 11 months ago

Hey he finally gets some recognition.

[–] MonkeyMaster64@alien.top 1 points 11 months ago

Is this able to use CPU (similar to llama.cpp)?

[–] Darius510@alien.top 1 points 11 months ago (1 children)

God, I can't wait until we're past the command-line era of this stuff.

[–] fallingdowndizzyvr@alien.top 1 points 11 months ago

I'm the opposite. I shun everything LLM that isn't command line when I can. Everything has its place. When dealing with media, a GUI is the way to go, but when dealing with text, the command line is fine. I don't need animated pop-up bubbles.

[–] lxe@alien.top 1 points 11 months ago (1 children)

Agreed. Best performance running GPTQs. Missing the HF samplers, but that's OK.

[–] ReturningTarzan@alien.top 1 points 11 months ago

I recently added Mirostat, min-P (the new one), tail-free sampling, and temperature-last as an option. I don't personally put much stock in having an overabundance of sampling parameters, but they are there now for better or worse. So for the exllamav2 (non-HF) loader in TGW, it can't be long before there's an update to expose those parameters in the UI.
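
For anyone unfamiliar with min-P, the rule is simple: keep only the tokens whose probability is at least some fraction of the most likely token's probability, then renormalize. A toy illustration of the rule (not the actual exllamav2 implementation):

```python
# Min-P filtering: keep tokens whose probability is >= min_p * (top token's probability).
import numpy as np

def min_p_filter(logits: np.ndarray, min_p: float = 0.05) -> np.ndarray:
    probs = np.exp(logits - logits.max())      # softmax, numerically stable
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()        # threshold relative to the top token
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()           # renormalize over surviving tokens

rng = np.random.default_rng(0)
logits = rng.normal(size=32000)                # stand-in for a model's output logits
probs = min_p_filter(logits, min_p=0.1)
next_token = rng.choice(len(probs), p=probs)
```

Temperature-last just means the temperature scaling is applied after truncation filters like this one rather than before them.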