this post was submitted on 21 Nov 2023
LocalLLaMA
Community to discuss Llama, the family of large language models created by Meta AI.
you are viewing a single comment's thread
I'm the author of this article, thank you for posting it! If you don't want to use Medium, here's the link to the article on my blog: https://mlabonne.github.io/blog/posts/ExLlamaV2_The_Fastest_Library_to_Run%C2%A0LLMs.html
I'm a little surprised by the mention of chatcode.py, which was merged into chat.py almost two months ago. Also, it doesn't really require flash-attn-2 to run "properly"; it just runs a little better that way. It's perfectly usable without it. Great article, though. Thanks. :)
Thanks for your excellent library! That makes sense, because I started writing this article about two months ago (chatcode.py is still mentioned in the README.md, by the way). I was getting very low throughput using ExLlamaV2 without flash-attn-2. Do you know if that's still the case? I've updated these two points, thanks for your feedback.