LocalLLaMA

A community for discussing Llama, the family of large language models created by Meta AI.

ExLlamaV2: The Fastest Library to Run LLMs (towardsdatascience.com)

submitted 11 months ago by alchemist1e9@alien.top to c/localllama@poweruser.forum

Is this accurate?

[–] mlabonne@alien.top 1 points 11 months ago (2 children)

I'm the author of this article, thank you for posting it! If you don't want to use Medium, here's the link to the article on my blog: https://mlabonne.github.io/blog/posts/ExLlamaV2_The_Fastest_Library_to_Run%C2%A0LLMs.html

[–] ReturningTarzan@alien.top 1 points 11 months ago (1 children)

I'm a little surprised by the mention of chatcode.py, which was merged into chat.py almost two months ago. Also, it doesn't really require flash-attn-2 to run "properly"; it just runs a little better that way, and it's perfectly usable without it.

Great article, though. Thanks. :)

[–] mlabonne@alien.top 1 points 11 months ago

Thanks for your excellent library! That makes sense, since I started writing this article about two months ago (chatcode.py is still mentioned in the README.md, by the way). I had very low throughput using ExLlamaV2 without flash-attn-2. Do you know if that's still the case? I updated these two points, thanks for your feedback.