this post was submitted on 30 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


I want to use ExLlama models because they let me run the Llama 70B version on my two RTX 4090s. I got it working pretty easily via text-generation-webui, and inference is really fast! So far so good...

However, I need the model in Python to do some large-scale analyses. I cannot seem to find any guide/tutorial explaining how to use ExLlama in the usual Python/Hugging Face setup.

Is this just not possible? If it is possible, can someone point me to some example code in which ExLlama is used in Python?

Much appreciated!

top 5 comments
[–] ReturningTarzan@alien.top 1 points 11 months ago (2 children)

There are a bunch of examples in the repo: various Python scripts for doing inference and such, and even a Colab notebook now.

As for the "usual" Python/HF setup, ExLlama is kind of an attempt to get away from Hugging Face. It reads HF models but doesn't rely on the framework. I've been meaning to write more documentation and maybe even a tutorial, but in the meantime there are those examples, the project itself, and a lot of other projects using it. TabbyAPI is coming along as a stand-alone OpenAI-compatible server to use with SillyTavern and in your own projects where you just want to generate completions from text-based requests, and ExUI is a standalone web UI for ExLlamaV2.
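For reference, here is a minimal sketch of what loading a model with ExLlamaV2 in plain Python can look like, modeled on the repo's inference example. The model path and sampler settings below are placeholders, and class names may differ slightly between versions:

```python
# Minimal ExLlamaV2 inference sketch (paths and sampler settings are placeholders).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "models/Llama-2-70B-exl2"   # folder containing the downloaded HF files
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)       # allocate the cache lazily...
model.load_autosplit(cache)                    # ...and split the weights across available GPUs

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("The llama is", settings, num_tokens=128))
```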

[–] turamura@alien.top 1 points 11 months ago (1 children)

Hi, thanks for your comment!

I saw, e.g., the "inference.py" script in the repo, which I think I could use. It actually looks kind of simple. However, I am struggling with what to provide as the "model directory". Should I just download a Hugging Face model (for example, I would like to work with TheBloke/Llama-2-70B-GPTQ) and then specify that as the model directory? Or what kind of structure does ExLlama expect as the model directory?

[–] ReturningTarzan@alien.top 1 points 11 months ago

Yes, the model directory is just all the files from an HF model, in one folder. You can download them directly from the "files" tab of an HF model by clicking all the little download arrows, or there's huggingface-cli. Git can also be used to clone models if you've got git-lfs installed.

It specifically needs the following files:

  • config.json
  • *.safetensors
  • tokenizer.model (preferred) or tokenizer.json
  • added_tokens.json (if the model has one)

But it may use other files in the future, such as tokenizer_config.json, so it's best just to download all the files and keep them in one folder.
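If you'd rather script the download than click through the files tab, here's a small sketch using huggingface_hub's snapshot_download. The repo ID is the one from the question above; the local folder is a placeholder:

```python
# Pull every file of a HF repo into one folder; pass that folder to ExLlama
# as the model directory. Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="TheBloke/Llama-2-70B-GPTQ",   # repo mentioned above
    local_dir="models/Llama-2-70B-GPTQ",   # placeholder local folder
)
print(model_dir)  # this path is what you pass as the ExLlama model directory
```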

[–] turamura@alien.top 1 points 11 months ago

Got it to work! Thank you!!

[–] Murky-Ladder8684@alien.top 1 points 11 months ago

Check out turbo's project https://github.com/turboderp/exui

He put it up not long ago, and he has speculative decoding working in it. I tried it with Goliath 120B 4.85bpw exl2 and was getting 11-13 t/s vs. 6-8 t/s without it. It's barebones but works.