ReturningTarzan

joined 10 months ago
[–] ReturningTarzan@alien.top 1 points 9 months ago

Yes, the model directory is just all the files from a HF model, in one folder. You can download them directly from the "files" tab of a HF model by clicking all the little download arrows, or with huggingface-cli. Git can also be used to clone models if you've got git-lfs installed.

It specifically needs the following files:

  • config.json
  • *.safetensors
  • tokenizer.model (preferred) or tokenizer.json
  • added_tokens.json (if the model has one)

But it may use other files in the future, such as tokenizer_config.json, so it's best to just download all the files and keep them in one folder.
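If you'd rather script it than click through the files tab, something like this grabs everything in one go (a rough sketch using huggingface_hub's snapshot_download; the repo id and local folder are just placeholders):

    # Sketch: download every file from a model repo into one local folder.
    # The repo id and target folder below are placeholders.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="mistralai/Mistral-7B-Instruct-v0.1",  # any HF model repo
        local_dir="models/Mistral-7B-Instruct-v0.1",   # folder to point ExLlamaV2 at
    )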

[–] ReturningTarzan@alien.top 1 points 9 months ago (3 children)

There are a bunch of examples in the repo: various Python scripts for doing inference and such, and even a Colab notebook now.

As for the "usual" Python/HF setup, ExLlama is kind of an attempt to get away from Hugging Face. It reads HF models but doesn't rely on the framework. I've been meaning to write more documentation and maybe even a tutorial, but in the meantime there are those examples, the project itself, and a lot of other projects using it. TabbyAPI is coming along as a standalone OpenAI-compatible server to use with SillyTavern, or in your own projects where you just want to generate completions from text-based requests, and ExUI is a standalone web UI for ExLlamaV2.
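Since it's OpenAI-compatible, generating from TabbyAPI in your own code looks roughly like this (a sketch only; the host, port and API key are placeholders, so check your TabbyAPI config for the real values):

    # Sketch of a text-completion request against an OpenAI-compatible server
    # such as TabbyAPI. Host, port and API key are placeholders.
    import requests

    resp = requests.post(
        "http://localhost:5000/v1/completions",
        headers={"Authorization": "Bearer your-api-key"},
        json={
            "prompt": "Once upon a time",
            "max_tokens": 128,
            "temperature": 0.8,
        },
    )
    print(resp.json()["choices"][0]["text"])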

[–] ReturningTarzan@alien.top 1 points 9 months ago

It's not. If the only thing you're using the P40 for is as swap space for the 3090, then you're better off just using system RAM, since you'll have to swap via system RAM anyway.

[–] ReturningTarzan@alien.top 1 points 10 months ago

Most of those security issues are just silly. Like, oh no, what if the model answers a question with some "dangerous" knowledge that's already in the top three search results if you Google the exact same question? Whatever will we do?

The others arise from inserting an LLM across what should be a security boundary, e.g. by giving it access to personal documents while also exposing an interface to people who shouldn't have that access. So a new, poorly understood technology provides novel ways for people to make bad assumptions in their rush to monetize it. News at 11.

Of course it's still a great segment and easily the most interesting part of the video.

[–] ReturningTarzan@alien.top 1 points 10 months ago

To add to that: GPUs do support "conditional" matrix multiplication; they just don't benefit from that type of optimization. Essentially, it takes as much time to skip a computation as it does to perform it, and in practice it can even take longer, since the extra logic required to keep track of which computations to skip adds overhead.

In order for this to make sense on a GPU you need a way of completely sidestepping portions of the model, like the ability to skip whole layers that are not relevant (a bit like how MoE already works). If you have to load a weight from memory, or some sort of metadata to figure out what each individual weight is connected to, you've already spent as many resources on that weight as you would have by simply using it in a streamlined matrix multiplication.

The same holds, to a lesser extent, for efficient CPU implementations, which also rely on SIMD computation, regular memory layouts and predictable control flow.
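To illustrate with a toy numpy example (nothing to do with ExLlama's actual kernels): "skipping" individual weights with a mask still loads and multiplies every element, so it costs at least as much as the plain dense multiply.

    # Toy illustration: masking out weights saves nothing on SIMD hardware,
    # since every element is still loaded and multiplied.
    import numpy as np

    w = np.random.randn(4096, 4096).astype(np.float32)
    x = np.random.randn(4096).astype(np.float32)
    mask = np.abs(w) > 0.5        # pretend these are the "relevant" weights

    y_dense  = w @ x              # streamlined dense multiply
    y_masked = (w * mask) @ x     # "conditional" multiply: same memory traffic,
                                  # same FLOPs, plus the masking overhead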

[–] ReturningTarzan@alien.top 1 points 10 months ago

I recently added Mirostat, min-P (the new one), tail-free sampling, and temperature-last as an option. I don't personally put much stock in having an overabundance of sampling parameters, but they are there now for better or worse. So for the exllamav2 (non-HF) loader in TGW, it can't be long before there's an update to expose those parameters in the UI.
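Roughly speaking, min-P with temperature-last boils down to something like this (a simplified sketch of the idea for a single logits vector, not the actual exllamav2 sampling code):

    # Simplified sketch of min-P filtering with temperature applied last,
    # for a single 1-D logits vector. Not the actual exllamav2 code.
    import torch

    def sample(logits, min_p=0.1, temperature=0.8):
        probs = torch.softmax(logits, dim=-1)
        # min-P: drop tokens whose probability is below min_p * p(top token)
        keep = probs >= min_p * probs.max()
        logits = logits.masked_fill(~keep, float("-inf"))
        # temperature-last: scale only after the filtering step
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)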

[–] ReturningTarzan@alien.top 1 points 10 months ago (1 children)

I'm a little surprised by the mention of chatcode.py, which was merged into chat.py almost two months ago. Also, it doesn't really require flash-attn-2 to run "properly"; it just runs a little better that way. But it's perfectly usable without it.

Great article, though. Thanks. :)

[–] ReturningTarzan@alien.top 1 points 10 months ago

Notepad mode is up fwiw. It probably needs more features, but it's functional.

[–] ReturningTarzan@alien.top 1 points 10 months ago (1 children)

Well, it depends on the model and stuff, and how you get to that 50k+ context. If it's a single prompt, as in "Please summarize this novel: ...", that's going to take however long it takes. But if the model's context length is 8k, say, then ExUI is only ever going to do prompt processing on up to 8k tokens, and it will maintain a pointer that advances in steps (the configurable "chunk size").

So when you reach the end of the model's native context, it skips ahead by e.g. 512 tokens, and you'll only get a full context ingestion again after another 512 tokens of added context. Even so, you should never see over a minute of processing time on a 3090. I don't know of a model that both fits on a 3090 and takes that much time to inference on, unless you're running into the NVIDIA swapping "feature" because the model doesn't actually fit on the GPU.
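The chunked advance is basically this (a sketch of the general idea, not the actual ExUI code; names and defaults are made up):

    # Sketch of the chunked context advance described above; names and
    # defaults are made up, not the actual ExUI implementation.
    def trim_context(context_ids, max_seq_len=8192, chunk_size=512):
        if len(context_ids) <= max_seq_len:
            return context_ids, False   # cache still valid, nothing to re-ingest
        # Skip ahead by one chunk so the next ~chunk_size tokens fit without
        # another full re-ingestion of the prompt.
        keep_from = len(context_ids) - (max_seq_len - chunk_size)
        return context_ids[keep_from:], True   # True = re-ingest trimmed prompt once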

[–] ReturningTarzan@alien.top 1 points 10 months ago (3 children)

Notepad mode is almost ready. I'll probably release it later today or early tomorrow.

[–] ReturningTarzan@alien.top 1 points 10 months ago

I'm working on a notepad mode for ExUI. It's not quite ready, but it should be up sometime tomorrow.

[–] ReturningTarzan@alien.top 1 points 10 months ago

When you're using non-instruct models for instruct-type questions, prompting is everything. For comparison, here are the first three questions put to Mistral-7B-instruct with the correct prompt format at various bitrates up to FP16.
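(For reference, the Mistral instruct format just wraps each user turn in [INST] tags, roughly as in the sketch below; check the model card for the exact template, including BOS handling.)

    # Rough sketch of the Mistral-7B-Instruct prompt format; check the model
    # card for the exact template and how the BOS token is handled.
    def format_mistral_instruct(question):
        return f"<s>[INST] {question} [/INST]"

    prompt = format_mistral_instruct("What is the airspeed velocity of an unladen swallow?")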
