aikitoria

[–] aikitoria@alien.top 1 points 11 months ago

After some experimenting, I think I found a good way. The trick is to start any new chat with the context set to 4k so the model is at maximum quality, and then, once that fills up, reload the model with the context set to 16k. That seems to give it enough data to keep generating very good responses.

It will drop the leading asterisk on responses for some inexplicable reason, but this is easily fixed by adding it to the response prefix in SillyTavern.
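
In case it helps anyone copy the workflow, here's a rough sketch of the logic as backend-agnostic Python; load_model() is a hypothetical stand-in for however your backend reloads the weights with a given max sequence length, not a real API.

    # Start new chats at the native 4k context for best quality, and only
    # reload with the scaled 16k context once the history actually fills up.
    NATIVE_CTX = 4096      # Goliath inherits Llama-2's native 4k window
    EXTENDED_CTX = 16384   # what I reload with once the chat outgrows 4k

    def pick_context(tokens_in_chat: int) -> int:
        return NATIVE_CTX if tokens_in_chat < NATIVE_CTX else EXTENDED_CTX

    # model = load_model(context_len=pick_context(len(chat_tokens)))  # hypothetical helper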

[–] aikitoria@alien.top 1 points 11 months ago (2 children)

You convinced me to finally try Goliath! I don't have the ability to run it locally, so I rented a cloud GPU just for this. With 80GB of VRAM, it fits the largest EXL2 quant currently available.
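
(For anyone wondering why it fits, a back-of-envelope estimate; the parameter count and bits-per-weight below are my assumptions, not exact figures.)

    # Rough VRAM estimate for the weights alone; the KV cache and activations
    # come on top, but at 4k context that's only a few extra GB.
    params = 118e9   # assumed parameter count for Goliath-120B
    bpw = 4.85       # assumed bits per weight of the largest EXL2 quant
    weight_gb = params * bpw / 8 / 1e9
    print(f"~{weight_gb:.0f} GB for weights")  # ~72 GB, which squeezes into 80 GB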

Verdict: it's absolutely incredible compared to the small models!!! Finally, I'm not constantly swiping responses after it produces nonsense; instead it generates great responses on the first try every time, doesn't talk as me, and doesn't constantly confuse characters and logic!

But the tiny 4096 context is very limiting; I hit it very quickly with my conversation. I tried scaling up the context size in the parameters, but this made it perform noticeably worse: no longer generating multiple paragraphs, dropping formatting symbols, and so on.

Is that the expected result? There's no magic way to run these models with huge contexts yet, right?
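
For reference, my understanding of what the context parameter does here: exllamav2 / text-generation-webui expose an NTK-aware RoPE "alpha" that stretches the rotary base instead of interpolating positions. A sketch, assuming Llama-2's head dimension of 128 and base of 10000 (values the backend normally fills in):

    # NTK-aware RoPE scaling: multiply the rotary base by alpha^(d/(d-2)).
    # Without a long-context fine-tune, quality still degrades past the
    # native window, which matches what I'm seeing.
    def scaled_rope_base(alpha: float, head_dim: int = 128, base: float = 10000.0) -> float:
        return base * alpha ** (head_dim / (head_dim - 2))

    print(scaled_rope_base(alpha=4.0))  # ~40.9k base for a ~4x (16k) attempt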

[–] aikitoria@alien.top 1 points 11 months ago

Hmm, I didn't notice a major quality loss when I swapped from mistral-7b-openorca.Q8_0.gguf (running in koboldcpp) to Mistral-7B-OpenOrca-8.0bpw-h6-exl2 (running in text-gen-webui). Maybe I should try again. Are you sure you were using comparable sampling settings for both? I noticed, for example, that SillyTavern has entirely different presets per backend.
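
To rule the samplers out, these are the kinds of settings I'd pin to identical values on both backends before comparing; the numbers are placeholders rather than a recommended preset.

    # Illustrative sampler settings to keep identical across koboldcpp and
    # text-gen-webui when comparing the two quant formats (placeholder values).
    sampler = {
        "temperature": 0.7,
        "top_p": 0.9,
        "top_k": 40,
        "repetition_penalty": 1.1,
    }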

I still need to try the new NeuralChat myself as well; I was just going to go for the exl2, so this could be a good tip!

[–] aikitoria@alien.top 1 points 11 months ago (2 children)

Is it not possible to port ExLlamaV2 to Metal? At least on a 4090, it's much (much) faster at processing the input than llama.cpp.

[–] aikitoria@alien.top 1 points 11 months ago

Is there any such benchmark that includes both the 4090/A100 and a Mac with an M2 Ultra / M3 Max? I've searched quite a bit but didn't find anyone comparing them on similar setups. It seems very interesting due to the large unified memory.