anti-lucas-throwaway

joined 1 year ago
[–] anti-lucas-throwaway@alien.top 1 points 11 months ago

Others can help you with the LLM part of this, but I'm mainly curious whether your plan is worth it. You do know that converting an entire book into a summary is pretty much worthless if you don't write the summary yourself?

So I did some research, and after a while in the rabbit hole I think sliding window attention is not implemented in ExLlama (or v2) yet, and it is not in the AMD ROCm fork of Flash Attention yet either.

I think that means it's simply unsupported right now. Very unfortunate, but I guess I'll have to wait. Waiting for support is the price I pay for saving 900 euros by buying a 7900 XTX instead of a 4090. I'm fine with that.

[–] anti-lucas-throwaway@alien.top 1 points 1 year ago (1 children)

Very interesting. Well, in hindsight I should've noticed it: performance did decrease after 8k tokens and became completely unusable after 10k. I'm actually pretty disappointed to still know nothing. No one documents what does and doesn't work, or when, or how. I can barely find anything about SWA (I know what it is in essence), but no one explains how it actually works, whether and where you can set the window size, or whether it's available in Ooba's app.
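
Since I can't find it documented anywhere, here's my own understanding of the mechanism as a minimal sketch (an illustration of the idea, not how any particular loader implements it): each token attends only to the previous `window` tokens instead of the whole prefix.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean attention mask: query position i may attend to key
    positions j with i - window < j <= i (causal + sliding window)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (column)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (row)
    return (j <= i) & (j > i - window)

# With Mistral's 4096-token window, token 10000 directly "sees" only
# tokens 5905..10000; older context is reachable only indirectly,
# through the stacked layers' overlapping windows.
print(sliding_window_mask(seq_len=8, window=3).int())
```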

And then there is the problem that I don't know if it's supported on AMD cards, like you said. Try looking it up: searching for sliding window attention on Google just gives endless pages of "tutorials" and "guides" that don't explain anything, and combining it with ROCm just gives random results that don't lead anywhere useful.
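
For what it's worth, upstream Flash Attention added sliding windows in v2.3 as a `window_size` argument on `flash_attn_func`, so one way to see whether a given build (including a ROCm fork) has it is to inspect the signature. A small check, assuming flash-attn is importable at all:

```python
import inspect

try:
    from flash_attn import flash_attn_func
    # window_size was added upstream in flash-attn 2.3; forks may lag.
    if "window_size" in inspect.signature(flash_attn_func).parameters:
        print("This flash-attn build exposes sliding window attention.")
    else:
        print("flash-attn is installed, but has no window_size argument.")
except ImportError:
    print("flash-attn is not installed in this environment.")
```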

 

Hi, I have searched for a long time on this subreddit, in Ooba's documentation, in Mistral's documentation, and everywhere else, but I just can't find what I am looking for.

I see everyone claiming Mistral can handle up to a 32k context. Technically it won't refuse to generate above roughly 8k, but the output is just not good. I have it loaded in Oobabooga's text-generation-webui and am using the API through SillyTavern. I loaded the plain Mistral 7B just to check: with my current 12k-token story, all it generates is gibberish if I give it the full context. I also checked with other fine-tunes of Mistral.
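
For reference, the model's own config reports both numbers, and if I read it correctly the gap between them is suspicious: a 32k position limit but only a 4096-token sliding window, so a loader that ignores the window asks the model to attend much further back than it was trained to. Both fields are easy to read with transformers (assuming access to the standard Hub repo):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
print(cfg.max_position_embeddings)  # 32768: the advertised context size
print(cfg.sliding_window)           # 4096: how far back each layer attends
```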

What am I doing wrong? I am using the GPTQ version on my RX 7900 XTX. Is the 32k figure just advertising that it won't crash before then, or am I doing something wrong that keeps the output incoherent above 8k? I did mess with the alpha value, and while that does eliminate the gibberish, I get the impression the quality suffers somehow.
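
As far as I understand it, the alpha value applies NTK-aware RoPE scaling: it raises the rotary base so the same rotation budget is spread over more positions, trading position resolution for range, which would fit "no more gibberish, but worse quality". A rough sketch of the idea (loaders may use a slightly different formula):

```python
import torch

def rope_inverse_frequencies(head_dim: int, base: float = 10000.0,
                             alpha: float = 1.0) -> torch.Tensor:
    """Per-dimension RoPE rotation frequencies with NTK 'alpha' scaling."""
    # Common NTK-aware rule (what ExLlama-style alpha_value seems to do):
    # scale the base, not the positions.
    base = base * alpha ** (head_dim / (head_dim - 2))
    dims = torch.arange(0, head_dim, 2, dtype=torch.float32)
    return 1.0 / (base ** (dims / head_dim))

# alpha=1.0 is the model as trained; ~2.6 is often suggested for roughly
# doubled context. Every frequency shrinks, so position resolution gets
# coarser everywhere, which is plausibly the quality drop I noticed.
print(rope_inverse_frequencies(128)[:4])
print(rope_inverse_frequencies(128, alpha=2.6)[:4])
```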