Hi, I have searched for a long time on this subreddit, in Ooba's documentation, in Mistral's documentation, and elsewhere, but I just can't find what I am looking for.
I see everyone claiming Mistral can handle up to a 32k context size. While it technically won't refuse to generate anything above roughly 8k tokens, the output is just not good. I have it loaded in Oobabooga's text-generation-webui and am using the API through SillyTavern. I loaded the plain Mistral 7B just to check, but with my current 12k-token story, all it generates is gibberish if I give it the full context. I also get the same result with other fine-tunes of Mistral.
What am I doing wrong? I am using the GPTQ version on my RX 7900 XTX. Is the 32k figure just saying it won't crash before that length, or is something on my end preventing coherent output above 8k? I did mess with the alpha value, and while raising it does eliminate the gibberish, I get the impression the quality suffers somehow (a rough sketch of what I think alpha is doing is below).
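For anyone curious, my understanding is that the "alpha" slider in the ExLlama loaders applies NTK-style RoPE scaling: it raises the rotary embedding base so longer contexts fit, at some cost in quality. This is only a minimal sketch of the usual formula as I understand it, not Ooba's actual code, and the function name is mine:

```python
# Rough illustration of NTK-aware RoPE "alpha" scaling (assumed formula,
# not Ooba's internals). Mistral 7B uses a rotary base of 10000 and a
# head dimension of 128.
def scaled_rope_base(base: float, alpha: float, head_dim: int) -> float:
    """Raise the rotary base so positions are 'stretched' over a longer context."""
    return base * alpha ** (head_dim / (head_dim - 2))

for alpha in (1.0, 2.0, 4.0):
    print(f"alpha={alpha}: base={scaled_rope_base(10000.0, alpha, 128):.1f}")
```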
So I did some research, and after a while down the rabbit hole I think sliding window attention is not implemented in ExLlama (or ExLlamaV2) yet, and it is not in the AMD ROCm fork of Flash Attention yet either (a toy sketch of what sliding window attention does is below).
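As far as I can tell, Mistral's long-context claim relies on sliding window attention, where each token only attends to the last few thousand tokens (the config uses a 4096 window) instead of the whole sequence. If the backend doesn't implement that mask, I assume it just runs full causal attention, which would explain things falling apart past the trained range. Here is a toy sketch of the masking idea, with a tiny window so the output is readable (my own illustration, not ExLlama or Mistral code):

```python
# Toy sliding-window attention mask: window of 4 tokens here,
# versus 4096 in Mistral's config.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """True where query position i may attend to key position j."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    causal = j <= i                  # no attending to future tokens
    in_window = (i - j) < window     # only the most recent `window` tokens
    return causal & in_window

print(sliding_window_mask(6, 4).int())
```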
I think that means it's just unsupported right now. Very unfortunate, but I guess I'll have to wait. Waiting for support is the price I pay for saving 900 euros by buying a 7900 XTX instead of a 4090. I'm fine with that.