Hi, I have searched for a long time on this subreddit, in Ooba's documentation, Mistral's documentation and everything, but I just can't find what I am looking for.
I see everyone claiming Mistral can handle up to 32k context size, however while it technically won't refuse to generate anything above about 8k, the output is just not good. I have it loaded in Oobabooga's text-generation-webui and am using the API through SillyTavern. I loaded the normal Mistral 7B just to check, but with my current 12k story, all it can generate is gibberish if I give it the full context. I also checked using other fine-tunes of Mistral.
What am I doing wrong? I am using the GPTQ version on my RX 7900 XTX. Is it just advertising that it won't crash until 32k or something, or am I doing something wrong for not getting coherent output above 8k? I did mess with the alpha values, and while doing so does eliminate the gibberish, I do get the idea that the quality does suffer somehow.
I've been playing around with this. The standard model uses a rope freq base of 10,000. According to the Mistral AI info, the model was trained on 8K tokens, and at that freq base it can handle slightly more than that before producing garbage (at roughly 9K tokens).
However, when I use a rope freq base of 45000 I can have reasonable conversations, at least with some of the Mistral models, up to more than 25k tokens. Not all of the Mistral models are still very coherent at 25K tokens, but the dolphin 2.1 model and some others still work quite well. For the details see: here
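Since the original question mentions alpha values and I am talking about rope freq base, here is a minimal sketch of how the two relate. It assumes the NTK-aware relation freq_base = 10000 * alpha ^ (d / (d - 2)) with a head dimension d of 128 (which gives the 64/63 exponent that, as far as I know, text-generation-webui documents). The function names are my own and are not part of any library.

```python
# Sketch: convert between an NTK "alpha" value and a rope freq base.
# Assumptions: head_dim = 128 and the relation
#   freq_base = base * alpha ** (head_dim / (head_dim - 2))
# The helper names below are hypothetical, chosen for illustration only.

HEAD_DIM = 128          # assumed per-head dimension
DEFAULT_BASE = 10000.0  # default rope freq base mentioned above


def alpha_to_freq_base(alpha, base=DEFAULT_BASE, head_dim=HEAD_DIM):
    """Return the rope freq base that corresponds to a given alpha value."""
    return base * alpha ** (head_dim / (head_dim - 2))


def freq_base_to_alpha(freq_base, base=DEFAULT_BASE, head_dim=HEAD_DIM):
    """Return the alpha value that corresponds to a given rope freq base."""
    return (freq_base / base) ** ((head_dim - 2) / head_dim)


if __name__ == "__main__":
    # The freq base of 45000 used above, expressed as an alpha value.
    alpha = freq_base_to_alpha(45000)
    print(f"freq base 45000 corresponds to alpha of about {alpha:.2f}")
    # Converting back should reproduce the freq base (up to rounding).
    print(f"round trip: {alpha_to_freq_base(alpha):.1f}")
```

Under those assumptions a freq base of 45000 works out to an alpha somewhere between 4 and 4.5, so setting one or the other in the loader should have a similar effect.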
This is very helpful. The GGUF format is supposed to set the correct RoPE parameters, but this apparently isn't the case for Mistral. This is something to bring up on the llama.cpp GitHub, so that whoever works on RoPE can adjust the Mistral behavior.
Thanks for your reply. In this case I think it's not a bug in llama.cpp but in the parameters of the Mistral models. The original Mistral models have been trained on 8K context size, see Product | Mistral AI | Open source models.
But when I load a Mistral model, or a finetune of a Mistral model, koboldcpp always reports a trained context size of 32768, like this:
llm_load_print_meta: n_ctx_train = 32768
So llama.cpp (or koboldcpp) just assumes that no NTK scaling is needed up to a context size of 32768 and leaves the rope freq base at 10000, which I think is correct behavior given that metadata. I don't know why the model has this n_ctx_train parameter at 32768 instead of 8192, maybe a mistake?
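To illustrate the reasoning, here is a rough sketch of the kind of decision I mean. This is not the actual llama.cpp or koboldcpp code: the function name is made up, and using requested_ctx / n_ctx_train as the alpha value with a head dimension of 128 is a simplifying assumption.

```python
# Sketch of a loader deciding whether to apply NTK scaling.
# NOT the real llama.cpp / koboldcpp logic; pick_rope_freq_base is a
# hypothetical helper and the alpha choice below is a simplification.

def pick_rope_freq_base(requested_ctx, n_ctx_train, base=10000.0, head_dim=128):
    """Keep the default base within the trained context, scale it beyond."""
    if requested_ctx <= n_ctx_train:
        # Within the trained context as reported by the model metadata:
        # no scaling is applied.
        return base
    # Beyond the trained context: raise the base (NTK-aware style scaling).
    alpha = requested_ctx / n_ctx_train
    return base * alpha ** (head_dim / (head_dim - 2))


if __name__ == "__main__":
    # With the reported n_ctx_train of 32768, a 16K context gets no scaling.
    print(pick_rope_freq_base(16384, 32768))
    # If the metadata said 8192 instead, the same request would be scaled.
    print(pick_rope_freq_base(16384, 8192))
```

With n_ctx_train reported as 32768 the base stays at 10000 for a 16K context, while the same request against 8192 would get a base a bit above 20000. That is why the value in the model metadata matters so much here.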