LocalLLaMA

I don't understand Mistral and context size, honestly. (alien.top)

submitted 2 years ago by anti-lucas-throwaway@alien.top to c/localllama@poweruser.forum

Hi, I have searched for a long time on this subreddit, in Ooba's documentation, Mistral's documentation and everything, but I just can't find what I am looking for.

I see everyone claiming Mistral can handle up to 32k context. While it technically won't refuse to generate anything above roughly 8k, the output is just not good. I have it loaded in Oobabooga's text-generation-webui and am using the API through SillyTavern. I loaded the normal Mistral 7B just to check, but with my current 12k story, all it generates is gibberish if I give it the full context. I also checked other fine-tunes of Mistral.

What am I doing wrong? I am using the GPTQ version on my RX 7900 XTX. Is the 32k figure just advertising that it won't crash until 32k, or am I doing something wrong that keeps me from getting coherent output above 8k? I did experiment with the alpha values, and while that does eliminate the gibberish, I get the impression that the quality suffers somewhat.
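For reference, the alpha value in text-generation-webui is usually described as NTK-aware RoPE scaling: it raises the rotary frequency base instead of compressing positions. Below is a minimal sketch of that relationship, assuming the commonly cited formula base' = base * alpha^(d / (d - 2)) and a head dimension of 128 (which matches Mistral 7B). The function name `ntk_rope_base` is hypothetical and not part of any library.

```python
def ntk_rope_base(alpha: float, base: float = 10000.0, head_dim: int = 128) -> float:
    """Return the RoPE frequency base after NTK-aware scaling by `alpha`.

    Assumption: base' = base * alpha ** (head_dim / (head_dim - 2)).
    With alpha = 1.0 the base is unchanged (no scaling).
    """
    return base * alpha ** (head_dim / (head_dim - 2))


# Show how the effective base grows as alpha increases.
for alpha in (1.0, 2.0, 4.0):
    print(f"alpha={alpha}: rope base ~ {ntk_rope_base(alpha):.0f}")
```

A larger base stretches the rotary wavelengths so that longer positions stay closer to what the model saw in training, which may explain why the gibberish disappears while some quality is lost.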

[–] mll59@alien.top 1 points 2 years ago

Thanks for your reply. In this case I think it's not a bug in llama.cpp but in the parameters of the Mistral models. The original Mistral models were trained with an 8K context size; see "Product | Mistral AI | Open source models".

But when I load a Mistral model, or a finetune of a Mistral model, koboldcpp always reports a trained context size of 32768, like this:

llm_load_print_meta: n_ctx_train = 32768

So llama.cpp (or koboldcpp) simply assumes that no NTK scaling is needed for context sizes up to 32768 and leaves the RoPE frequency base at 10000, which I think is correct. I don't know why the model's n_ctx_train parameter is 32768 instead of 8192. Maybe a mistake?
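To illustrate the decision described above, here is a small sketch that reads n_ctx_train from a log line and checks whether a requested context would exceed it. This is only an illustration of the reasoning, not llama.cpp's or koboldcpp's actual code; the helper names `parse_n_ctx_train` and `needs_rope_scaling` are made up for this example.

```python
import re


def parse_n_ctx_train(log_text: str):
    """Extract the n_ctx_train value from loader output, or None if absent."""
    match = re.search(r"n_ctx_train\s*=\s*(\d+)", log_text)
    return int(match.group(1)) if match else None


def needs_rope_scaling(requested_ctx: int, n_ctx_train: int) -> bool:
    """Assumed rule: scaling is only considered beyond the trained context."""
    return requested_ctx > n_ctx_train


log_line = "llm_load_print_meta: n_ctx_train = 32768"
reported = parse_n_ctx_train(log_line)

# With the reported value, a 12k prompt looks "in range", so no scaling is applied.
print(needs_rope_scaling(12288, reported))  # False
# If the metadata said 8192 instead, the same prompt would call for scaling.
print(needs_rope_scaling(12288, 8192))      # True
```

Under that assumption, a 12k prompt gets no automatic scaling as long as the metadata says 32768, so any scaling has to be set by hand.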