[–] mll59@alien.top 1 points 10 months ago

First, thank you for sharing. However, I was a bit puzzled by these finetunes, since many Mistral-based finetunes can already support longer context out of the box by using NTK scaling, see here. Alas, I couldn't find any information in the model cards about what NurtureAI did to extend the context.

I've tested NurtureAI's synthia-7b-v2-16k-q8_0.gguf with koboldcpp v1.49, using the model's native rope configuration (rope freq base of 1000000), in an existing conversation of 14971 tokens, asking it to generate a standup comedy about the preceding conversation, and it produced incoherent babbling. Using the original model synthia-7b-v2.0.Q8_0.gguf (which has a rope freq base of 10000) with --ropeconfig 1.0 45000 instead gives me a coherent standup comedy that makes sense.
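
For reference, this is roughly the command line I mean (flag names as in recent koboldcpp versions; the model path and context size are just examples, adjust them to your own setup):

python koboldcpp.py --model synthia-7b-v2.0.Q8_0.gguf --contextsize 16384 --ropeconfig 1.0 45000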

How well NTK scaling works on Mistral-based finetunes depends on the finetune; for some it works better than for others. For example, when I ask the original zephyr-7b-beta.Q8_0.gguf finetune, in an existing conversation of 25872 tokens, to produce a rhyming poem about the preceding conversation, the resulting poem actually mostly rhymes. Other original finetunes, like synthia-7b-v2.0.Q8_0.gguf, still seem coherent at this context size but are no longer able to produce rhyming poems.

Anyway, based on my experiments, these extended-context models by NurtureAI do not work for me, while simply applying NTK scaling to the original Mistral-based finetunes does.

[–] mll59@alien.top 1 points 10 months ago

Thanks for your reply. In this case I think it's not a bug in llama.cpp but rather in the parameters of the Mistral models. The original Mistral models were trained on an 8K context size, see Product | Mistral AI | Open source models.

But when I load a Mistral model, or a finetune of a Mistral model, koboldcpp always reports a trained context size of 32768, like this:

llm_load_print_meta: n_ctx_train = 32768

So llama.cpp (or koboldcpp) just assumes that no NTK scaling is needed up to a context size of 32768 and leaves the rope freq base at 10000, which I think is correct behavior. I don't know why the model has its n_ctx_train parameter set to 32768 instead of 8192; maybe that's a mistake?
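
If you want to check that parameter yourself, here is a minimal sketch that reads it straight from the GGUF metadata (this assumes the gguf Python package from the llama.cpp repo and that the relevant key is llama.context_length; the exact field API may differ between gguf versions):

from gguf import GGUFReader

# Read the trained context size (n_ctx_train) from the GGUF metadata
reader = GGUFReader("synthia-7b-v2.0.Q8_0.gguf")  # example path, use your own model file
field = reader.fields["llama.context_length"]     # the llama-arch key behind n_ctx_train
print(field.parts[-1][0])                         # prints 32768 for these Mistral finetunes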

[–] mll59@alien.top 1 points 10 months ago (2 children)

I've been playing around with this. The standard model uses a rope freq base of 10000. At that freq base it can handle slightly more than the 8K tokens it was trained on (according to the Mistral AI info) before producing garbage, at roughly 9K tokens.

However, when I use a rope freq base of 45000, I can have reasonable conversations up to more than 25K tokens, at least with some of the Mistral models. Not all of the Mistral models are still very coherent at 25K tokens, but the dolphin 2.1 model and some others still work quite well. For the details see: here
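
For what it's worth, 45000 is in the same ballpark as the usual NTK-aware rule of thumb for picking a new rope freq base. A minimal sketch of that estimate, assuming Mistral 7B's head dimension of 128 and the common new_base = old_base * scale^(dim/(dim-2)) heuristic (the numbers are only a starting point, not a guarantee):

# Rough NTK-aware estimate of the rope freq base for a longer target context
old_base = 10000.0     # default rope freq base of these Mistral models
head_dim = 128         # Mistral 7B head dimension (4096 hidden size / 32 heads)
train_ctx = 8192       # context size the model was trained on
target_ctx = 25000     # context size I want to reach
scale = target_ctx / train_ctx
new_base = old_base * scale ** (head_dim / (head_dim - 2))
print(round(new_base)) # about 31000 with these numbers; empirically 45000 worked better for me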