I encounter this a lot with the Yi 34B models, to the point where I've basically stopped using them for chat. I've tried a huge variety of settings, presets, quants, etc. I've used koboldcpp and text-generation-webui, and I've used EXL2, GGML, and GPTQ. The issue appears consistently once the context grows past a certain size: partial or entire messages repeat. It also gets stuck, where regenerating always produces the same response unless I make drastic changes to the settings, and usually that just changes which message it's stuck on. Smaller changes to the settings just result in it slightly changing the wording of the stuck message.
Did you try disabling the BOS token?
Yes, the BOS token is disabled in my parameters
I pretty much gave up trying to make Yi-based models actually use more than 4k context. At that point I'd rather just use Lzlv 70b, which is much smarter, with better prose and knowledge.
The repetition issue pretty much makes the models unusable past the context where it breaks.
Agreed - I’m personally using 70B models at 2.4BPW EXL2 quants, as well. They hold up great even at a small quantization as long as sampling parameters are set correctly, and the models are subjectively more pleasant in prose (Euryale 1.3 and LZLV both come to mind).
At 2.4BPW, they fit into 24GB of VRAM and inference is extremely fast, and EXL2 also appears to be very promising as a quantization method. I believe the potential upsides are yet to be fully leveraged.
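A rough back-of-the-envelope check of the 24GB claim, as a sketch only. It assumes a nominal 70 billion parameters and counts the quantized weights alone; the KV cache and other runtime overhead come on top, so real usage will be higher. The function name is made up for illustration.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in decimal gigabytes."""
    # bits -> bytes (divide by 8) -> gigabytes (divide by 1e9)
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "70B" treated as exactly 70e9 parameters.
size_70b = weight_size_gb(70e9, 2.4)
print(f"70B at 2.4 BPW: ~{size_70b:.1f} GB of weights")
```

Whatever is left of the 24GB after the weights is what context and buffers have to fit into, which is why the bitrate has to be this low on a single card.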
No issues here, just a lot of confidence on certain tokens but overall very little repetition. I use Koboldcpp, Q5_K_M. Don't abuse temp; the model seems to be exceedingly sensitive, and the smallest imbalance breaks its flow. Try temp 0.9, rep pen 1.11, top k 0, min-p 0.1, typical 1, tfs 1.
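For anyone wondering what those numbers do: top k 0, typical 1 and tfs 1 are the "off" values for those samplers, so the real work is done by temperature, rep pen and min-p. Below is a minimal sketch of the temperature plus min-p part in plain Python. It is not koboldcpp's actual code, the logits are made up, and applying temperature before min-p is an assumption (backends differ on sampler order). Repetition penalty is left out.

```python
import math


def min_p_filter(logits, temperature=0.9, min_p=0.1):
    """Return renormalized token probabilities after temperature and min-p."""
    # Temperature scaling: values below 1 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # min-p: drop every token whose probability is below
    # min_p times the probability of the most likely token.
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    kept_total = sum(kept)
    return [p / kept_total for p in kept]


# Hypothetical logits for four candidate tokens.
example_logits = [5.0, 4.0, 2.0, 0.0]
filtered = min_p_filter(example_logits)
print([round(p, 3) for p in filtered])
```

Because the cutoff scales with the top token's probability, min-p prunes hard when the model is confident and leaves more options open when it is not.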
I see, the model does tend to run a bit hot as-is. I’ll go ahead and try these settings out tomorrow.
I'll have to try these settings. I have OP's problems too, and I always have to crank the temperature up to get it to work. Then it gets schizophrenia a few messages later. Thanks!
High temp does more harm than good. I would suggest looking into what the other settings do before raising it, no matter the model.
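To illustrate why, here is a small sketch of how temperature reshapes the token distribution. The logits are invented for the example and do not come from any real model; the point is only the direction of the effect.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits: one strong candidate, one decent, two poor.
logits = [5.0, 4.0, 2.0, 0.0]
cool = softmax_with_temperature(logits, 0.9)
hot = softmax_with_temperature(logits, 2.0)
for name, probs in (("temp 0.9", cool), ("temp 2.0", hot)):
    print(name, [round(p, 3) for p in probs])
```

Raising the temperature moves probability mass from the best candidates onto the poor ones, which is what breaks a stuck repetition but also what makes the output incoherent a few messages later.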
On EXL2, when it started doing that, I cranked the temp to 2.0 rather than using dynamic temperature. That made it go away. Going to try higher rep pen next and see what happens. I'm at 8k context and it's doing it.
I had high hopes for Yi-34B chat, but when I tried it, I found it was not very good.
70B models are better (well of course), but I think even some 20B models are better.
I am having better luck with 2.4BPW EXL2 quants of 70B models from Lone_Striker lately - Euryale 1.3, LZLV, etc.
Even at the smaller quants, they are quite strong at the correct settings. Easily comparable to a 34B at Q4_K_M, in my experience.