You convinced me to finally try Goliath! I can't run it locally, so I rented a cloud GPU just for this. With 80GB of VRAM, it fits the largest EXL2 quant currently available.
Verdict: It's absolutely incredible compared to the small models!!! Finally, I'm not constantly swiping responses after it produces nonsense; instead, it generates great responses on the first try every time, doesn't talk as me, and doesn't constantly confuse characters and logic!
But the tiny 4096 context is very limiting, and I hit it very quickly in my conversation. I tried scaling up the context size in the loader parameters, but that made it perform noticeably worse... it stopped generating multiple paragraphs, started dropping formatting symbols, and so on.
Is that the expected result? There's no magic way to run these models with huge contexts yet, right?
After some experimenting, I think I found a good workaround. The trick is to start any new chat with the context set to 4k so the model is at maximum quality, and then, once that fills up, reload the model with the context set to 16k. That seems to give it enough data to keep generating very good responses.
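For reference, here's roughly what that reload step looks like if you were driving the EXL2 quant through the exllamav2 Python API directly. This is just a sketch of the idea, not what my backend necessarily does under the hood; the model path and alpha value are placeholders, and most loaders expose the same max_seq_len / alpha settings in their UI anyway.

```python
# Rough sketch of the "reload with a bigger context" step using the
# exllamav2 Python API. Path and alpha value are placeholders; adjust
# for your own setup/backend.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/goliath-120b-exl2"  # placeholder path
config.prepare()

# Native context is 4096. For the 16k reload, raise max_seq_len and
# apply RoPE/NTK alpha scaling so positions past 4k remain usable.
config.max_seq_len = 16384
config.scale_alpha_value = 4.0  # assumption: ~4x alpha for a ~4x context stretch

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)  # KV cache sized to max_seq_len
model.load_autosplit(cache)               # split layers across available VRAM
tokenizer = ExLlamaV2Tokenizer(config)
```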
It will drop the leading asterisk on responses for some inexplicable reason, but this is easily fixed by just adding that to the response prefix in SillyTavern.