I run my models locally only, and my best experience has been using mirostat to combat repetition and samey regenerations. Before that I used contrastive search with some success.
I have to say though, I'm not sure mirostat would be a good fit through OpenRouter. Doesn't it keep its own internal state? As far as I understand it, it's not really caching generated tokens; it tracks a running "surprise" value over the tokens it has produced so far and keeps nudging the sampling cutoff toward the tau target, so the backend has to carry that state across the whole generation.
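To make that concrete, here's a rough numpy sketch of what a single mirostat v2 step does, loosely based on how llama.cpp implements it (function name and details are just for illustration, not anyone's exact code):

```python
import numpy as np

def mirostat_v2_step(logits, mu, tau=3.0, eta=1.0, rng=np.random.default_rng()):
    """One mirostat v2 sampling step (rough sketch, llama.cpp-style).

    mu is the per-sequence state: it starts at 2*tau and gets nudged after
    every token so the observed surprise stays close to tau.
    """
    # softmax over the raw logits
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # drop every candidate whose surprise (-log2 p) exceeds mu, keep at least one
    order = np.argsort(-probs)
    surprise = -np.log2(probs[order])
    keep = order[surprise <= mu]
    if keep.size == 0:
        keep = order[:1]

    # renormalize what's left and sample from it
    kept_probs = probs[keep] / probs[keep].sum()
    idx = rng.choice(keep.size, p=kept_probs)
    token = keep[idx]

    # move mu toward the target surprise tau (this is the "memory" part)
    observed = -np.log2(kept_probs[idx])
    mu = mu - eta * (observed - tau)
    return token, mu

# mu has to survive across the whole generation:
#   mu = 2 * tau
#   for each step: token, mu = mirostat_v2_step(logits, mu, tau, eta)
```

So whether it behaves properly over an API depends on the backend keeping that mu value per request, which is why I'm unsure about OpenRouter.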
Anyways, for 70b xwin & lzlv, my settings have been simple: everything on default values (1 or 0), mirostat mode=2, tau=2-3, eta=1. This gets me great responses, zero repetition, high variety when regenerating, and not too many hallucinations. These settings seem pretty stable. I sometimes tweak tau or raise/lower the temp, but eventually always end up at those settings again.
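For what it's worth, here's how those values would look spelled out with llama-cpp-python (just an illustration; the model path and prompt are placeholders, and your frontend probably exposes the same knobs under slightly different names):

```python
from llama_cpp import Llama

# placeholder path and context size, adjust to your own setup
llm = Llama(model_path="./xwin-70b.Q4_K_M.gguf", n_ctx=4096)

out = llm.create_completion(
    "Your prompt here",
    max_tokens=512,
    # "everything on default" = neutral values, so only mirostat shapes the output
    temperature=1.0,
    top_p=1.0,
    top_k=0,             # 0 disables top-k in llama.cpp
    repeat_penalty=1.0,  # 1.0 = off
    # the part that actually matters
    mirostat_mode=2,
    mirostat_tau=3.0,    # I float between 2 and 3
    mirostat_eta=1.0,
)
print(out["choices"][0]["text"])
```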
But for the new 34b Yi fine-tunes, for example, these settings don't work. It's like I'm back in the early days of llama2, with exactly the problems you mentioned: the models start to loop and repeat, and not just within a single response, they also repeat previous responses verbatim, reuse the same phrases again and again, don't know when to stop, etc. I haven't found good, stable settings for those so far no matter what I change (they seem to prefer low temp though), which is so frustrating, as they are great otherwise. So mirostat is not a magic bullet, it seems.
Can't say anything about goliath unfortunately (haven't used it).