I've been checking out the latest models from people tweaking Goliath 120b. I found this one to be the best by far regarding that issue and the strange spelling stuff. Might be worth giving it a try to compare for yourself: https://huggingface.co/LoneStriker/Tess-XL-v1.0-4.85bpw-h6-exl2 (LoneStriker has other bpw sizes)
Check out turbo's project https://github.com/turboderp/exui
He just put it up not long ago, and he has speculative decoding working in it. I tried it with Goliath 120b 4.85bpw exl2 and was getting 11-13 t/s vs 6-8 t/s without it. It's barebones, but it works.
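If you're curious why speculative decoding buys that speedup, here's a toy sketch of the idea in Python. The `draft_next`/`target_next` functions are made-up stand-ins, not exui's or exllamav2's actual API: a cheap draft model guesses a few tokens ahead, the big model checks them, and every correct guess is a token you got without a separate big-model step.

```python
def target_next(tokens):
    # Stand-in for the big model's greedy next-token choice.
    return (tokens[-1] * 7 + 3) % 100

def draft_next(tokens, k=4):
    # Stand-in for the small draft model: mostly agrees with the target,
    # but deliberately flubs some tokens so rejection is visible.
    out, cur = [], list(tokens)
    for _ in range(k):
        t = target_next(cur)
        if len(cur) % 5 == 0:
            t = (t + 1) % 100  # a wrong guess
        out.append(t)
        cur.append(t)
    return out

def speculative_step(tokens, k=4):
    # Draft k tokens, then verify them against the target model. In a real
    # implementation the verification is one batched forward pass over all
    # k positions, which is where the speed comes from.
    proposed = draft_next(tokens, k)
    accepted = list(tokens)
    for tok in proposed:
        correct = target_next(accepted)
        if tok == correct:
            accepted.append(tok)       # free token: no extra big-model step
        else:
            accepted.append(correct)   # take the target's token and stop
            break
    else:
        accepted.append(target_next(accepted))  # bonus token when all k pass
    return accepted

seq = [1]
for _ in range(4):
    seq = speculative_step(seq)
print(seq)
```

When the draft model agrees often (as a small model paired with Goliath tends to), you get several tokens per big-model round instead of one.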
In the instructions on GitHub it said to use mono 24000 Hz WAV. Double-check the info though.
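If it helps, here's one way to get a file into that format, assuming ffmpeg is installed; the filenames are placeholders:

```python
# Convert any input audio to mono 24000 Hz WAV via ffmpeg.
import subprocess

subprocess.run([
    "ffmpeg", "-i", "input.wav",  # source file (placeholder name)
    "-ac", "1",                   # downmix to a single (mono) channel
    "-ar", "24000",               # resample to 24 kHz
    "output.wav",
], check=True)
```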
That series of Nvidia GPUs didn't have tensor cores yet; I believe those started with the 20xx series. I'm not sure how much that impacts inference vs. training/fine-tuning, but it's worth doing more research. From what I gathered, the answer is "no" unless you use a 10xx for something like monitor output, TTS, or other smaller co-LLM use that you don't want taking VRAM away from your main LLM GPUs.
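As a sketch of what I mean by that co-LLM setup, here's how you might pin a small helper model to the 10xx card in PyTorch so it never touches your main GPUs' VRAM. The device index and the tiny placeholder model are assumptions for illustration:

```python
import torch

# Assume the 10xx card enumerates as cuda:1 on this machine; check with
# torch.cuda.get_device_name() per index before relying on this.
side_device = torch.device("cuda:1")

# Placeholder for a small TTS/helper model; load yours the same way.
small_model = torch.nn.Linear(256, 256).to(side_device)

x = torch.randn(1, 256, device=side_device)
with torch.no_grad():
    y = small_model(x)  # runs entirely on the secondary GPU
print(y.shape)
```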
For comparison's sake, the EXL2 4.85bpw version runs around 6-8 t/s on 4x3090s; at 8k context it's at the lower end of that range.
4x3090s will run it at over 4 bits per weight.
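Rough napkin math on why that works, ignoring KV cache and activation overhead (so treat it as a lower bound):

```python
params = 120e9          # ~120B parameters
bpw = 4.85              # EXL2 bits per weight
weights_gib = params * bpw / 8 / 1024**3
total_vram_gb = 4 * 24  # four 3090s at 24 GB each
print(f"weights: ~{weights_gib:.1f} GiB of {total_vram_gb} GB total")
# ~67.8 GiB of weights, leaving room for 8k context
```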
If you learn AutoGen, you could assign each model to a different agent and have them interact. If using the same model and having multiple characters talk is your thing, then the SillyTavern group chat option is the way to go.
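For the AutoGen route, a minimal sketch of the one-model-per-agent idea looks like this. The endpoints and model names are placeholders for whatever OpenAI-compatible servers you're running locally, and the exact API may differ between AutoGen versions:

```python
import autogen

goliath_cfg = {"config_list": [{
    "model": "goliath-120b",                 # placeholder model name
    "base_url": "http://localhost:5000/v1",  # placeholder local endpoint
    "api_key": "not-needed",
}]}
small_cfg = {"config_list": [{
    "model": "small-model",                  # a different model per agent
    "base_url": "http://localhost:5001/v1",
    "api_key": "not-needed",
}]}

writer = autogen.AssistantAgent(
    name="writer", system_message="You write replies.", llm_config=goliath_cfg)
critic = autogen.AssistantAgent(
    name="critic", system_message="You critique the writer.", llm_config=small_cfg)
user = autogen.UserProxyAgent(
    name="user", human_input_mode="NEVER", code_execution_config=False)

chat = autogen.GroupChat(agents=[user, writer, critic], messages=[], max_round=6)
manager = autogen.GroupChatManager(groupchat=chat, llm_config=goliath_cfg)
user.initiate_chat(manager, message="Draft a short intro, then critique it.")
```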