Murky-Ladder8684

joined 1 year ago
[–] Murky-Ladder8684@alien.top 1 points 11 months ago

If you learn AutoGen, you could assign each model to a different agent and have them interact. If using the same model and having multiple characters talk is your thing, then the SillyTavern group chat option is the way to go.
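For what it's worth, here's a minimal sketch of the AutoGen side: two agents, each pointed at a different local OpenAI-compatible endpoint, chatting in a group. The endpoints, ports, and model names are placeholders (not from the thread), and the config-list keys can differ slightly between pyautogen versions.

```python
# Minimal AutoGen sketch: two agents backed by different local models talking
# to each other in a group chat. Endpoints and model names are hypothetical.
import autogen

config_model_a = [{"model": "model-a", "base_url": "http://localhost:5001/v1", "api_key": "none"}]
config_model_b = [{"model": "model-b", "base_url": "http://localhost:5002/v1", "api_key": "none"}]

agent_a = autogen.AssistantAgent(
    name="Agent_A",
    system_message="You are character A.",
    llm_config={"config_list": config_model_a},
)
agent_b = autogen.AssistantAgent(
    name="Agent_B",
    system_message="You are character B.",
    llm_config={"config_list": config_model_b},
)

user = autogen.UserProxyAgent(
    name="User",
    human_input_mode="NEVER",
    code_execution_config=False,
)

group = autogen.GroupChat(agents=[user, agent_a, agent_b], messages=[], max_round=6)
manager = autogen.GroupChatManager(groupchat=group, llm_config={"config_list": config_model_a})

user.initiate_chat(manager, message="Introduce yourselves and start a short conversation.")
```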

[–] Murky-Ladder8684@alien.top 1 points 11 months ago

I've been checking out the latest models from people tweaking Goliath 120B. I found this one to be the best by far regarding that issue and the strange spelling quirks. Might be worth giving it a try to compare for yourself: https://huggingface.co/LoneStriker/Tess-XL-v1.0-4.85bpw-h6-exl2 (LoneStriker has other bpw sizes)

[–] Murky-Ladder8684@alien.top 1 points 11 months ago

Check out turbo's project https://github.com/turboderp/exui

He put it up not long ago and has speculative decoding working in it. I tried it with Goliath 120B 4.85bpw EXL2 and was getting 11-13 t/s vs. 6-8 t/s without it. It's barebones but it works.
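If you'd rather script it than use the UI, the same speculative decoding setup is exposed in the exllamav2 library: a small draft model proposes tokens and the big model verifies them. This is only a rough sketch assuming the streaming generator takes a draft model/cache the way the repo's speculative examples do; the paths and the choice of draft model are placeholders, and the exact constructor arguments may differ between versions, so check the current exllamav2 examples.

```python
# Speculative decoding sketch with exllamav2: a small draft model proposes
# tokens, the large model (e.g. a Goliath 120B EXL2 quant) verifies them.
# Model paths are placeholders; draft and main model must share a vocabulary.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

def load(model_dir):
    config = ExLlamaV2Config()
    config.model_dir = model_dir
    config.prepare()
    model = ExLlamaV2(config)
    cache = ExLlamaV2Cache(model, lazy=True)
    model.load_autosplit(cache)          # spread layers across available GPUs
    return model, cache, ExLlamaV2Tokenizer(config)

main_model, main_cache, tokenizer = load("/models/goliath-120b-4.85bpw-exl2")
draft_model, draft_cache, _       = load("/models/tinyllama-1.1b-exl2")

generator = ExLlamaV2StreamingGenerator(
    main_model, main_cache, tokenizer,
    draft_model, draft_cache,            # enables speculative decoding
)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8

ids = tokenizer.encode("Once upon a time")
generator.begin_stream(ids, settings)
for _ in range(200):
    chunk, eos, _ = generator.stream()
    print(chunk, end="", flush=True)
    if eos:
        break
```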

[–] Murky-Ladder8684@alien.top 1 points 11 months ago

In the instructions on GitHub it says to use mono 24000 Hz WAV. Double-check the info, though.
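If you need to get a sample into that format, something like this should do it (filenames are placeholders; it just downmixes to mono and resamples to 24 kHz):

```python
# Convert an arbitrary audio clip to mono, 24000 Hz, 16-bit PCM WAV.
# Input/output filenames are placeholders.
import librosa
import soundfile as sf

audio, sr = librosa.load("voice_sample.mp3", sr=24000, mono=True)   # resample + downmix
sf.write("voice_sample_24k_mono.wav", audio, sr, subtype="PCM_16")
```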

[–] Murky-Ladder8684@alien.top 1 points 11 months ago

That series of Nvidia GPUs didn't have tensor cores yet; I believe those started with the 20xx series on the consumer side. I'm not sure how much that matters for inference vs. training/fine-tuning, so it's worth doing more research. From what I gathered, the answer is "no" unless you use a 10xx card for something like monitor output, TTS, or another small co-LLM task that you don't want taking VRAM away from your main LLM GPUs.
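If you want to check a specific card, one quick way is its CUDA compute capability: tensor cores arrived with Volta (7.0) and reached consumer cards with Turing/20xx (7.5), while Pascal 10xx cards report 6.x. A small PyTorch check:

```python
# Report each visible GPU's compute capability; 7.0+ means the card has
# tensor cores (Volta/Turing and newer), while Pascal 10xx cards report 6.x.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    has_tensor_cores = (major, minor) >= (7, 0)
    print(f"GPU {i}: {name} (sm_{major}{minor}) tensor cores: {has_tensor_cores}")
```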

[–] Murky-Ladder8684@alien.top 1 points 11 months ago

For comparison's sake, the EXL2 4.85bpw version runs around 6-8 t/s on 4x3090s; at 8k context it sits at the lower end of that range.

[–] Murky-Ladder8684@alien.top 1 points 11 months ago

4x3090s will run it at over 4 bits per weight.
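A rough back-of-the-envelope check on why that fits (the parameter count and the idea of "headroom" for cache/overhead are approximations for illustration, not measured numbers):

```python
# Back-of-the-envelope VRAM estimate for a ~120B-parameter model at a given
# EXL2 bitrate on 4x RTX 3090 (24 GB each). Rough approximation only.
params = 120e9          # ~120B parameters
bpw = 4.85              # bits per weight (EXL2 quant)
gpus = 4
vram_per_gpu_gb = 24

weights_gb = params * bpw / 8 / 1e9          # ≈ 72.8 GB for the weights
total_vram_gb = gpus * vram_per_gpu_gb       # 96 GB across four cards
headroom_gb = total_vram_gb - weights_gb     # left for KV cache, activations, overhead

print(f"Weights: ~{weights_gb:.1f} GB, total VRAM: {total_vram_gb} GB, "
      f"headroom: ~{headroom_gb:.1f} GB")
```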