[–] Unequaled@alien.top 1 points 10 months ago

/u/WolframRavenwolf

Honestly, ever since I saw someone mention that with EXL2 I could run a 70b model on a single 4090/3090 with 24 GB of VRAM, I was instantly hooked. Especially since enabling the 8-bit cache option meant you could run even higher context sizes, sometimes up to 2x more.
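Back-of-the-envelope sketch of why the 8-bit cache roughly doubles usable context: the KV cache scales linearly with bytes per element, so halving element size halves cache memory per token. The dimensions below are assumed for Llama-2-70B (80 layers, 8 KV heads via GQA, head_dim 128); check your model's config, and note this is an illustration, not exllamav2's exact accounting:

```python
def kv_cache_bytes(ctx_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Approximate KV-cache size in bytes for a given context length.

    Assumed dims are for Llama-2-70B with grouped-query attention.
    The leading 2x covers the separate K and V tensors.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx_len

fp16_cache = kv_cache_bytes(4096)                      # FP16 cache (2 bytes/elem)
int8_cache = kv_cache_bytes(4096, bytes_per_elem=1)    # 8-bit cache (1 byte/elem)
print(fp16_cache / 2**30, int8_cache / 2**30)          # → 1.25 0.625 (GiB)
```

Same VRAM budget, half the bytes per cached element, so roughly twice the context fits.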

The main advantage, as you mention, is speed. As an RP'er myself, I care somewhat less about response quality. Speed is king in my opinion, since you can always swipe for more alternative responses. It's very hard to let go of 20-30 T/s and go back to <5 T/s on GGUF. 😭

The baseline quality of a 70b model is good enough to justify the tradeoff from heavier quantization. Besides, I don't have to buy ANOTHER 4090 to run 70b models.

Personally, I run the waldie_lzlv-limarpv3-l2-70b-2.4bpw-h6-exl2 version of lzlv. For one, it isn't broken, and it seems to give somewhat better, more creative responses.

Side note: Did you notice with Nous Capybara 34b that spelling mistakes or weird sentences would form at longer contexts? Because sometimes I would get weird nonsensical sentences, or stuff like "I'll'", or even a Chinese character.