cleverestx

joined 1 year ago
[–] cleverestx@alien.top 1 points 11 months ago

Why can't we get a 20-34B version of this very capable Mistral?

[–] cleverestx@alien.top 1 points 1 year ago

I have an RTX 4090, 96GB of RAM, and an i9-13900K CPU, and I still keep going back to 20B (4-6 bpw) models because of the awful performance of 70B models, even though a 2.4 bpw quant is supposed to fit entirely in VRAM... even using ExLlamaV2...
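Napkin math on why it only *barely* fits (the overhead figure is my own round assumption, not a measurement):

```python
# Back-of-envelope VRAM estimate for a 2.4 bpw 70B EXL2 quant on a 24 GB card.
# The overhead number is an assumption for illustration, not a measurement.
params = 70e9                              # ~70B parameters
bpw = 2.4                                  # bits per weight in the EXL2 quant
weights_gib = params * bpw / 8 / 1024**3   # ~19.6 GiB of weights

vram_gib = 24                              # RTX 4090
overhead_gib = 1.5                         # assumed: CUDA context, display, fragmentation
kv_budget_gib = vram_gib - weights_gib - overhead_gib

print(f"weights: ~{weights_gib:.1f} GiB, "
      f"left for KV cache/activations: ~{kv_budget_gib:.1f} GiB")
```

If those numbers are roughly right, only ~3 GiB is left for the KV cache, so any sizable context starts spilling into system RAM, which would explain the sub-1 token/sec speeds.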

What is your trick for getting better performance? Unless I drop to a cramped 2048 context, generation speed is unusable (under 1 token/sec). What context size and settings are you using? Thank you.
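For reference, here's a minimal sketch of what I understand a direct ExLlamaV2 load with a capped context to look like (the model path is made up, and the calls follow the library's example scripts, so they may differ between versions):

```python
# Minimal ExLlamaV2 sketch: load a 2.4 bpw 70B EXL2 quant with a capped context
# and time generation. The model directory is hypothetical; the API mirrors the
# exllamav2 example scripts and may shift between versions.
import time

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "models/Some-70B-2.4bpw-exl2"   # hypothetical directory
config.prepare()
config.max_seq_len = 4096          # keep the KV cache small enough to stay in VRAM

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
generator.warmup()

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

start = time.time()
output = generator.generate_simple("Once upon a time,", settings, 200)
print(output)
print(f"~{200 / (time.time() - start):.1f} tokens/sec")
```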

[–] cleverestx@alien.top 1 points 1 year ago (1 children)

So far with local models I've just done storybook-style roleplay, without a game system, dice, rolls, etc., which is what I used to do with ChatGPT...

Do you have a prompt template that gamifies it and works well for you that you'd be willing to share?

 

I'm using Windows 11 and Ooba (oobabooga's text-generation-webui). I use SillyTavern as well, but not always.

I've been playing with 20B models (they work great at 4096 context) and 70B ones (too slow unless I drop the context to 2048, which is then usable, but the low context hurts).

What else am I missing? I see there are some 34B models for ExLlamaV2 now, but I'm having issues getting them to work, quality-wise (which profile/preset do I use?) and speed-wise (what context setting? This isn't the 200K-context version)...
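On the context question, rough math for what the KV cache alone costs at different context lengths, assuming Yi-34B-like dimensions (60 layers, 8 GQA KV heads, head dim 128, fp16 cache; these are my assumptions, not numbers from any specific checkpoint's config):

```python
# KV-cache size vs. context length for an assumed 34B architecture
# (60 layers, 8 KV heads via GQA, head_dim 128, fp16 cache). Assumed dims only.
layers, kv_heads, head_dim, fp16_bytes = 60, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes   # K and V per token

for ctx in (2048, 4096, 8192, 200_000):
    print(f"context {ctx:>7}: ~{per_token * ctx / 1024**3:5.1f} GiB KV cache")
```

If that's anywhere near right, a ~4 bpw 34B quant with a 4K-8K context should still fit in 24 GB, while anything approaching the 200K context obviously won't.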

For your recommended model, what are the best settings on a single-card system (4090, 96GB of RAM, i9-13900K)?

Any suggestions for the best experience are appreciated (creative RPG/chat/story usage).

Thank you.