Your issue is using Q8. Be real: you only have 6 GB of VRAM, not 24.
Your hardware can't run Q8 at a decent speed.
Use Q4_K_S instead, so you can offload many more layers to the GPU. There's some quality degradation, yes, but it's not that bad.
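If you're loading the model through the llama-cpp-python bindings, a minimal sketch of that setup is below. The model path and layer count are assumptions, not from this thread; raise `n_gpu_layers` until your 6 GB of VRAM is nearly full.

```python
from llama_cpp import Llama

# Hypothetical path and layer count; tune n_gpu_layers for your card.
llm = Llama(
    model_path="models/your-model.Q4_K_S.gguf",  # Q4_K_S quant instead of Q8_0
    n_gpu_layers=35,   # far more layers fit on a 6 GB card at Q4_K_S than at Q8
    n_ctx=4096,
)

print(llm("Hello,", max_tokens=32)["choices"][0]["text"])
```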
Huh, interesting weave. It did feel like it made fewer spelling and simple errors compared to Goliath.
Once again Euryale's included. The lack of Xwin makes it better imo; Xwin may be smart, but it has repetition issues at long context. That's just my opinion.
I'd honestly scale it down; there's really no need to go to 120B. From testing a while back, ~90-100B frankenmerges have the same effect.
Isn't OpenChat a fine-tune of Mistral?
Why would anyone fine-tune on top of that?
It's not a good idea.
Did you forget to unset the RoPE settings?
CodeLlama needs a different RoPE base than regular Llama does.
Also check your sampler settings.
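A sketch of both knobs with llama-cpp-python, assuming a GGUF CodeLlama file (the path and sampler values are illustrative, not from this thread): CodeLlama was trained with a RoPE base around 1e6, while regular Llama 2 uses 10000, so a leftover hard-coded 10000 will wreck its output.

```python
from llama_cpp import Llama

# Illustrative only: CodeLlama expects rope_freq_base ~= 1e6, not Llama's 10000.
# Passing 0.0 tells llama.cpp to read the value from the GGUF metadata instead.
llm = Llama(
    model_path="models/codellama-13b.Q4_K_S.gguf",  # hypothetical file name
    n_gpu_layers=35,
    rope_freq_base=1_000_000.0,  # or 0.0 to use whatever is baked into the model
)

# Conservative sampler settings for code; these values are a starting point, not gospel.
out = llm(
    "def fibonacci(n):",
    max_tokens=128,
    temperature=0.2,
    top_p=0.95,
    repeat_penalty=1.1,
)
print(out["choices"][0]["text"])
```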