very-cis-femgirl

joined 10 months ago
very-cis-femgirl@alien.top 1 points 10 months ago

Yes, it works well, but my problem is the initial processing, which takes so long that the API connection times out before it's done.

 

I'm trying to run Mistral 7B on my laptop. The inference speed is fine (~10 T/s), but prompt processing takes very long when the context gets bigger (it also runs at around 10 T/s). I've tried quantizing the model, but that only speeds up generation, not prompt processing. I've also tried OpenBLAS, but that didn't provide much of a speedup. I'm using koboldcpp's prompt cache, but that doesn't help with the initial load, which is so slow that the connection times out.
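
To put rough numbers on this, here is a minimal sketch of the arithmetic, assuming prompt processing speed stays constant at about 10 T/s regardless of context length. The 60-second timeout is a made-up placeholder (the actual client timeout is not known here), and the function names are only for illustration.

```python
def prompt_processing_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Estimate how long the initial prompt pass takes at a constant speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


def will_time_out(prompt_tokens: int, tokens_per_second: float, timeout_seconds: float) -> bool:
    """True if the estimated prompt pass takes longer than the client timeout."""
    return prompt_processing_seconds(prompt_tokens, tokens_per_second) > timeout_seconds


if __name__ == "__main__":
    SPEED = 10.0     # T/s, the prompt processing speed reported above
    TIMEOUT = 60.0   # seconds, hypothetical API timeout (assumption)
    for tokens in (512, 2048, 4096):
        secs = prompt_processing_seconds(tokens, SPEED)
        print(f"{tokens} tokens: {secs:.1f} s, times out: {will_time_out(tokens, SPEED, TIMEOUT)}")
```

At this speed, a full 4k context needs roughly 400 seconds for the first pass alone, so any timeout in the range of a minute or two will be hit well before generation starts.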

From my other testing, smaller models are faster at prompt processing, but they tend to completely ignore my prompts and just go off in random directions.

So my question is: 1) is there a way to speed up prompt processing for Mistral (using koboldcpp, preferably), or 2) if not, are there any coherent models around 3B parameters that support contexts of around 4k?

Edit: I misremembered the generation speed. It's around 10 T/s for generation only. I've corrected it in the original post.