this post was submitted on 11 Nov 2023
LocalLLaMA

Community to discuss about Llama, the family of large language models created by Meta AI.

I'm trying to run Mistral 7B on my laptop. Generation speed is fine (~10 T/s), but prompt processing takes very long as the context gets bigger (it also runs at around 10 T/s). I've tried quantizing the model, but that only speeds up generation, not prompt processing. I've also tried OpenBLAS, but that didn't provide much of a speedup. I'm using koboldcpp's prompt cache, but that doesn't help with initial load times, which are so slow that the connection times out.

From my other testing, smaller models are faster at prompt processing, but they tend to completely ignore my prompts and just go off in random directions.

So my question is: 1) is there a way to speed up prompt processing for Mistral (using koboldcpp, preferably), or 2) if not, are there any coherent models around 3B parameters that support contexts around 4k?

Edit: I misremembered the generation speed. It's around 10 T/s for generation only. I've corrected it in the original post.

[–] very-cis-femgirl@alien.top 1 points 10 months ago

Yes, it works well, but my problem is the initial processing, which takes so long that the API connection times out before it's done.
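
For a rough sense of scale, here is a minimal back-of-the-envelope sketch of how long the first prompt pass takes at these speeds. It assumes a full ~4k-token prompt (4096 tokens) processed at the ~10 T/s mentioned in the post. The 60-second client timeout is a made-up placeholder, not a koboldcpp setting, and the function names are just for illustration.

```python
def prompt_processing_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Estimate how long it takes to process a prompt at a fixed speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


def fits_in_timeout(prompt_tokens: int, tokens_per_second: float,
                    timeout_seconds: float) -> bool:
    """True if the estimated processing time is within the client timeout."""
    return prompt_processing_seconds(prompt_tokens, tokens_per_second) <= timeout_seconds


if __name__ == "__main__":
    # Assumed values: 4096-token prompt, ~10 T/s prompt processing.
    secs = prompt_processing_seconds(4096, 10.0)
    print(f"{secs:.0f} s (~{secs / 60:.1f} min)")

    # Hypothetical 60 s client timeout, chosen only for illustration.
    print(fits_in_timeout(4096, 10.0, 60.0))
```

Under those assumptions the first pass over a full context takes several minutes, so any client timeout shorter than that will fire before generation even starts.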