this post was submitted on 11 Nov 2023

LocalLLaMA


I'm trying to run Mistral 7B on my laptop. Generation speed is fine (~10 T/s), but prompt processing takes very long as the context grows (also around 10 T/s). I've tried quantizing the model, but that only speeds up generation, not prompt processing. I've also tried OpenBLAS, but that didn't provide much speedup. I'm using koboldcpp's prompt cache, but that doesn't help with the initial load, which is so slow that the connection times out.
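For a rough sense of scale, here is a minimal sketch of the arithmetic behind those timeouts. It assumes prompt processing runs at a constant rate, which is a simplification; the function name and the example prompt lengths are made up for illustration, and only the ~10 T/s figure comes from the numbers above.

```python
def prompt_processing_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Estimate the wait before the first generated token appears.

    Assumes a constant prompt-processing rate (a simplification).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


# 10 T/s is the rate reported above; the prompt lengths are illustrative.
for n_tokens in (512, 2048, 4096):
    secs = prompt_processing_seconds(n_tokens, 10.0)
    print(f"{n_tokens:>5} tokens -> {secs:6.1f} s ({secs / 60:.1f} min)")
```

At that rate a full 4096-token prompt takes about 410 seconds, close to seven minutes, before generation even starts, which would explain why a client with a shorter timeout gives up on the first request.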

From my other testing, smaller models are faster at prompt processing, but they tend to completely ignore my prompts and just go off in random directions.

So my question is: 1) is there a way to speed up prompt processing for Mistral (preferably using koboldcpp), or 2) if not, are there any coherent models around 3B parameters that support contexts of around 4k?

Edit: I misremembered the generation speed. It's around 10 T/s for generation only. I've corrected it in the original post.

vasileer@alien.top 1 points 10 months ago

On quality: if you switch to a smaller model, or even a different model, you will lose quality, since Mistral (and its fine-tunes) is the best among models under 70B. Another rule of thumb is that a bigger model that has been quantized (even to 2 bits) is better than a smaller unquantized one.
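To put numbers on that rule of thumb, here is a back-of-the-envelope sketch of weight storage. It uses nominal bit widths only and assumes the unquantized model is stored in 16-bit floats; real quantized formats add some per-block metadata, so actual files are somewhat larger. The function name is hypothetical, and the sketch says nothing about quality, only about size.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes).

    Nominal estimate: ignores quantization metadata and runtime buffers.
    """
    return n_params * bits_per_weight / 8 / 1e9


# A 7B model at a nominal 2 bits vs a 3B model at 16 bits (assumed fp16).
big_quantized = weight_size_gb(7e9, 2)
small_unquantized = weight_size_gb(3e9, 16)

print(f"7B @ 2-bit : {big_quantized:.2f} GB")
print(f"3B @ 16-bit: {small_unquantized:.2f} GB")
```

Under these assumptions the quantized 7B model needs well under a third of the memory of the unquantized 3B one, so the bigger model is not ruled out on memory grounds.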

On speed: the fastest inference comes from Q4_K_S; see https://github.com/ggerganov/llama.cpp/pull/1684