this post was submitted on 11 Nov 2023
1 points (100.0% liked)

LocalLLaMA

I'm trying to run Mistral 7B on my laptop. The generation speed is fine (~10 T/s), but prompt processing takes very long once the context grows (also around 10 T/s). I've tried quantizing the model, but that only speeds up generation, not prompt processing. I've also tried OpenBLAS, but it didn't provide much of a speedup. I'm using koboldcpp's prompt cache, but that doesn't help with the initial load, which is so slow that the connection times out.
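For illustration, here's a minimal sketch of how I'd measure the prefill time on its own, using the llama-cpp-python bindings rather than koboldcpp; the model path and prompt are placeholders, not my actual setup. Asking for a single output token makes the call almost entirely prompt processing.

```python
# Minimal sketch, placeholder model path; koboldcpp wraps the same llama.cpp code underneath.
import time
from llama_cpp import Llama

llm = Llama(model_path="mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096)  # placeholder GGUF

long_prompt = "hello " * 1500  # stand-in for a long chat history

start = time.time()
llm(long_prompt, max_tokens=1)  # generating one token makes this call almost pure prefill
print(f"prompt processing took {time.time() - start:.1f}s")
```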

From my other testing, smaller models are faster at prompt processing, but they tend to completely ignore my prompts and just go off in random directions.

So my questions are: 1) is there a way to speed up prompt processing for Mistral (preferably with koboldcpp), or 2) if not, are there any coherent models around 3B parameters that support a context of around 4k?

Edit: I misremembered the generation speed; it's around 10 T/s for generation only. I've corrected it in the original post.

top 3 comments
[–] yamosin@alien.top 1 points 10 months ago (1 children)

NEW FEATURE: Context Shifting (A.K.A. EvenSmarterContext) - This feature utilizes KV cache shifting to automatically remove old tokens from context and add new ones without requiring any reprocessing. So long as you use no memory/fixed memory and don't use world info, you should be able to avoid almost all reprocessing between consecutive generations even at max context. This does not consume any additional context space, making it superior to SmartContext.

Have you tried this new koboldcpp feature?

[–] very-cis-femgirl@alien.top 1 points 10 months ago

Yes, it works well, but my problem is the initial processing, which takes so long that the API connection times out before it's done
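In case it helps anyone else, a rough client-side workaround is to raise the request timeout. The sketch below targets koboldcpp's KoboldAI-compatible endpoint as far as I know (default port 5001); the prompt and field values are placeholders.

```python
# Rough sketch of a client-side workaround: give the request a much longer timeout
# so a slow prefill doesn't drop the connection. URL and fields follow koboldcpp's
# KoboldAI-compatible API as far as I know; adjust to your setup.
import requests

payload = {
    "prompt": "...",    # full chat history / prompt goes here
    "max_length": 200,  # number of tokens to generate
}

resp = requests.post(
    "http://localhost:5001/api/v1/generate",  # koboldcpp's default port
    json=payload,
    timeout=600,  # allow up to 10 minutes for the initial prompt processing
)
print(resp.json()["results"][0]["text"])
```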

[–] vasileer@alien.top 1 points 10 months ago

On quality: if you go with a smaller model, or even a different model, you will lose quality, as Mistral (and its finetunes) is the best among <70B models. Another rule of thumb is that a bigger quantized model (even at 2 bits) is better than a smaller unquantized one.

On speed: the fastest inference comes from Q4_K_S: https://github.com/ggerganov/llama.cpp/pull/1684
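If it helps, here is a minimal sketch of loading a Q4_K_S GGUF with the llama-cpp-python bindings; the filename, context size, and thread count are placeholders to adapt to your own files and hardware.

```python
# Minimal sketch: load a Q4_K_S GGUF with llama-cpp-python. Filename, context size,
# and thread count are placeholders, not recommendations from this thread.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-instruct-v0.1.Q4_K_S.gguf",  # placeholder GGUF filename
    n_ctx=4096,   # the ~4k context discussed above
    n_threads=8,  # tune to your CPU
)

out = llm("Summarize why quantization speeds up generation.", max_tokens=64)
print(out["choices"][0]["text"])
```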