Using Oobabooga's Webui on cloud.

I didn't notice it immediately, but apparently once I exceed the context limit, or shortly after that point, inference time increases significantly. For example, at the beginning of the conversation a single message generates at about 13-16 tps. After reaching the threshold, the speed keeps decreasing until it drops to around 0.1 tps.

Not only that, but the text also starts repeating. For example, certain character features or actions start coming up in almost every subsequent message with nearly identical wording, like a broken record. It's not impossible to steer the plot forward, but it gets tiring, especially with the huge delay on top of that.

Is there any solution or workaround for these problems?

Former-Ad-5757@alien.top 1 points 10 months ago

What do you want to happen when the total chat reaches 8k? At that point the server has to make a choice: it can keep adding more context, which slows it down; it can simply cut off the first messages, but then it will, for example, forget its own name; or it can (this is a method I use, but it costs inference time, since you ask a second question behind the scenes) ask the model to summarize the first 4k of the context, so it retains some context while still retaining speed.
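
To make the third option concrete, here is a minimal sketch of that rolling-summarization idea. It assumes a hypothetical `generate(prompt)` callable that sends a completion request to your backend (e.g. the webui's API); the function names, token limits, and the whitespace token counter are all illustrative, not Oobabooga's actual API:

```python
CONTEXT_LIMIT = 8192   # model's context window, in tokens (assumed)
SUMMARIZE_AT = 4096    # compact the oldest half once we pass this

def count_tokens(text: str) -> int:
    # Placeholder: a real implementation would use the model's tokenizer.
    return len(text.split())

def compact_history(messages: list[str], generate) -> list[str]:
    """Replace the oldest half of the chat with a model-written summary
    once the running token count crosses SUMMARIZE_AT."""
    total = sum(count_tokens(m) for m in messages)
    if total <= SUMMARIZE_AT:
        return messages  # still cheap to send the full history as-is

    half = len(messages) // 2
    old, recent = messages[:half], messages[half:]

    # This is the second, behind-the-scenes request: it costs extra
    # inference time now, but keeps every later prompt short.
    summary = generate(
        "Summarize the following conversation, keeping names, "
        "facts, and unresolved plot points:\n\n" + "\n".join(old)
    )
    return ["[Summary of earlier conversation] " + summary] + recent
```

Run the history through `compact_history` before building each prompt, so instead of silently truncating, the model keeps a condensed memory of its own name and earlier plot events while the prompt stays well under the window.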