this post was submitted on 17 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

[–] ReturningTarzan@alien.top 1 points 10 months ago (1 children)

Well, it depends on the model and on how you get to that 50k+ context. If it's a single prompt, as in "Please summarize this novel: ...", that's going to take however long it takes. But if the model's context length is 8k, say, then ExUI is only ever going to do prompt processing on up to 8k tokens, and it will maintain a pointer into the context that advances in steps (the configurable "chunk size").

So when you reach the end of the model's native context, it skips ahead by e.g. 512 tokens, and then you'll only have full context ingestion again after a total of 512 tokens of added context. As for the processing time, though, you should never experience over a minute of it on a 3090. I don't know of a model that fits in a 3090 and takes that much time to inference on. Unless you're running into the NVIDIA swapping "feature" because the model doesn't actually fit on the GPU.
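To make the chunked truncation described above concrete, here is a minimal sketch in Python. It is not ExUI's actual code: the class name `RollingContext`, its fields, and the `append` method are hypothetical, and the only behaviour modelled is the one described in this comment (a start pointer that jumps forward by the chunk size when the window is full, which forces one full re-ingestion of the remaining window).

```python
from dataclasses import dataclass, field


@dataclass
class RollingContext:
    """Hypothetical model of a context window that truncates in chunks."""
    max_len: int = 8192      # model's context length (assumed value)
    chunk_size: int = 512    # how far the window start jumps when full
    history: list = field(default_factory=list)  # every token seen so far
    start: int = 0           # index of the first token still in the window
    cached: int = 0          # tokens of the current window already processed

    def append(self, new_tokens):
        """Add tokens and return how many need prompt processing."""
        self.history.extend(new_tokens)
        moved = False
        # Advance the start pointer in whole chunks until the window fits.
        while len(self.history) - self.start > self.max_len:
            self.start += self.chunk_size
            moved = True
        window = len(self.history) - self.start
        # If the window start moved, the cached prefix no longer matches,
        # so the whole window is re-ingested; otherwise only the new tail is.
        to_process = window if moved else window - self.cached
        self.cached = window
        return to_process


if __name__ == "__main__":
    ctx = RollingContext(max_len=8, chunk_size=2)
    for step in range(12):
        print(step, ctx.append([step]))
```

With a small window the pattern is easy to see: most appends cost one token of processing, and every `chunk_size` tokens past the limit there is a single larger re-ingestion of at most `max_len` tokens.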

[–] mcmoose1900@alien.top 1 points 10 months ago

 I don't know of a model that fits in a 3090 and takes that much time to inference on

Yi-34B-200K is the base model I'm using. Specifically the Capybara/Tess tunes.

I can squeeze 63K context on it at 3.5bpw. It's actually surprisingly good at continuing a full context story, referencing details throughout and such.

Anyway, I am on Linux, so there's no GPU swap like on Windows. I am indeed using it for chat/novel-style chat, so the context does scroll and get cached in ooba.
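For a rough sense of why 63K context at 3.5bpw is a squeeze on a 24 GB card, here is a back-of-the-envelope sketch in Python. Everything numeric in it is an assumption rather than something stated in the thread: 34e9 parameters is read off the "34B" in the model name, and the layer count, KV head count, and head dimension are placeholder values that should be checked against the model's `config.json`. The cache precision is left as a parameter, since the thread does not say which one is in use.

```python
def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Approximate KV cache size: keys and values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens


if __name__ == "__main__":
    # All values below are assumptions for illustration only.
    n_params = 34e9
    n_layers, n_kv_heads, head_dim = 60, 8, 128
    n_tokens = 63_000

    w = weights_bytes(n_params, 3.5)
    print(f"weights at 3.5bpw: {w / 1e9:.1f} GB")
    for label, size in (("16-bit", 2), ("8-bit", 1)):
        kv = kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, size)
        print(f"{label} cache: {kv / 1e9:.1f} GB, total {(w + kv) / 1e9:.1f} GB")
```

This ignores activations and framework overhead, so real usage will be somewhat higher than the totals it prints.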