LocalLLaMA

A community for discussing Llama, the family of large language models created by Meta AI.

Any tricks to speed up 13B models on a 3090? (alien.top)

submitted 2 years ago by DustGrouchy1792@alien.top to c/localllama@poweruser.forum

Are there any tricks to speed up 13B models on a 3090?

I'm currently using the regular Hugging Face model, quantized to 8-bit by a GPTQ-capable fork of KoboldAI.

It's pretty slow, especially when the context limit changes, and nowhere near real time.

[–] StaplerGiraffe@alien.top 1 points 2 years ago (2 children)

Perhaps you are using the wrong fork of KoboldAI; I get many more tokens per second. Did you open the task manager and check that the GPU memory in use actually increases when you load and use the model?

Otherwise, try out Koboldcpp. It needs GGUF models instead of GPTQ, but it doesn't need a special fork. With cuBLAS enabled you should get good token speeds for a 13B model.
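The memory check suggested above can also be scripted instead of watching the task manager. This is only a minimal sketch, assuming the NVIDIA driver's `nvidia-smi` tool is on the PATH; the helper names `parse_gpu_memory` and `read_gpu_memory` are made up for illustration.

```python
import subprocess

# Ask nvidia-smi for used and total memory per GPU, as bare numbers in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (MiB) into a list of (used, total) tuples, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows

def read_gpu_memory():
    """Run nvidia-smi and return [(used_mib, total_mib), ...]. Requires an NVIDIA GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)
```

Calling `read_gpu_memory()` once before loading the model and once after should show the used figure jump by several GiB if the weights really landed on the GPU.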

[–] DustGrouchy1792@alien.top 1 points 2 years ago (1 children)

Can I get koboldcpp working with sillytavern without too much of a headache?

[–] StaplerGiraffe@alien.top 1 points 2 years ago

Sure, it provides the same API as KoboldAI.
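To see what "the same API" means in practice, here is a rough sketch of a direct call to the KoboldAI-style generate endpoint. It assumes koboldcpp is running locally on port 5001 (its usual default, but check your own launch output); the helper names are illustrative only.

```python
import json
import urllib.request

# Assumption: koboldcpp is listening locally on port 5001.
BASE_URL = "http://localhost:5001"

def build_generate_request(prompt, max_length=80, base_url=BASE_URL):
    """Build a POST request for the KoboldAI-style /api/v1/generate endpoint."""
    payload = {"prompt": prompt, "max_length": max_length}
    return urllib.request.Request(
        f"{base_url}/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, **kwargs):
    """Send the request and return the generated text from the first result."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["results"][0]["text"]
```

SillyTavern talks to the same endpoint, so pointing its KoboldAI connection at that base URL should be all the setup it needs.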

[–] DustGrouchy1792@alien.top 1 points 2 years ago

I'm now using a 4-bit GPTQ version of the same model. After generation completes, VRAM usage goes up to 16.2 GB (out of 24 GB), and as best I can tell nothing else is using the GPU (no browser windows with YouTube, etc.).

Still only getting a bit under 4.00 tokens per second. So I don't think stuff is getting offloaded to CPU.
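A back-of-envelope check on these numbers, as a sketch only. It assumes the weights dominate memory traffic, that each weight is read once per generated token, and the RTX 3090's spec-sheet bandwidth of 936 GB/s; quantization metadata, the KV cache, and kernel overhead are all ignored, so real speeds will land well below the ceiling.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (10^9 bytes), ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8

def bandwidth_bound_tps(model_gb, bandwidth_gb_s):
    """Rough ceiling on tokens/s if every weight is read once per generated token."""
    return bandwidth_gb_s / model_gb

RTX_3090_BANDWIDTH_GB_S = 936  # spec-sheet memory bandwidth

size_4bit = weight_gb(13, 4)  # 6.5 GB of weights
size_8bit = weight_gb(13, 8)  # 13.0 GB of weights
ceiling_4bit = bandwidth_bound_tps(size_4bit, RTX_3090_BANDWIDTH_GB_S)  # 144 tokens/s
```

Under those assumptions, a bit under 4 tokens per second is far below what the card's memory bandwidth allows, which points at the software path rather than the hardware.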

[–] AsliReddington@alien.top 1 points 2 years ago

Just run it on TGI or vLLM to get flash attention and continuous batching for parallel requests.
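A sketch of what parallel requests could look like against vLLM's OpenAI-compatible server. It assumes the server is already running locally on port 8000; `MODEL` is a placeholder that has to match whatever model the server was started with, and the helper names are illustrative.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumptions: a vLLM OpenAI-compatible server on localhost:8000,
# and a placeholder model name (hypothetical, replace with your own).
BASE_URL = "http://localhost:8000"
MODEL = "my-13b-model"

def build_completion_request(prompt, max_tokens=64):
    """Build a POST request for the OpenAI-style /v1/completions endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(prompt):
    """Send one prompt and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt)) as resp:
        return json.load(resp)["choices"][0]["text"]

def complete_many(prompts, workers=8):
    """Send prompts concurrently so the server can batch them together."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(complete, prompts))
```

Batching mainly raises aggregate throughput across simultaneous requests, so a single chat stream may not get noticeably faster from it.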