Semi_Tech

joined 1 year ago
[–] Semi_Tech@alien.top 1 points 1 year ago (1 children)

Try a different version of the model.

What performance do you get with a GGUF Q4-Q6 quant on a single card?

[–] Semi_Tech@alien.top 1 points 1 year ago

Download koboldcpp and grab the GGUF version of any model you want, preferably a 7B from our pal TheBloke. Only get Q4 or higher quantization; Q6 is a bit slower but works well. In koboldcpp.exe, select CuBLAS and set the GPU layers to 35-40.
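If you'd rather do the same thing from a script instead of the koboldcpp GUI, here's a minimal sketch using the llama-cpp-python library, which wraps the same llama.cpp backend. The model filename below is just a placeholder for whatever Q4/Q5 GGUF you downloaded, and 35 GPU layers is the same ballpark as the slider setting above:

```python
from llama_cpp import Llama

# Load a quantized 7B GGUF, offloading ~35 layers to the GPU
# (roughly equivalent to setting 35 layers with CuBLAS in koboldcpp).
llm = Llama(
    model_path="mistral-7b-instruct.Q4_K_M.gguf",  # placeholder: your downloaded GGUF
    n_gpu_layers=35,   # number of transformer layers to offload to the GPU
    n_ctx=4096,        # context window size
)

# Run a quick completion to sanity-check generation speed.
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```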

You should get about 5 T/s or more.

From my testing, this is the simplest way to run LLMs.