henk717

joined 10 months ago
[–] henk717@alien.top 1 points 9 months ago

Tried a 13B model with Koboldcpp on one of the RunPod A100s; its Q4 and FP16 speeds both clocked in at around 20 T/s at 4K context, topping out at 60 T/s for smaller generations.
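If you want to reproduce a rough T/s number yourself, something like this works against a running Koboldcpp instance (default port 5001; /api/v1/generate is the KoboldAI endpoint Koboldcpp serves). It's a minimal sketch: the prompt is a placeholder and the token count is taken from max_length, so treat the result as an approximation.

```python
import time
import requests

URL = "http://localhost:5001/api/v1/generate"  # default Koboldcpp port
payload = {
    "prompt": "Once upon a time",
    "max_length": 512,             # tokens to generate
    "max_context_length": 4096,    # the 4K context from the test
}

start = time.time()
r = requests.post(URL, json=payload, timeout=600)
r.raise_for_status()
elapsed = time.time() - start

print(r.json()["results"][0]["text"][:80])
# Approximate: assumes the full max_length was generated.
print(f"~{payload['max_length'] / elapsed:.1f} T/s")
```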

[–] henk717@alien.top 1 points 9 months ago

Koboldcpp, which he is already using, is a better fit due to its superior context shifting.

[–] henk717@alien.top 1 points 9 months ago

With Q4_K_S MMQ it should be possible to do a full offload of a 13B model. I'm not sure if you can fully fit 4K context, since that's a tight call, but it's definitely worth a try.
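A launch along these lines should do it; this is a sketch, not an exact recipe (the model filename is a placeholder, and flag spellings can vary between Koboldcpp builds, so check --help for yours):

```python
import subprocess

# Full offload of a 13B Q4_K_S model with MMQ kernels and 4K context.
subprocess.run([
    "python", "koboldcpp.py",
    "--model", "llama2-13b.Q4_K_S.gguf",  # placeholder filename
    "--usecublas", "mmq",                 # CUDA backend with MMQ kernels
    "--gpulayers", "41",                  # 40 layers + output = full offload for 13B
    "--contextsize", "4096",
])
```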

[–] henk717@alien.top 1 points 10 months ago

I'd go the Koboldcpp route instead because it's portable for them, so it's much simpler to install and use. Koboldcpp has API documentation available if you add /api to a working link (or you can just check it here). If you already built your integration against the OpenAI-compatible API, it supports that too. A minimal sketch of both APIs is below (default port 5001; prompt and model values are placeholders, and Koboldcpp serves whatever model it was launched with regardless of the model field):
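```python
import requests

BASE = "http://localhost:5001"  # default Koboldcpp address

# Native KoboldAI API (documented under /api on a running instance)
r = requests.post(f"{BASE}/api/v1/generate",
                  json={"prompt": "Hello,", "max_length": 64})
print(r.json()["results"][0]["text"])

# OpenAI-compatible API
r = requests.post(f"{BASE}/v1/chat/completions",
                  json={"model": "koboldcpp",  # name is not used for routing
                        "messages": [{"role": "user", "content": "Hello!"}],
                        "max_tokens": 64})
print(r.json()["choices"][0]["message"]["content"])
```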