henk717

joined 10 months ago
[–] henk717@alien.top 1 points 9 months ago

Tried a 13B model with Koboldcpp on one of the RunPod A100s; its Q4 and FP16 speeds both clocked in at around 20 T/s at 4K context, topping out at 60 T/s for smaller generations.
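If you want to reproduce a rough T/s number yourself, something like this works against a running Koboldcpp instance (default port 5001; /api/v1/generate is the KoboldAI endpoint Koboldcpp serves). It's a minimal sketch: the prompt is a placeholder and the token count is taken from max_length, so treat the result as an approximation.

```python
import time
import requests

URL = "http://localhost:5001/api/v1/generate"  # default Koboldcpp port
payload = {
    "prompt": "Once upon a time",
    "max_length": 512,             # tokens to generate
    "max_context_length": 4096,    # the 4K context from the test
}

start = time.time()
r = requests.post(URL, json=payload, timeout=600)
r.raise_for_status()
elapsed = time.time() - start

print(r.json()["results"][0]["text"][:80])
# Approximate: assumes the full max_length was generated.
print(f"~{payload['max_length'] / elapsed:.1f} T/s")
```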

[–] henk717@alien.top 1 points 9 months ago

Koboldcpp, which he is already using, is a better fit due to its superior context shifting.

[–] henk717@alien.top 1 points 9 months ago

With Q4_K_S MMQ it should be possible to do a full offload of a 13B model. I'm not sure if you can fully fit 4K context, since that's a tight call, but it's definitely worth a try.
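A launch along these lines should do it; this is a sketch, not an exact recipe (the model filename is a placeholder, and flag spellings can vary between Koboldcpp builds, so check --help for yours):

```python
import subprocess

# Full offload of a 13B Q4_K_S model with MMQ kernels and 4K context.
subprocess.run([
    "python", "koboldcpp.py",
    "--model", "llama2-13b.Q4_K_S.gguf",  # placeholder filename
    "--usecublas", "mmq",                 # CUDA backend with MMQ kernels
    "--gpulayers", "41",                  # 40 layers + output = full offload for 13B
    "--contextsize", "4096",
])
```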

[–] henk717@alien.top 1 points 10 months ago

I'd go the Koboldcpp route instead because it's portable for them, so it's much simpler to install and use. Koboldcpp has API documentation available if you add /api to a working link (or you can just check it here). If you already built your integration against the OpenAI-compatible API, it supports that too. A minimal sketch of both APIs is below (default port 5001; prompt and model values are placeholders, and Koboldcpp serves whatever model it was launched with regardless of the model field):
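```python
import requests

BASE = "http://localhost:5001"  # default Koboldcpp address

# Native KoboldAI API (documented under /api on a running instance)
r = requests.post(f"{BASE}/api/v1/generate",
                  json={"prompt": "Hello,", "max_length": 64})
print(r.json()["results"][0]["text"])

# OpenAI-compatible API
r = requests.post(f"{BASE}/v1/chat/completions",
                  json={"model": "koboldcpp",  # name is not used for routing
                        "messages": [{"role": "user", "content": "Hello!"}],
                        "max_tokens": 64})
print(r.json()["choices"][0]["message"]["content"])
```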