I use KoboldCpp for local LLM deployment. It's clean, it's easy, and it supports sliding context. It also exposes a drop-in replacement for the OpenAI API.
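For reference, a minimal sketch of hitting that endpoint with the official `openai` Python client. This assumes KoboldCpp is running on its default port 5001 and that the `model` field is just a placeholder for a locally loaded model:

```python
# Minimal sketch: talking to a local KoboldCpp instance through its
# OpenAI-compatible endpoint. Port 5001 is KoboldCpp's default; adjust
# base_url if you launched it with a different port.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5001/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local-model",  # placeholder; the server uses whatever model it loaded
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```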
No-Belt7582
Most of the time the issue is with the prompt template, especially the spacing: `###instruction` vs. `### instruction`, etc.
Smaller models need a good prompt. I tried the newer Mistral 2.5 7B and prompts work superbly on it.
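To illustrate the spacing point, here is a hypothetical helper that builds an Alpaca-style prompt with exact section markers. The header strings are assumptions for illustration; check your model's card for the actual template it was trained on:

```python
def build_prompt(instruction: str, user_input: str = "") -> str:
    # Note the exact "### Instruction:" spelling. A missing space or colon
    # ("###Instruction") can noticeably degrade a small model's output.
    prompt = f"### Instruction:\n{instruction}\n"
    if user_input:
        prompt += f"### Input:\n{user_input}\n"
    prompt += "### Response:\n"
    return prompt

print(build_prompt("Summarize the following text.", "LLMs are..."))
```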
Implement the post-processing as a PyTorch model.
Then create a super model that links both models (the model itself and the post-processing), and export this super model. See the sketch below.
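A minimal sketch of that pattern, with a stand-in `nn.Linear` for the real model and a made-up softmax/argmax step as the post-processing (both are assumptions for illustration):

```python
import torch
import torch.nn as nn

class PostProcess(nn.Module):
    """Hypothetical post-processing step: softmax then top-1 class index."""
    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        probs = torch.softmax(logits, dim=-1)
        return torch.argmax(probs, dim=-1)

class SuperModel(nn.Module):
    """Links the base model and the post-processing into a single graph."""
    def __init__(self, base_model: nn.Module):
        super().__init__()
        self.base = base_model
        self.post = PostProcess()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.post(self.base(x))

base = nn.Linear(16, 4)  # stand-in for the actual trained model
model = SuperModel(base).eval()

# Export the combined graph so the post-processing ships with the model.
example = torch.randn(1, 16)
torch.onnx.export(model, example, "super_model.onnx")
```

The point of the wrapper is that whoever consumes the exported file gets the post-processing for free, instead of having to reimplement it at inference time.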
How are you serving your GPTQ models?
You are famous everywhere for those comparisons.