No-Belt7582

joined 1 year ago
[–] No-Belt7582@alien.top 1 points 11 months ago

You are famous everywhere for those comparisons.

[–] No-Belt7582@alien.top 1 points 11 months ago

I use KoboldCpp for local LLM deployment. It's clean, it's easy, and it allows for sliding context. It also exposes a drop-in replacement for the OpenAI API, so existing clients can talk to it.
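
For reference, a minimal sketch of talking to KoboldCpp through its OpenAI-compatible endpoint, assuming it is running locally on its default port 5001; the base URL and model name are placeholders for your own setup:

```python
# Minimal sketch: point the OpenAI client at a local KoboldCpp server.
# Assumes KoboldCpp is serving its OpenAI-compatible endpoint on the
# default port 5001; adjust base_url and the model name for your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5001/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder name; the local server decides the model
    messages=[{"role": "user", "content": "Summarise sliding context in one line."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```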

[–] No-Belt7582@alien.top 1 points 11 months ago

Most of the time the issue is with the prompt template, especially with the spacing: "###instruction" vs "### instruction", etc.

Smaller models need a good prompt. I tried the newer Mistral 2.5 7B and prompts work superbly on it.
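
As an illustration, a minimal sketch of an Alpaca-style prompt builder where the exact header spacing matters; the template text and header names here are one common convention, not specific to any particular model:

```python
# Minimal sketch of an Alpaca-style prompt builder. The exact header
# strings (including the space after "###") must match what the model
# was fine-tuned on; the wording below is illustrative.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def build_prompt(instruction: str) -> str:
    # "###Instruction:" (no space) is a common mistake that degrades output.
    return ALPACA_TEMPLATE.format(instruction=instruction)

print(build_prompt("List three uses of quantisation."))
```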

[–] No-Belt7582@alien.top 1 points 11 months ago

Implement the post-processing as a PyTorch module.

Then create a "super" model that links both (the model itself and the post-processing) and export this super model.
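
A minimal sketch of that wrapping approach, assuming a toy base model, a softmax/argmax post-processing step (both placeholders), and ONNX as the export target:

```python
# Minimal sketch: wrap a model and its post-processing in one nn.Module
# and export them together. The base model, post-processing step, and
# input shape are placeholders for illustration.
import torch
import torch.nn as nn

class PostProcess(nn.Module):
    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # Example post-processing: softmax + argmax over the class dim.
        return torch.softmax(logits, dim=-1).argmax(dim=-1)

class SuperModel(nn.Module):
    def __init__(self, base: nn.Module, post: nn.Module):
        super().__init__()
        self.base = base
        self.post = post

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.post(self.base(x))

base = nn.Linear(16, 4)  # stand-in for the real model
wrapper = SuperModel(base, PostProcess()).eval()

dummy = torch.randn(1, 16)
torch.onnx.export(wrapper, dummy, "model_with_postprocess.onnx",
                  input_names=["input"], output_names=["prediction"])
```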

[–] No-Belt7582@alien.top 1 points 11 months ago (1 children)

How are you serving your GPTQ models?

 

Can someone please explain the differences between these quantisation methods:
- AWQ
- GPTQ
- llama.cpp GGUF quantisation (sorry, I do not know the quantisation technique's name)

As far as I have researched, only a limited number of AI backends support CPU inference of AWQ and GPTQ models, while GGUF quantisation (like Q4_K_M) is prevalent because it runs smoothly even on CPU.

So: what exactly is the quantisation difference between the above techniques?
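
For context on the CPU point, a minimal sketch of running a GGUF Q4_K_M model on CPU via llama-cpp-python; the model path and filename are hypothetical:

```python
# Minimal sketch of CPU inference with a GGUF quantised model via
# llama-cpp-python. The model path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical file
    n_ctx=4096,      # context window
    n_gpu_layers=0,  # 0 = pure CPU inference
)

out = llm("### Instruction:\nExplain GGUF in one sentence.\n\n### Response:\n",
          max_tokens=64)
print(out["choices"][0]["text"])
```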