No-Belt7582

joined 1 year ago
[–] No-Belt7582@alien.top 1 points 11 months ago

You are famous everywhere for those comparisons.

[–] No-Belt7582@alien.top 1 points 11 months ago

I use KoboldCpp for local LLM deployment. It's clean, it's easy, and it allows for sliding context. It also exposes a drop-in replacement for the OpenAI API, so existing clients can talk to it.
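
For reference, a minimal sketch of talking to KoboldCpp through its OpenAI-compatible endpoint, assuming it is running locally on its default port 5001; the base URL and model name are placeholders for your own setup:

```python
# Minimal sketch: point the OpenAI client at a local KoboldCpp server.
# Assumes KoboldCpp is serving its OpenAI-compatible endpoint on the
# default port 5001; adjust base_url and the model name for your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5001/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder name; the local server decides the model
    messages=[{"role": "user", "content": "Summarise sliding context in one line."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```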

[–] No-Belt7582@alien.top 1 points 11 months ago

Most of the time the issue is with the prompt template, especially with the spacing: "###instruction" vs "### instruction", etc.

Smaller models need a good prompt. I tried the newer Mistral 2.5 7B and prompts work superbly on it.
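
As an illustration, a minimal sketch of an Alpaca-style prompt builder where the exact header spacing matters; the template text and header names here are one common convention, not specific to any particular model:

```python
# Minimal sketch of an Alpaca-style prompt builder. The exact header
# strings (including the space after "###") must match what the model
# was fine-tuned on; the wording below is illustrative.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n"
)

def build_prompt(instruction: str) -> str:
    # "###Instruction:" (no space) is a common mistake that degrades output.
    return ALPACA_TEMPLATE.format(instruction=instruction)

print(build_prompt("List three uses of quantisation."))
```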

[–] No-Belt7582@alien.top 1 points 11 months ago

Implement the post-processing as a PyTorch module.

Then create a "super" model that links both (the model itself and the post-processing) and export this super model.
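
A minimal sketch of that wrapping approach, assuming a toy base model, a softmax/argmax post-processing step (both placeholders), and ONNX as the export target:

```python
# Minimal sketch: wrap a model and its post-processing in one nn.Module
# and export them together. The base model, post-processing step, and
# input shape are placeholders for illustration.
import torch
import torch.nn as nn

class PostProcess(nn.Module):
    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # Example post-processing: softmax + argmax over the class dim.
        return torch.softmax(logits, dim=-1).argmax(dim=-1)

class SuperModel(nn.Module):
    def __init__(self, base: nn.Module, post: nn.Module):
        super().__init__()
        self.base = base
        self.post = post

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.post(self.base(x))

base = nn.Linear(16, 4)  # stand-in for the real model
wrapper = SuperModel(base, PostProcess()).eval()

dummy = torch.randn(1, 16)
torch.onnx.export(wrapper, dummy, "model_with_postprocess.onnx",
                  input_names=["input"], output_names=["prediction"])
```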

[–] No-Belt7582@alien.top 1 points 11 months ago (1 children)

How are you serving your GPTQ models?

 

Can someone please explain the differences between these quantisation methods:
- AWQ
- GPTQ
- llama.cpp GGUF quantisation (sorry, I do not know the quantisation technique's name)

As far as I have researched, only a limited number of AI backends support CPU inference of AWQ and GPTQ models, while GGUF quantisation (like Q4_K_M) is prevalent because it runs smoothly even on CPU.

So: what exactly is the quantisation difference between the above techniques?
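
For context on the CPU point, a minimal sketch of running a GGUF Q4_K_M model on CPU via llama-cpp-python; the model path and filename are hypothetical:

```python
# Minimal sketch of CPU inference with a GGUF quantised model via
# llama-cpp-python. The model path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical file
    n_ctx=4096,      # context window
    n_gpu_layers=0,  # 0 = pure CPU inference
)

out = llm("### Instruction:\nExplain GGUF in one sentence.\n\n### Response:\n",
          max_tokens=64)
print(out["choices"][0]["text"])
```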