LocalLLaMA

Quantisation techniques difference? (alien.top)

submitted 2 years ago by No-Belt7582@alien.top to c/localllama@poweruser.forum

Can someone please explain the differences between these quantisation methods:
- AWQ
- GPTQ
- llama.cpp GGUF quantisation (sorry, I do not know the name of the quantisation technique)

As far as I have researched, few AI backends support CPU inference of AWQ and GPTQ models, while GGUF quantisation (like Q4_K_M) is prevalent because it runs smoothly even on CPU.

So:
What exactly is the difference between the above quantisation techniques?

[–] Dead_Internet_Theory@alien.top 1 points 2 years ago

Yeah, EXL2 is awesome. It's kinda black magic how GPUs that were released way before ChatGPT was a twinkle in anyone's eye can run something that can trade blows with it. I still don't get how fractional bpw is even possible. What the hell, 2.55 bits, man 😂 how does it even run after that to any degree? It's magic, that's what it is.
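On the fractional bpw point: as far as I understand it, no single weight is stored in 2.55 bits. EXL2 mixes integer bit widths across different parts of the model, and the headline number is the average. Here is a minimal Python sketch of that arithmetic. The `average_bpw` helper and the 45/55 split are made up for illustration and are not taken from a real EXL2 model or from the exllamav2 code, and the sketch ignores the extra bits real formats spend on scales and other metadata.

```python
def average_bpw(groups):
    """Return the weighted mean bits per weight.

    groups: list of (num_weights, bits) pairs, one per chunk of the model
    that is quantised at a given integer bit width.
    """
    total_weights = sum(n for n, _ in groups)
    total_bits = sum(n * b for n, b in groups)
    return total_bits / total_weights


# Hypothetical mix (illustrative numbers only): every weight still uses a
# whole number of bits, but the average across the model is fractional.
mix = [
    (450_000, 2),  # 45% of the weights stored at 2 bits
    (550_000, 3),  # 55% of the weights stored at 3 bits
]

print(average_bpw(mix))  # 2.55
```

So a "2.55 bpw" model would plausibly be one where the less sensitive weights get fewer bits and the more sensitive ones get more, with the mix chosen to hit the target average.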