this post was submitted on 14 Nov 2023

LocalLLaMA

Can someone please explain the quantisation method differences:
- AWQ
- GPTQ
- llama.cpp GGUF quantisation (sorry, I do not know the name of the quantisation technique)

As far as I can tell from my research, few inference backends support CPU inference of AWQ and GPTQ models, while GGUF quantisation (like Q4_K_M) is prevalent because it runs smoothly even on CPU.

So: what exactly is the difference between the above quantisation techniques?

Dead_Internet_Theory@alien.top 1 points 11 months ago

Yeah, EXL2 is awesome. It's kinda black magic how GPUs that were released way before ChatGPT was a twinkle in anyone's eyes can run something that can trade blows with it. I still don't get how fractional bpw is even possible. What the hell, 2.55 bits man 😂 how does it even run after that to any degree? It's magic, that's what it is.
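
For what it's worth, the fractional part is less mysterious once you treat bpw as an average. As I understand it, EXL2 mixes several integer bit widths within one model, so no single weight is ever stored in "2.55 bits"; the number is just total bits divided by total weights. Here is a rough sketch of that arithmetic. The function name `average_bpw`, the group size, the 16-bit scale and the 75/25 split are all made up for illustration and are not taken from any real quantiser:

```python
def average_bpw(parts, group_size=128, scale_bits=16):
    """Average bits per weight for a mix of integer bit widths.

    parts: list of (num_weights, bits) tuples, e.g. [(3072, 2), (1024, 4)].
    group_size and scale_bits are hypothetical: they model one stored
    scale per group of weights, which adds a small overhead.
    """
    total_weights = sum(n for n, _ in parts)
    payload_bits = sum(n * bits for n, bits in parts)
    # Ceiling division: number of groups needed for each part.
    groups = sum(-(-n // group_size) for n, _ in parts)
    overhead_bits = groups * scale_bits
    return (payload_bits + overhead_bits) / total_weights


# Illustrative split: 75% of weights at 2 bits, 25% at 4 bits.
mix = [(3072, 2), (1024, 4)]
print(average_bpw(mix, scale_bits=0))  # payload only
print(average_bpw(mix))                # payload plus per-group scales
```

So a figure like 2.55 would just mean most weights sit at a low width and a smaller share gets more bits, plus a bit of bookkeeping on top.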