this post was submitted on 14 Nov 2023
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
Among other things, GPTQ, GGUF's K-quants, and bitsandbytes FP4 are relatively "easy" quantization methods. Not to discount them... they're very sophisticated, but models can be quantized very quickly with them.
EXL2 and AWQ are much more intense. You feed them calibration data, text you want used as a reference so the quantization gets optimized toward it. The quantization takes forever and requires a lot of GPU, but the quantized weights you get out of them are very VRAM efficient.
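The core idea behind that calibration step can be sketched roughly like this: instead of picking a quantization scale from the weights alone (plain min-max), you weight the rounding error by how large the activations are on the calibration data, so channels that matter more get quantized more carefully. This is a toy illustration, not the actual EXL2 or AWQ algorithm; the functions and the grid search here are made up for the example.

```python
import random

def quantize(weights, scale, bits=4):
    # Round-to-nearest onto a symmetric signed integer grid, then dequantize.
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def calib_error(weights, scale, importance):
    # Importance-weighted squared error: channels that see large
    # activations on the calibration text count for more.
    deq = quantize(weights, scale)
    return sum(imp * (w - d) ** 2 for w, d, imp in zip(weights, deq, importance))

def best_scale(weights, importance, bits=4, steps=50):
    # Grid-search the scale that minimizes error on the calibration data.
    # The min-max scale is included as the last candidate, so the result
    # is never worse than the naive choice.
    base = max(abs(w) for w in weights) / (2 ** (bits - 1) - 1)
    candidates = [base * (0.5 + 0.01 * i) for i in range(steps + 1)]
    return min(candidates, key=lambda s: calib_error(weights, s, importance))

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(64)]
# Stand-in for mean squared activation per channel, as measured on calibration text.
importance = [random.uniform(0.1, 10.0) for _ in range(64)]

naive = max(abs(w) for w in weights) / 7   # plain min-max scale for 4-bit
tuned = best_scale(weights, importance)
print(calib_error(weights, tuned, importance) <= calib_error(weights, naive, importance))
```

That search over candidate scales is one reason calibration-based quantization is slow: the real methods do something like this per group, across every layer, with real activations flowing through the model.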
Yeah, EXL2 is awesome. It's kinda black magic how GPUs that were released way before ChatGPT was a twinkle in anyone's eye can run something that can trade blows with it. I still don't get how fractional bpw is even possible. What the hell, 2.55 bits, man. How does it even run at that precision? It's magic, that's what it is.
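For what it's worth, a fractional bpw like 2.55 isn't a fractional-bit datatype: it's an average. Different weight groups get different integer bit widths (more sensitive ones kept at higher precision), and the mean over all weights lands on a fractional number. The layer names, sizes, and bit assignments below are entirely made up to show the arithmetic:

```python
# Hypothetical per-tensor bit assignments: a mix of 2-, 3-, and 4-bit weights.
layers = [
    ("attn.q_proj", 4096 * 4096,  3),
    ("attn.k_proj", 4096 * 4096,  2),
    ("attn.v_proj", 4096 * 4096,  4),  # kept at higher precision (assumed)
    ("mlp.up",      4096 * 11008, 2),
    ("mlp.down",    11008 * 4096, 3),
]

total_bits = sum(n * bits for _, n, bits in layers)
total_weights = sum(n for _, n, _ in layers)
print(f"average bpw: {total_bits / total_weights:.2f}")  # prints "average bpw: 2.68"
```

Real quantizers also store per-group scales and other metadata, which adds a small fraction on top, so the advertised bpw is the effective bits per weight after all that overhead.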