this post was submitted on 28 Nov 2023
LocalLLaMA
Many reasons:
AutoModelForCausalLM is extremely slow compared to other backends and quantization formats, even with augmentations like BetterTransformer.
It also uses much more VRAM than other quantization methods, especially at high context.
Its size is inflexible.
It loads more slowly.
It has no CPU offloading.
It's potentially lower quality than other quantization methods at the same bpw (bits per weight).
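
To put rough numbers on the VRAM and bpw points, here is a back-of-the-envelope sketch. It is not a measurement of any backend. The model shape (7B parameters, 32 layers, 32 KV heads, head dimension 128) and the fp16 cache are assumptions chosen for illustration, and the function names are made up for this example. It ignores activations and framework overhead, so real usage will be higher.

```python
# Rough VRAM arithmetic. All model dimensions below are illustrative assumptions.

GIB = 1024 ** 3


def weight_bytes(n_params: int, bpw: float) -> float:
    """Bytes needed to store the weights at a given bits-per-weight (bpw)."""
    return n_params * bpw / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_elem: int = 2,
                   batch: int = 1) -> int:
    """Bytes for the KV cache: one key and one value vector per layer,
    per KV head, per token. bytes_per_elem=2 assumes an fp16 cache."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_elem * batch)


if __name__ == "__main__":
    n_params = 7_000_000_000  # assumed 7B-parameter model
    for bpw in (16, 8, 4):
        print(f"weights at {bpw} bpw: {weight_bytes(n_params, bpw) / GIB:.2f} GiB")
    for ctx in (4096, 16384):
        size = kv_cache_bytes(32, 32, 128, ctx)
        print(f"KV cache at {ctx} tokens: {size / GIB:.2f} GiB")
```

The weights scale with bpw, but the cache scales with context length regardless of how the weights are quantized, which is why the gap between backends tends to show up most at long context.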