LocalLLaMA

Why are there quantized models on the Hugging Face hub? (alien.top)

submitted 2 years ago by Motylde@alien.top to c/localllama@poweruser.forum

Hi. I'm using Llama-2 for my project in Python with the transformers library. There is an option to quantize any regular model at load time:

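from transformers import AutoModelForCausalLM

# Quantize the full-precision checkpoint to 4-bit while loading (uses bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    load_in_4bit=True,
)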

If it's just a matter of a single flag and nothing is recomputed, why are there so many pre-quantized models on the hub? Are they better than adding this one line?

metaprotium@alien.top 1 point 2 years ago

Most quantized models on the hub are quantized with GPTQ / AWQ and other techniques. These techniques are optimized for inference and are faster than load_in_4bit. load_in_4bit uses the bitsandbytes library and is more useful for training LoRAs on a limited amount of VRAM.
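
For intuition about what quantizing at load time involves, here is a minimal sketch of round-to-nearest absmax quantization on one block of weights. It is a deliberately simplified stand-in: bitsandbytes uses its own 4-bit data types, and GPTQ and AWQ additionally rely on calibration data, none of which is modelled here. The function names, the block of 64 values, and the random weights are made up for illustration.

```python
import random

def quantize_block(block, levels=7):
    """Absmax-quantize a list of floats to integers in [-levels, levels]."""
    absmax = max(abs(x) for x in block)
    # Fall back to a scale of 1.0 for an all-zero block to avoid dividing by zero.
    scale = absmax / levels if absmax > 0 else 1.0
    return [round(x / scale) for x in block], scale

def dequantize_block(q, scale):
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]

random.seed(0)
# Illustrative data only: 64 small random "weights".
weights = [random.gauss(0.0, 0.02) for _ in range(64)]

q, scale = quantize_block(weights)
restored = dequantize_block(q, scale)

# Rounding to the nearest level bounds the error by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.5f}, max abs error={max_err:.5f}")
```

Each weight is stored as a small integer plus one shared scale per block, which is why no retraining or recomputation is needed for this kind of scheme, and also why the rounding error shows up directly in the restored weights.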

vasileer@alien.top 1 point 2 years ago

File size, which affects load time:

With load_in_4bit, it downloads and parses the full-precision file (4x bigger than the 4-bit weights if it is bfloat16, 8x bigger if it is float32) and then quantizes on the fly.

With pre-quantized files, it downloads only the quantized weights, so expect roughly 4x to 8x faster load times for 4-bit quants.
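
To put rough numbers on those ratios, here is a back-of-the-envelope sketch. It assumes a round 13 billion parameters for the 13B model from the question and ignores quantization metadata (per-block scales and similar) and any layers kept in higher precision, so real files will differ somewhat. `weights_size_gb` is a made-up helper name.

```python
# Rough checkpoint size estimate, not a measurement of any real file.
N_PARAMS = 13e9  # assumption: a round number for a "13B" model

def weights_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

sizes = {
    "float32": weights_size_gb(N_PARAMS, 32),
    "bfloat16": weights_size_gb(N_PARAMS, 16),
    "4-bit": weights_size_gb(N_PARAMS, 4),
}

for name, gb in sizes.items():
    ratio = gb / sizes["4-bit"]
    print(f"{name:>8}: {gb:5.1f} GB ({ratio:.0f}x the 4-bit size)")
```

Under these assumptions the download is about 52 GB in float32, 26 GB in bfloat16, and 6.5 GB at 4 bits, which is where the 4x to 8x figure comes from.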

llama_in_sunglasses@alien.top 1 point 2 years ago

load_in_4bit takes a long time to load a model, and its performance is poor in both speed and output quality.

I have compared a bunch of quant methods at https://desync.xyz/ for Mistral, llama-7b, orca2-13b if you are interested.

mcmoose1900@alien.top 1 point 2 years ago

Many reasons:

AutoModelForCausalLM is extremely slow compared to other backends/quantizations, even with augmentations like BetterTransformer.
It also uses much more VRAM than other quantizations, especially at high context.
Its size is inflexible.
It loads slower.
It has no CPU offloading.
It's potentially lower quality than other quantizations at the same bpw.
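
For readers unfamiliar with the term, bpw means bits per weight. The sketch below shows one common way to think about it: the nominal bit width plus per-block metadata spread across the block. The block sizes and the 16-bit metadata figure are hypothetical values picked to show the arithmetic, not the layout of any particular format, and `effective_bpw` is a made-up helper name.

```python
def effective_bpw(weight_bits: int, block_size: int, metadata_bits: int) -> float:
    """Bits per weight once per-block metadata is spread over the block."""
    return weight_bits + metadata_bits / block_size

# Hypothetical layouts, chosen only to illustrate the calculation.
small_blocks = effective_bpw(weight_bits=4, block_size=32, metadata_bits=16)
large_blocks = effective_bpw(weight_bits=4, block_size=128, metadata_bits=16)

print(small_blocks)  # 4.5
print(large_blocks)  # 4.125
```

Two "4-bit" formats can therefore have different real sizes, which is why comparisons between quantization methods are usually made at equal bpw rather than equal nominal bit width.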