Hi. I'm using Llama-2 for my project in Python with the transformers library. There is an option to enable 4-bit quantization on any normal model:

from transformers import AutoModelForCausalLM

# Load the model with on-the-fly 4-bit quantization (bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    load_in_4bit=True,
)

If it's just a matter of a single flag, and nothing is recomputed, why are there so many already-quantized models on the Hub? Are they better than adding this one line?
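For context, here is the same load written with an explicit BitsAndBytesConfig, which as far as I can tell is just the more verbose form of the flag above; the NF4 quant type and the bfloat16 compute dtype are settings I picked for the sketch, not something the flag forces on you:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Same on-the-fly 4-bit load, with the quantization settings spelled out
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 4-bit data type (as used by QLoRA)
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the actual matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)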

metaprotium@alien.top 1 points 9 months ago

Most quantized models on the Hub are quantized with GPTQ, AWQ, and similar techniques. Those methods run a calibration pass ahead of time and save the result as a new checkpoint, which is why people upload them; their kernels are optimized for inference and are generally faster than load_in_4bit. load_in_4bit uses the bitsandbytes library, which quantizes the weights on the fly as the model loads, and is more useful for training LoRAs (QLoRA-style) on a limited amount of VRAM.
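A minimal sketch of the inference-oriented path, assuming a pre-quantized repo such as TheBloke/Llama-2-13B-chat-GPTQ (used here purely as an example) and that a GPTQ backend like optimum + auto-gptq is installed; the quantization config ships inside the checkpoint, so no extra flag is needed at load time:

from transformers import AutoModelForCausalLM, AutoTokenizer

# The quantized weights and their quantization config are stored in the repo,
# so a plain from_pretrained call loads the 4-bit GPTQ model directly.
model_id = "TheBloke/Llama-2-13B-chat-GPTQ"  # assumed example repo
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

The trade-off is that you skip re-quantizing the weights on every load and get the faster inference kernels, but you are locked to whatever quantization settings the uploader chose.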