this post was submitted on 28 Nov 2023

LocalLLaMA


Hi. I'm using Llama-2 for my project in Python with the transformers library. There is an option to quantize any regular model when loading it:

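from transformers import AutoModelForCausalLM

# Quantize the weights to 4-bit while loading (relies on bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-chat-hf",
    load_in_4bit=True,
)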

If it's just a matter of a single flag and nothing is recomputed, then why are there so many already-quantized models on the hub? Are they better than adding this one line?

[–] llama_in_sunglasses@alien.top 1 points 2 years ago

load_in_4bit quantizes the weights at load time, so loading a model takes a long time, and the result is poor in both speed and output quality.
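If you want to check the load-time and speed claims on your own hardware, a minimal timing sketch could look like the following. The helper names `timed` and `tokens_per_second` are made up for illustration and are not part of transformers; the commented usage assumes transformers, bitsandbytes, and a GPU are available.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn() once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def tokens_per_second(new_tokens: int, seconds: float) -> float:
    """Generation throughput; returns 0.0 if no time was measured."""
    return new_tokens / seconds if seconds > 0 else 0.0


# Usage sketch (not executed here; downloads the full checkpoint):
#   from transformers import AutoModelForCausalLM
#   model, load_s = timed(lambda: AutoModelForCausalLM.from_pretrained(
#       "meta-llama/Llama-2-13b-chat-hf", load_in_4bit=True))
#   print(f"load time: {load_s:.1f} s")
```

Running the same measurement against a pre-quantized checkpoint gives a like-for-like comparison; output quality needs a separate check, such as perplexity on a fixed text.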

I have compared a bunch of quant methods for Mistral, llama-7b, and orca2-13b at https://desync.xyz/ if you are interested.