this post was submitted on 09 Nov 2023

LocalLLaMA

Community for discussing Llama, the family of large language models created by Meta AI.

This question is probably too basic, but how do I load the Llama 2 70B model with 8-bit quantization? I see TheBlokeLlama2_70B_chat_GPTQ, but it only offers 3-bit/4-bit quantizations. I have an 80 GB A100 and am trying to load Llama 2 70B with 8-bit quantization. Thanks a lot!

top 3 comments
[–] vec1nu@alien.top 1 points 1 year ago (1 children)

I haven't used GPTQ in a while, but I can say that GGUF has 8-bit quantization, which you can use with llama.cpp. Furthermore, if you use the original Hugging Face models, the ones you load with the transformers loader, you have options there to load in either 8-bit or 4-bit.
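
A minimal sketch of the GGUF route via the llama-cpp-python bindings, assuming you have already downloaded an 8-bit (Q8_0) GGUF of the 70B chat model; the file name here is a placeholder, not a specific release:

    # pip install llama-cpp-python (built with CUDA support for GPU offload)
    from llama_cpp import Llama

    # Path to a locally downloaded Q8_0 GGUF file -- placeholder name
    llm = Llama(
        model_path="./llama-2-70b-chat.Q8_0.gguf",
        n_gpu_layers=-1,  # offload all layers to the GPU
        n_ctx=4096,       # context window size
    )

    out = llm("Q: What is 8-bit quantization? A:", max_tokens=64)
    print(out["choices"][0]["text"])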

[–] peterwu00@alien.top 1 points 1 year ago
[–] mcmoose1900@alien.top 1 points 1 year ago

Grab the original (fp16) models; they can be quantized to 8-bit on the fly with bitsandbytes when you load them.
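
A minimal sketch of this on-the-fly approach with transformers and bitsandbytes; the model id is the standard gated Hugging Face repo, and fitting 70B at 8-bit on a single 80 GB A100 should work for short contexts, though that headroom estimate is an assumption:

    # pip install transformers accelerate bitsandbytes
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "meta-llama/Llama-2-70b-chat-hf"  # gated repo; requires accepting Meta's license

    # Quantize the fp16 weights to 8-bit at load time
    bnb_config = BitsAndBytesConfig(load_in_8bit=True)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",  # place layers on the available GPU(s)
    )

    inputs = tokenizer("What is 8-bit quantization?", return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))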