this post was submitted on 04 Dec 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.

top 6 comments
[–] fediverser@alien.top 1 points 11 months ago

This post is an automated archive from a submission made on /r/LocalLLaMA, powered by Fediverser software running on alien.top. Responses to this submission will not be seen by the original author until they claim ownership of their alien.top account. Please consider reaching out to them to let them know about this post and help them migrate to Lemmy.

Lemmy users: you are still very much encouraged to participate in the discussion. There are still many other subscribers on !localllama@poweruser.forum that can benefit from your contribution and join in the conversation.

Reddit users: you can also join the fediverse right away by visiting https://portal.alien.top. If you are looking for a Reddit alternative made for and by an independent community, check out Fediverser.

[–] AntoItaly@alien.top 1 points 11 months ago (1 children)

Wow, with this quantization method, Llama 70B weighs only 17.5GB!
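
That figure matches a quick back-of-the-envelope check (my own arithmetic, not a number from the paper; it ignores embeddings, codebooks, and other overhead):

params = 70e9               # roughly 70B weights
bits_per_weight = 2         # 2-bit quantization
print(params * bits_per_weight / 8 / 1e9)   # -> 17.5 (decimal GB)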

[–] iChrist@alien.top 1 points 11 months ago

Omg, how can I run it on a 3090?

[–] llama_in_sunglasses@alien.top 1 points 11 months ago (1 children)

To create quants of new models, one has to generate Hessians for them, and that uses several GB of RedPajama to calibrate them. Generating Hessians for Mistral is taking 17 minutes per LAYER on my 3090. I'll see if it can even finish later. Much later. That's over 16 hours just to quantize a 7B model, yikes.
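
For anyone curious what "generating Hessians" involves: in GPTQ/QuIP-style quantization, the per-layer "Hessian" is a proxy built from the second moment of that layer's inputs over the calibration set. A rough illustration of the idea, not QuIP#'s actual code (the layer width and batch counts below are made up):

import torch

d_in = 4096                        # hypothetical hidden size of one layer
H = torch.zeros(d_in, d_in)
n_tokens = 0
for _ in range(8):                 # stand-in for batches of RedPajama calibration activations
    x = torch.randn(128, d_in)     # [tokens, d_in] inputs captured at this layer
    H += x.T @ x
    n_tokens += x.shape[0]
H /= n_tokens                      # proxy Hessian E[x x^T], accumulated offline per layer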

The paper for this is one of the hardest reads I've had in years, full-on "I know some of these words." I didn't think 8-dimensional sphere packing was going to be in my attempted light reading for the night.

P.S.: Roll back to transformers 4.34.0, or edit the code in hessian_offline_llama.py and change all instances of

attention_mask = model.model._prepare_decoder_attention_mask(

to

attention_mask = _prepare_4d_causal_attention_mask(

and add the following import at the top of the same file:

from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask
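
If it helps, here is a minimal standalone sketch of what that swap amounts to (the tensor shapes are made up; the old call is left as a comment because the bound method was removed from newer transformers):

import torch
from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask

batch_size, seq_len, hidden = 1, 8, 32
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)
inputs_embeds = torch.zeros(batch_size, seq_len, hidden)
past_key_values_length = 0

# Old call (transformers <= 4.34.0, method later removed from LlamaModel):
# attention_mask = model.model._prepare_decoder_attention_mask(
#     attention_mask, (batch_size, seq_len), inputs_embeds, past_key_values_length)

# New standalone helper: same arguments minus the bound self, returns a 4D causal mask.
attention_mask = _prepare_4d_causal_attention_mask(
    attention_mask, (batch_size, seq_len), inputs_embeds, past_key_values_length
)
print(attention_mask.shape)   # expect torch.Size([1, 1, 8, 8])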
[–] llama_in_sunglasses@alien.top 1 points 11 months ago (1 children)

With Llama-2-70b-chat-E8P-2Bit from their zoo, quip# seems fairly promising. I'd have to try l2-70b-chat in exl2 at 2.4 bpw to compare, but this model does not really feel like a 2-bit model so far. I'm impressed.

[–] a_beautiful_rhind@alien.top 1 points 11 months ago

From the issue about this in the exllamav2 repo, quip was using more memory and running slower than exl. How much context can you fit?