LocalLLaMA

A community for discussing Llama, the family of large language models created by Meta AI.

Models Megathread #2 - What models are you currently using? (alien.top)

submitted 2 years ago by Technical_Leather949@alien.top to c/localllama@poweruser.forum

As requested, this is the subreddit's second megathread for model discussion. This thread will now be hosted at least once a month to keep the discussion updated and help reduce identical posts.

I also saw that we hit 80,000 members recently! Thanks to every member for joining and making this happen.

Welcome to the r/LocalLLaMA Models Megathread

What models are you currently using and why? Do you use 7B, 13B, 33B, 34B, or 70B? Share any and all recommendations you have!

Examples of popular categories:

Assistant chatting
Chatting
Coding
Language-specific
Misc. professional use
Role-playing
Storytelling
Visual instruction

Have feedback or suggestions for other discussion topics? All suggestions are appreciated and can be sent to modmail.

P.S. LocalLLaMA is looking for someone who can manage Discord. If you have experience modding Discord servers, your help would be welcome. Send a message if interested.

Previous Thread | New Models

[–] CasimirsBlake@alien.top 1 points 2 years ago (4 children)

A few folks mentioning EXL2 here. Is this now the preferred Exllama format over GPTQ?

[–] sophosympatheia@alien.top 1 points 2 years ago (1 children)

EXL2 runs fast, and its quantization process does something similar to the K_M quants for GGUF models. Instead of quantizing every slice of the model to the same bits per weight (bpw), it determines which slices are more important and uses a higher bpw for those and a lower bpw for the less important slices, where the effects of quantization won't matter as much. The result is that the average bpw across the whole model works out to what you specified, say 4.0 bpw, but the quality hit is less severe than that level of quantization would suggest, because the important slices are maybe 5.0 or 5.5 bpw, something like that.

In short, EXL2 quants tend to punch above their weight class due to some fancy logic going on behind the scenes.
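To make the "average works out to what you specified" idea concrete, here is a toy sketch. It is not ExLlamaV2's actual algorithm: the importance scores, the list of bpw options, the diminishing-returns rule, and the `allocate_bpw` helper are all made up for illustration, and it assumes every slice has the same number of weights.

```python
def allocate_bpw(importance, target_bpw, options=(2.0, 3.0, 4.0, 5.0, 6.0)):
    """Toy allocator: hand out bits greedily so the mean bpw stays at or
    below target_bpw. Assumes equal-sized slices (an illustrative
    simplification, not how a real quantizer measures things)."""
    n = len(importance)
    levels = [0] * n                       # index into `options` per slice
    budget = (target_bpw - options[0]) * n  # spare bits above the minimum

    while True:
        best = None
        for i in range(n):
            if levels[i] + 1 >= len(options):
                continue                   # slice is already at the top option
            cost = options[levels[i] + 1] - options[levels[i]]
            if cost > budget + 1e-9:
                continue                   # cannot afford this upgrade
            # Hypothetical diminishing returns: each extra upgrade of the
            # same slice is worth half as much as the previous one.
            payoff = importance[i] / (2 ** levels[i]) / cost
            if best is None or payoff > best[0]:
                best = (payoff, i, cost)
        if best is None:
            break                          # nothing affordable is left
        _, i, cost = best
        levels[i] += 1
        budget -= cost

    return [options[level] for level in levels]


# Four slices with made-up importance scores, 4.0 bpw requested overall.
bpw = allocate_bpw([8.0, 1.0, 4.0, 0.5], target_bpw=4.0)
print(bpw, "mean:", sum(bpw) / len(bpw))
```

The important slices end up above 4.0 bpw and the unimportant ones below it, while the mean stays at the requested 4.0.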

[–] CasimirsBlake@alien.top 1 points 2 years ago

Thank you! I'm reminded of the variable bit rate encoding used in various audio and video formats; this sounds not dissimilar.

[–] TheMightyCatt@alien.top 1 points 2 years ago

EXL2 provides more options and, as far as I know, has a smaller quality decrease.

[–] mcmoose1900@alien.top 1 points 2 years ago

In addition to what others said, exl2 is very sensitive to the quantization dataset, which it uses to choose where to assign those "variable" bits.

Most online quants use wikitext. But I believe if you quantize models yourself on your own chats, you can get better results, especially below 4bpw.
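A toy illustration of why the calibration data matters, under loud assumptions: the activation numbers are invented, and "mean squared activation per slice" is only a stand-in proxy for importance, not the measurement ExLlamaV2 actually performs. The helper names `slice_importance` and `ranking` are hypothetical.

```python
def slice_importance(calibration_rows, num_slices):
    """Mean squared activation per slice over a calibration set.
    A stand-in proxy for 'importance', used only for illustration."""
    totals = [0.0] * num_slices
    for row in calibration_rows:
        for i, x in enumerate(row):
            totals[i] += x * x
    return [t / len(calibration_rows) for t in totals]


def ranking(scores):
    """Slice indices ordered from most to least important."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Invented activations for three slices on two kinds of calibration text.
wiki_like = [[0.9, 0.1, 0.2], [1.1, 0.2, 0.3]]
chat_like = [[0.2, 1.0, 0.1], [0.1, 0.8, 0.3]]

print("wiki-like ranking:", ranking(slice_importance(wiki_like, 3)))
print("chat-like ranking:", ranking(slice_importance(chat_like, 3)))
```

With these made-up numbers the two datasets disagree about which slice matters most, so the extra bits would go to different places depending on what you calibrate on.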

[–] Biggest_Cans@alien.top 1 points 2 years ago

I won't use anything else for GPU processing.

The quality bump I've seen for my 4090 is very noticeable in speed, coherence and context.

Wild to me that TheBloke doesn't ever use it.

Easy enough to find quants though if you just go to models and search "exl2" and sort by whatever.