I posted my latest LLM Comparison/Test just yesterday, but here's another (shorter) comparison/benchmark I did while working on that - testing different formats and quantization levels.

My goal was to find out which format and quant to focus on. So I took the best 70B according to my previous tests, and re-tested that again with various formats and quants. I wanted to find out if they worked the same, better, or worse. And here's what I discovered:

| Model | Format | Quant | Offloaded Layers | VRAM Used | Primary Score | Secondary Score | Speed +mmq | Speed -mmq |
|:------|:-------|:------|:-----------------|:----------|:--------------|:----------------|:-----------|:-----------|
| lizpreciatior/lzlv_70B.gguf | GGUF | Q4_K_M | 83/83 | 39362.61 MB | 18/18 | 4+3+4+6 = 17/18 | | |
| lizpreciatior/lzlv_70B.gguf | GGUF | Q5_K_M | 70/83 ! | 40230.62 MB | 18/18 | 4+3+4+6 = 17/18 | | |
| TheBloke/lzlv_70B-GGUF | GGUF | Q2_K | 83/83 | 27840.11 MB | 18/18 | 4+3+4+6 = 17/18 | 4.20T/s | 4.01T/s |
| TheBloke/lzlv_70B-GGUF | GGUF | Q3_K_M | 83/83 | 31541.11 MB | 18/18 | 4+3+4+6 = 17/18 | 4.41T/s | 3.96T/s |
| TheBloke/lzlv_70B-GGUF | GGUF | Q4_0 | 83/83 | 36930.11 MB | 18/18 | 4+3+4+6 = 17/18 | 4.61T/s | 3.94T/s |
| TheBloke/lzlv_70B-GGUF | GGUF | Q4_K_M | 83/83 | 39362.61 MB | 18/18 | 4+3+4+6 = 17/18 | 4.73T/s !! | 4.11T/s |
| TheBloke/lzlv_70B-GGUF | GGUF | Q5_K_M | 70/83 ! | 40230.62 MB | 18/18 | 4+3+4+6 = 17/18 | 1.51T/s | 1.46T/s |
| TheBloke/lzlv_70B-GGUF | GGUF | Q5_K_M | 80/83 | 46117.50 MB | OutOfMemory | | | |
| TheBloke/lzlv_70B-GGUF | GGUF | Q5_K_M | 83/83 | 46322.61 MB | OutOfMemory | | | |
| LoneStriker/lzlv_70b_fp16_hf-2.4bpw-h6-exl2 | EXL2 | 2.4bpw | | 11,11 -> 22 GB | BROKEN | | | |
| LoneStriker/lzlv_70b_fp16_hf-2.6bpw-h6-exl2 | EXL2 | 2.6bpw | | 12,11 -> 23 GB | FAIL | | | |
| LoneStriker/lzlv_70b_fp16_hf-3.0bpw-h6-exl2 | EXL2 | 3.0bpw | | 14,13 -> 27 GB | 18/18 | 4+2+2+6 = 14/18 | | |
| LoneStriker/lzlv_70b_fp16_hf-4.0bpw-h6-exl2 | EXL2 | 4.0bpw | | 18,17 -> 35 GB | 18/18 | 4+3+2+6 = 15/18 | | |
| LoneStriker/lzlv_70b_fp16_hf-4.65bpw-h6-exl2 | EXL2 | 4.65bpw | | 20,20 -> 40 GB | 18/18 | 4+3+2+6 = 15/18 | | |
| LoneStriker/lzlv_70b_fp16_hf-5.0bpw-h6-exl2 | EXL2 | 5.0bpw | | 22,21 -> 43 GB | 18/18 | 4+3+2+6 = 15/18 | | |
| LoneStriker/lzlv_70b_fp16_hf-6.0bpw-h6-exl2 | EXL2 | 6.0bpw | | > 48 GB | TOO BIG | | | |
| TheBloke/lzlv_70B-AWQ | AWQ | 4-bit | | | OutOfMemory | | | |

My AI Workstation:

  • 2 GPUs (48 GB VRAM): Asus ROG STRIX RTX 3090 O24 Gaming White Edition (24 GB VRAM) + EVGA GeForce RTX 3090 FTW3 ULTRA GAMING (24 GB VRAM)
  • 13th Gen Intel Core i9-13900K (24 Cores, 8 Performance-Cores + 16 Efficient-Cores, 32 Threads, 3.0-5.8 GHz)
  • 128 GB DDR5 RAM (4x 32GB Kingston Fury Beast DDR5-6000 MHz) @ 4800 MHz ☹️
  • ASUS ProArt Z790 Creator WiFi
  • 1650W Thermaltake ToughPower GF3 Gen5
  • Windows 11 Pro 64-bit

Observations:

  • Scores = number of correct answers to the multiple-choice questions of the 1st test series (4 German data protection trainings), as usual
    • Primary Score = number of correct answers after being given the relevant information
    • Secondary Score = number of correct answers without being given the information first (blind)
  • Model's official prompt format (Vicuna 1.1), deterministic settings. Different quants still produce different outputs because of internal differences.
  • Speed is taken from koboldcpp-1.49's stats, after a fresh start (no cache) with 3K of the 4K context already filled, with (+) or without (-) the mmq option to --usecublas (see the launch sketch right after this list).
  • LoneStriker/lzlv_70b_fp16_hf-2.4bpw-h6-exl2: 2.4bpw = BROKEN! Didn't work at all, outputting only one word and repeating it ad infinitum.
  • LoneStriker/lzlv_70b_fp16_hf-2.6bpw-h6-exl2: 2.6bpw = FAIL! Acknowledged questions with just "OK" as if they were information, didn't answer unless prompted again, and made mistakes despite being given the relevant information.
  • Surprisingly, even EXL2 5.0bpw did much worse than GGUF Q2_K.
  • AWQ just doesn't work for me with oobabooga's text-generation-webui: despite 2x 24 GB VRAM, it goes OOM. Allocation seems to be broken, so I'm giving up on that format for now.
  • All versions consistently acknowledged all data input with "OK" and followed instructions to answer with just a single letter or more than just a single letter.
  • EXL2 isn't entirely deterministic. Its author said speed is more important than determinism, and I agree, but the quality loss and non-determinism make it less suitable for model tests and comparisons.
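
For reference, a minimal sketch of how such a +mmq/-mmq pair of runs can be launched with koboldcpp. The flags are standard koboldcpp options; the filename and layer count are placeholders taken from the table, not necessarily the exact command line used for these tests:

import shlex

MODEL = "lzlv_70b.Q4_K_M.gguf"   # placeholder filename for the Q4_K_M quant from the table

for mmq in (True, False):
    cmd = [
        "python", "koboldcpp.py", MODEL,
        "--gpulayers", "83",        # offload all 83 layers, as in the 83/83 rows
        "--contextsize", "4096",    # 4K context; ~3K of it was already filled before measuring
        "--usecublas",
    ]
    if mmq:
        cmd.append("mmq")           # the "+mmq" runs pass mmq to --usecublas
    print(shlex.join(cmd))          # print the command line; launch one of the two per test run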

Conclusion:

  • With AWQ not working and EXL2 delivering bad quality (secondary score dropped a lot!), I'll stick to the GGUF format for further testing, for now at least.
  • It's strange that bigger quants got more tokens per second than smaller ones - maybe that's because of different responses - but Q4_K_M with mmq was fastest, so I'll use that for future comparisons and tests.
  • For real-time uses like Voxta+VaM, EXL2 4-bit is better - it's fast and accurate, yet not too big (I need some of the VRAM for rendering the AI's avatar in AR/VR). It feels almost as fast as unquantized Transformers Mistral 7B, but is much more accurate for function calling/action inference and summarization (it's a 70B after all).

So these are my - quite unexpected - findings with this setup. I'm sharing them with you all and looking for feedback: has anyone done perplexity tests or other benchmarks between formats? Is EXL2 really such a tradeoff between speed and quality in general, or could that be a model-specific effect here?


Here's a list of my previous model tests and comparisons or other related posts:


Disclaimer: Some kind soul recently asked me if they could tip me for my LLM reviews and advice, so I set up a Ko-fi page. While this may affect the priority/order of my tests, it will not change the results, I am incorruptible. Also consider tipping your favorite model creators, quantizers, or frontend/backend devs if you can afford to do so. They deserve it!

top 26 comments
[–] llama_in_sunglasses@alien.top 1 points 10 months ago (1 children)

GGUF k-quants are really good at making sure the most important parts of the model are not x-bit but q6_k if possible. GPTQ and AWQ models can fall apart and give total bullshit at 3 bits, while the same model in q2_k / q3_k_s with around 3 bits usually still outputs sentences.

[–] ReturningTarzan@alien.top 1 points 10 months ago

When you're using non-instruct models for instruct-type questions, prompting is everything. For comparison, here are the first three questions put to Mistral-7B-instruct with correct prompt format at various bitrates up to FP16.

[–] tgredditfc@alien.top 1 points 10 months ago (1 children)

I have 2 GPUs and AWQ never works for me on Oobabooga; no matter how I split the VRAM, it OOMs in most cases.

[–] thereisonlythedance@alien.top 1 points 10 months ago

I had to split it something strange like 12/24GB to make it work. Even then I couldn’t get past 3K context.

[–] CosmosisQ@alien.top 1 points 10 months ago

Hell yeah! Two days in a row! We need more people doing format comparisons and benchmarks in general. Again, thank you for all of your hard work, and keep 'em coming!

How would you say EXL2 subjectively compares to GGUF? Have you had the chance to roleplay with both formats outside of Voxta+VaM (i.e., in SillyTavern)? I ask because I'm sure the increased generation speed is more important than anything when using Voxta+VaM so it might be easier to compare their output quality in SillyTavern.

On that note, would you say you now prefer using lzlv (70B, EXL2) over OpenChat 3.5 (7B, GGUF) with Voxta+VaM?

[–] ambient_temp_xeno@alien.top 1 points 10 months ago

EXL2 5.0bpw was surprisingly doing much worse than GGUF Q2_K

Risitas.mov https://www.youtube.com/watch?v=QT13kk8HDDo

[–] panchovix@alien.top 1 points 10 months ago (4 children)

The major reason I use EXL2 is speed: on 2x4090 I get 15-20 t/s at 70B depending on the size, while with GGUF I get 4-5 t/s at most.

When using 3 GPUs (2x4090 + 1x3090), it's 11-12 t/s at 6.55bpw vs. GGUF Q6_K, which runs at 2-3 t/s.

Though I agree with you: for model comparisons and such you need deterministic results and also the best quality.

If you can, try 70B at 6bpw or more sometime; IMO it is pretty consistent and doesn't have the issues that 5bpw/5-bit does.

The performance hit is too much on multi-GPU systems when using GGUF. I guess if the speed gets to the same level in the future, I'd use it most of the time.

[–] a_beautiful_rhind@alien.top 1 points 10 months ago

I'm surprised you get speeds so bad with GGUF. I get almost 9t/s on P40s and 18t/s on 3090.

GGUF is actually the fastest format until you load it up with context.

A couple of things have to be changed in the CMakeLists under vendor/llama.cpp if you're using the Python bindings (llama-cpp-python):

set(LLAMA_CUDA_MMV_Y        "2" CACHE STRING "llama: y block size for mmv CUDA kernels")
option(LLAMA_CUDA_FORCE_MMQ                  "llama: use mmq kernels instead of cuBLAS"         ON)

I have nvlink so this helps me. Since you don't it still may help using direct communication via PCIE:

set(LLAMA_CUDA_PEER_MAX_BATCH_SIZE "8192" CACHE STRING "llama: max. batch size for using peer access")

and since you're using all new cards:

option(LLAMA_CUDA_F16                        "llama: use 16 bit floats for some calculations"   OFF)

Try out the FP16 support by switching that option to ON.
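
If you'd rather not edit vendor/llama.cpp/CMakeLists.txt by hand, llama-cpp-python also accepts these as CMake defines via the CMAKE_ARGS environment variable at build time. A rough sketch (flag names as of late-2023 llama.cpp, values as suggested above):

import os, subprocess, sys

os.environ["CMAKE_ARGS"] = " ".join([
    "-DLLAMA_CUBLAS=on",                      # build with CUDA/cuBLAS support
    "-DLLAMA_CUDA_FORCE_MMQ=on",              # mmq kernels instead of cuBLAS
    "-DLLAMA_CUDA_MMV_Y=2",                   # y block size for the mmv CUDA kernels
    "-DLLAMA_CUDA_PEER_MAX_BATCH_SIZE=8192",  # larger batches over peer access / PCIe
])
os.environ["FORCE_CMAKE"] = "1"               # force a rebuild from source with the args above
subprocess.check_call([sys.executable, "-m", "pip", "install",
                       "--force-reinstall", "--no-cache-dir", "llama-cpp-python"])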

[–] easyllaama@alien.top 1 points 10 months ago

'The performance hit is too much on multigpu systems when using GGUF'

I agree. GGUF has a multi-GPU penalty, but it's the most friendly format to Apple silicon. I have the same setup as you: one 4090 can run Xwin 13B at 40 t/s, but when 2 cards are present, it gets only 1/4 of the speed at 10 t/s. So to get it fast, I have to flag the CUDA device to a single card while 2 cards are present (see the sketch below).

Since GGUF likes a single GPU, those who have a 3090/4090 will find 34B the sweet spot with this format.
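
One common way to do the "flag CUDA device to a single card" part is to hide the second GPU from the process before anything CUDA-related is initialized (a generic sketch, not necessarily how it was done here):

import os

# Must be set before any CUDA library is initialized (i.e. before importing torch,
# llama-cpp-python, etc.), otherwise both cards are already visible to the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only the first GPU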

[–] candre23@alien.top 1 points 10 months ago

GGUF I get like tops 4-5 t/s.

You're doing something very wrong. I get better speeds than that on P40s with low context. Are you not using cublas?

[–] bullerwins@alien.top 1 points 10 months ago

What motherboard do you have that can run 3x GPUs?

[–] kpodkanowicz@alien.top 1 points 10 months ago

Great work as always! Regarding EXL2: it's sensitive to the calibration dataset - probably the one that was used is not related to your tests. I.e. you can get higher scores in HumanEval even at 3 bits than you would get in Transformers 8-bit. I hope this standard gets more popular and finetuners do their own measurement files/quants using their own datasets. I've never seen Q2 GGUF doing better than EXL2 unless I mixed up the RoPE config.

Edit - for anything higher than 4.25bpw I usually use an 8-bit head.
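
For reference, building your own measurement file/quant with a custom calibration dataset looks roughly like this with exllamav2's convert.py. Directories, dataset, and bitrate are placeholders, and the flag names are from late-2023 exllamav2, so check convert.py --help for your version:

import subprocess

cmd = [
    "python", "convert.py",
    "-i", "models/lzlv_70b_fp16_hf",          # source FP16 HF model (placeholder path)
    "-o", "work/",                            # working dir; the measurement file ends up here
    "-cf", "models/lzlv_70b-4.65bpw-exl2",    # compiled output dir (placeholder)
    "-c", "my_calibration_data.parquet",      # calibration data matching your use case (placeholder)
    "-b", "4.65",                             # target bits per weight
    "-hb", "8",                               # 8-bit head, as suggested above for >4.25bpw
]
subprocess.run(cmd, check=True)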

[–] Aaaaaaaaaeeeee@alien.top 1 points 10 months ago

On 2.Xbpw, untick "add bos_token" to avoid the "cord string builder" looping.

[–] ReMeDyIII@alien.top 1 points 10 months ago

For real-time uses like Voxta+VaM, EXL2 4-bit is better

Wow, I didn't expect to see a Virt-a-Mate reference. You left no stone unturned and are doing God's work.

[–] Unequaled@alien.top 1 points 10 months ago

/u/WolframRavenwolf

Honestly, ever since I saw someone mention that with EXL2 I could run a 70B model on a single 4090/3090 (24 GB VRAM), I was instantly hooked. Especially since enabling the 8-bit cache option means you can run even higher context sizes, sometimes 2x more.

The main advantage, as you mention, is speed. As an RP'er myself, I care somewhat less about quality responses. Speed is king in my opinion, since you can always swipe for more alternative responses. It's very hard to let go of 20-30 T/s vs. <5 T/s on GGUF. 😭

The baseline of a 70B is good enough to justify the tradeoff in quality. Besides, I don't have to buy ANOTHER 4090 to run 70B models.

Personally, I run the waldie_lzlv-limarpv3-l2-70b-2.4bpw-h6-exl2 version of lzlv. For one, it isn't broken, and it seems to give somewhat better and more creative responses.

Side note: Did you notice with Nous Capybara 34B that spelling mistakes or weird sentences would form at longer contexts? Because sometimes I would get weird nonsensical sentences or stuff like "I'll'" or even a Chinese character.

[–] nsfw_throwitaway69@alien.top 1 points 10 months ago

I wasn't aware that EXL2 had issues with quality. Your tests seem to suggest that an equivalent bpw in EXL2 produces worse results than in GGUF. I wonder why that is.

[–] Worldly-Mistake-8147@alien.top 1 points 10 months ago

I'm probably going to ask something extremely basic, but why isn't GPTQ an option? With OP's dual GPUs he can run 4-bit 32g with 8K context, and I was under the impression that the quality loss is barely noticeable. Though I noticed it absolutely messes up numbers (math, or historical dates).
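
For context, loading such a GPTQ quant across both cards would look roughly like this with AutoGPTQ. The repo name and options are illustrative placeholders, not something tested in this thread:

from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

repo = "TheBloke/lzlv_70B-GPTQ"   # hypothetical repo; the 4-bit, group size 32 branch would be picked on the hub
model = AutoGPTQForCausalLM.from_quantized(repo, device_map="auto", use_safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(repo)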

[–] DataPhreak@alien.top 1 points 10 months ago

The speeds don't really surprise me. They're going to take longer to load, but the math is about the same once they're stood up.

[–] w4ldfee@alien.top 1 points 10 months ago

i run lzlv 2.4bpw without problems. make sure to disable bos token, then it should work way better.

[–] Ycros@alien.top 1 points 10 months ago

It may be interesting to anyone running models across two 3090s that in llama.cpp/koboldcpp there's a performance increase if your two GPUs support peering with one another (check with nvidia-smi topo -p2p r). It wasn't working with my particular motherboard, so I installed an NVLink bridge and got a performance bump in token generation (an extra 10-20% with 70B, more with smaller models - except smaller models go much faster if you can fit them on one GPU anyway).

I have no idea what the performance difference is between having a bridge and peering via PCIe if your system supports it. I also tested EXL2 and there was no difference, as I don't think it implements any sort of peering optimisations.
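
The peer-access check mentioned above, scripted for convenience (the command is verbatim from the comment; it just prints the P2P read-capability matrix):

import subprocess

result = subprocess.run(["nvidia-smi", "topo", "-p2p", "r"],
                        capture_output=True, text=True)
print(result.stdout)   # shows whether the GPUs can peer with each other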

[–] permalip@alien.top 1 points 10 months ago (1 children)

FYI, AWQ released 0.1.7, which fixes multi-GPU support. It should alleviate the OOM issues on multi-GPU setups that appeared with newer versions of the Hugging Face libraries.

https://github.com/casper-hansen/AutoAWQ/releases/tag/v0.1.7
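
A minimal AutoAWQ loading sketch for anyone retesting after that release. The repo name is the one from the table above; how the weights end up split across two 24 GB cards depends on the AutoAWQ/Transformers versions, so treat this as a starting point:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "TheBloke/lzlv_70B-AWQ"
model = AutoAWQForCausalLM.from_quantized(model_path, fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)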

[–] WolframRavenwolf@alien.top 1 points 10 months ago

Oh, great news! Once that's in ooba, I'll give it another try.

[–] lone_striker@alien.top 1 points 10 months ago (1 children)

For the 2.4bpw and 2.6bpw exl2 models, you have to change a setting in ooba to get them to generate coherent text. Disable this setting:

Add the bos_token to the beginning of prompts

https://preview.redd.it/4v8m7ciu0y1c1.png?width=356&format=png&auto=webp&s=785837b8466a3bcda3e49477424b7c377a8d542f

The very low bpw models need the above setting as well as being more strict with the prompt format. The higher bpw models are more flexible and can deal with prompt formats they were not specifically tuned for.

I would also set the VRAM for 2.4bpw to use only a single GPU. Spreading it out over two GPUs is not needed and will slow it down. That's the main reason I generate the 2.4bpw (and 2.6bpw) versions: to allow people with only a single 3090 or 4090 to run 70B models at full speed. Though obviously quality will be lower than with the higher-bit models. For 2.6bpw to fit on a single 24 GB VRAM GPU, you will need to enable the cache_8bit option.
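
For what it's worth, a rough sketch of what those settings correspond to in the exllamav2 Python API - the whole model on one GPU plus the 8-bit cache behind ooba's cache_8bit checkbox. The path, the 24/0 split, and the prompt are placeholders, not ooba's internal code:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "models/lzlv_70b_fp16_hf-2.6bpw-h6-exl2"  # placeholder local path
config.prepare()

model = ExLlamaV2(config)
model.load([24, 0])                  # keep everything on GPU 0, leave GPU 1 untouched
cache = ExLlamaV2Cache_8bit(model)   # 8-bit K/V cache, needed to fit 2.6bpw plus context in 24 GB
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()  # defaults; deterministic sampling settings would go here
print(generator.generate_simple("USER: Hi!\nASSISTANT:", settings, 64))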

[–] WolframRavenwolf@alien.top 1 points 10 months ago

Does the 8-bit cache reduce quality or speed, or what's its disadvantage? (If it had none, it would be the default, I assume.)

[–] ChiefBigFeather@alien.top 1 points 10 months ago

This is difficult to evaluate. It could be that exl2 just breaks the translation layer.