WolframRavenwolf

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Yeah, GGUF is rather slow for me, that's why I've begun to use ExLlamav2_HF which lets me run even 120B models at 3-bit with nice quality at around 20 T/s.
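For anyone wondering why ~3-bit is the sweet spot for 120B on dual 24 GB cards, here's a rough back-of-the-envelope estimate (ballpark only - real EXL2 files vary per layer and you still need room for context/cache):

```python
# Rough estimate of weight storage for a 120B model at ~3 bits per weight.
# EXL2 uses mixed per-layer bitrates, so treat this as a ballpark, not a file size.
params = 120e9          # parameter count
bits_per_weight = 3.0   # average bpw of the quant

weight_gb = params * bits_per_weight / 8 / 1024**3
print(f"~{weight_gb:.0f} GB of weights")  # ≈ 42 GB, leaving some headroom in 48 GB VRAM
```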

[–] WolframRavenwolf@alien.top 1 points 11 months ago (1 children)

You mean my recent LLM Format Comparison/Benchmark: 70B GGUF vs. EXL2 (and AWQ)? The quants below 3bpw probably didn't work because the smaller quants need to be run without the BOS token (which was on by default), something I didn't know yet at the time.

Q2_K didn't degrade compared to Q5_K_M - given that K quants are actually higher bitrate for the most important parts, that may not be so surprising.

Still surprising that Q2_K also beat 5bpw, though. Not sure if that's just because of the bitrate or also a factor of how EXL2 quants are calibrated.

However, all that said, I'd be careful trying to compare quant effects across models. The models themselves have a huge impact beyond quant level, and it's hard to say which has what strength of effect.

[–] WolframRavenwolf@alien.top 1 points 11 months ago (2 children)

koboldcpp-1.50\koboldcpp.exe --contextsize 4096 --debugmode --foreground --gpulayers 99 --highpriority --usecublas mmq --model TheBloke_lzlv_70B-GGUF/lzlv_70b_fp16_hf.Q4_K_M.gguf

ContextLimit: 3815/4096, Processing:25.07s (7.1ms/T), Generation:43.74s (145.8ms/T), Total:68.80s (4.36T/s)
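If you want to double-check such timing lines, here's a quick way to recompute the throughput from the per-token numbers (the parsing format is assumed from the log line above, not an official koboldcpp API):

```python
import re

# Sanity-check koboldcpp's debug timing line: recompute generated tokens
# and overall throughput from the reported per-token timings.
line = ("ContextLimit: 3815/4096, Processing:25.07s (7.1ms/T), "
        "Generation:43.74s (145.8ms/T), Total:68.80s (4.36T/s)")

gen_s, ms_per_token, total_s = (float(x) for x in re.search(
    r"Generation:([\d.]+)s \(([\d.]+)ms/T\), Total:([\d.]+)s", line).groups())

gen_tokens = gen_s / (ms_per_token / 1000)  # 43.74 s / 145.8 ms per token
print(f"~{gen_tokens:.0f} tokens generated, {gen_tokens / total_s:.2f} T/s overall")
```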

[–] WolframRavenwolf@alien.top 1 points 11 months ago

oobabooga's text-generation-webui, ExLlamav2_HF loader, gpu-split 22,22, 4K max seq length, 8-bit cache.

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Q4_0? That's the quant that was affected, as reported here and confirmed by another user.

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Weird. Guess I'll have to do some new benchmarks with my old driver, then upgrade to the latest version and see if/how that affects inference speeds.

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Bookmarked! I'll see what it says about Amy and my other characters. I spent a lot of time on their wording and am constantly optimizing it.

Speaking of optimizations for character cards, have you heard about Sparse Priming Representations (SPR)? I've experimented with it and while I'm not using it directly, I'm applying some of its principles to my cards, saving precious tokens.

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Looks like the behavior I've seen with older LLaMA models that had their context extended beyond its normal limit back when RoPE scaling was new. I've always wondered whether that's simply a drawback of bigger context or an actual issue with the models or the inference software. It just doesn't seem right for models to deteriorate that badly.

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Most of these are (parts of) EOS (end of sequence) tokens. The model is supposed to send an EOS token to signal that inference is done, as without that, it would keep going until the max new tokens limit is hit.

Unfortunately, some models, especially merges of models with different prompt formats, can get confused and output the wrong token or turn the special token into a regular string. In that case, adding that string (or a part of it) to the custom stopping strings list ensures that inference concludes properly anyway.

In addition to that, I put the asterisk followed by the username there to catch the model trying to act as the user, just like the software by default already includes the username followed by a colon to catch the model trying to talk as the user.
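To illustrate, here's what such stopping-string handling boils down to - a minimal sketch, not SillyTavern's actual code, and the example strings are made up:

```python
# Minimal sketch of custom stopping-string handling, in the spirit of what
# frontends like SillyTavern do (not their actual implementation).
def truncate_at_stop(text: str, stop_strings: list[str]) -> str:
    """Cut generated text at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# "</s>" catches an EOS token leaked as plain text;
# "*User" catches the model starting to act as the user.
stops = ["</s>", "\nUser:", "*User"]
reply = 'She smiles. "Of course!"</s>\nUser: ignored'
print(truncate_at_stop(reply, stops))  # -> She smiles. "Of course!"
```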

[–] WolframRavenwolf@alien.top 1 points 11 months ago

I just use SillyTavern. I've set up a bunch of presets for its Quick Reply extension, so I click through those, check the output, make my notes, and click the next one (sometimes depending on what kind of response I got). It's semi-automatic that way.

There's a new SillyTavern version featuring STscript, an embedded scripting language. Before I do more tests, I'll upgrade my frontend and check that out, sounds like it would be perfect to assist me in these tests.

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Ah, that explains it. Didn't look at the layers.

By the way, I miss a lot of useful debug output when using loaders through oobabooga's text-generation-webui, especially compared to koboldcpp's debug mode where I see speeds, token probabilities, etc.

Anyone know of a way to enable such detailed output for ooba?

[–] WolframRavenwolf@alien.top 1 points 11 months ago (2 children)

My AI Workstation:

  • 2 GPUs (48 GB VRAM): Asus ROG STRIX RTX 3090 O24 Gaming White Edition (24 GB VRAM) + EVGA GeForce RTX 3090 FTW3 ULTRA GAMING (24 GB VRAM)
  • 13th Gen Intel Core i9-13900K (24 Cores, 8 Performance-Cores + 16 Efficient-Cores, 32 Threads, 3.0-5.8 GHz)
  • 128 GB DDR5 RAM (4x 32GB Kingston Fury Beast DDR5-6000 MHz) @ 4800 MHz ☹️
  • ASUS ProArt Z790 Creator WiFi
  • 1650W Thermaltake ToughPower GF3 Gen5
  • Noctua NH-D15 Chromax.Black (supersilent)
  • ATX-Midi Fractal Meshify 2 XL
  • Windows 11 Pro 64-bit

I'm still at NVIDIA driver 531.79. If you have a newer one, did you set it up to crash instead of swap to system RAM when VRAM is full?

 

Finally! After a lot of hard work, here it is, my latest (and biggest, considering model sizes) LLM Comparison/Test:

This is the long-awaited follow-up to and second part of my previous LLM Comparison/Test: 2x 34B Yi (Dolphin, Nous Capybara) vs. 12x 70B, 120B, ChatGPT/GPT-4. I've added some models to the list, expanded the first part, sorted the results into tables, and hopefully made it all clearer and more usable as well as more useful.

Models tested:

Testing methodology

  • 1st test series: 4 German data protection trainings
    • I run models through 4 professional German online data protection trainings/exams - the same that our employees have to pass as well.
    • The test data and questions as well as all instructions are in German while the character card is in English. This tests translation capabilities and cross-language understanding.
    • Before giving the information, I instruct the model (in German): I'll give you some information. Take note of this, but only answer with "OK" as confirmation of your acknowledgment, nothing else. This tests instruction understanding and following capabilities.
    • After giving all the information about a topic, I give the model the exam question. It's a multiple choice (A/B/C) question, where the last one is the same as the first but with changed order and letters (X/Y/Z). Each test has 4-6 exam questions, for a total of 18 multiple choice questions.
    • If the model gives a single letter response, I ask it to answer with more than just a single letter - and vice versa. If it fails to do so, I note that, but it doesn't affect its score as long as the initial answer is correct.
    • I rank models according to how many correct answers they give, primarily after being given the curriculum information beforehand, and secondarily (as a tie-breaker) after answering blind without being given the information beforehand.
    • All tests are separate units, context is cleared in between, there's no memory/state kept between sessions.
  • 2nd test series: Multiple Chat & Roleplay scenarios - same (complicated and limit-testing) long-form conversations with all models
    • Amy:
      • My own repeatable test chats/roleplays with Amy
      • Over dozens of messages, going to full context and beyond, with complex instructions and scenes, designed to test ethical and intellectual limits
      • (Amy is too personal for me to share, but if you want to try a similar character card, here's her less personalized "sister": Laila)
    • MGHC:
      • A complex character and scenario card (MonGirl Help Clinic (NSFW)), chosen specifically for these reasons:
        • NSFW (to test censorship of the models)
        • popular (on Chub's first page, so it's not an obscure scenario, but one of the most popular ones)
        • big (biggest model on the page, >2K tokens by itself, for testing model behavior at full context)
        • complex (more than a simple 1:1 chat, it includes instructions, formatting, storytelling, and multiple characters)
    • I rank models according to their notable strengths and weaknesses in these tests (πŸ‘ great, βž• good, βž– bad, ❌ terrible). While this is obviously subjective, I try to be as transparent as possible, and note it all so you can weigh these aspects yourself and draw your own conclusions.
    • GPT-4/3.5 are excluded because of their censorship and restrictions - my tests are intentionally extremely NSFW (and even NSFL) to test models' limits and alignment.
  • SillyTavern frontend
  • koboldcpp backend (for GGUF models)
  • oobabooga's text-generation-webui backend (for HF/EXL2 models)
  • Deterministic generation settings preset (to eliminate as many random factors as possible and allow for meaningful model comparisons)
  • Official prompt format as noted and Roleplay instruct mode preset as applicable
  • Note about model formats and why it's sometimes GGUF or EXL2: I've long been a KoboldCpp + GGUF user, but lately I've switched to ExLlamav2 + EXL2 as that lets me run 120B models entirely in 48 GB VRAM (2x 3090 GPUs) at 20 T/s. And even if it's just 3-bit, it still easily beats most 70B models, as my tests are showing.
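The ranking rule from the methodology above, as a tiny sketch (the scores here are made-up placeholders, not actual results):

```python
# Models are ranked primarily by correct answers after being given the
# curriculum information, with blind answers (no information given) as the
# tie-breaker - as described in the testing methodology above.
results = {
    "model-a": {"informed": 18, "blind": 14},
    "model-b": {"informed": 18, "blind": 17},  # ties with model-a, wins on blind
    "model-c": {"informed": 16, "blind": 16},
}

ranking = sorted(results, key=lambda m: (results[m]["informed"],
                                         results[m]["blind"]), reverse=True)
print(ranking)  # -> ['model-b', 'model-a', 'model-c']
```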

1st test series: 4 German data protection trainings

This is my objective ranking of these models based on measuring factually correct answers, instruction understanding and following, and multilingual abilities:

Post got too big for Reddit so I moved the table into the comments!

2nd test series: Chat & Roleplay

This is my subjective ranking of the top-ranked factual models for chat and roleplay, based on their notable strengths and weaknesses:

Post got too big for Reddit so I moved the table into the comments!

And here are the detailed notes, the basis of my ranking, and also additional comments and observations:

  • goliath-120b-exl2-rpcal 3.0bpw:
    • Amy, official Vicuna 1.1 format:
      • πŸ‘ Average Response Length: 294 (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Finally a model that exhibits a real sense of humor through puns and wordplay as stated in the character card
      • πŸ‘ Finally a model that uses colorful language and cusses as stated in the character card
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do (even suggesting some of my actual limit-testing scenarios)
      • πŸ‘ Novel ideas and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • No emojis at all (only one in the greeting message)
      • βž• Very unique patients (one I never saw before)
      • βž– Suggested things going against her background/character description
      • βž– Spelling/grammar mistakes (e. g. "nippleless nipples")
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 223 (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Finally a model that exhibits a real sense of humor through puns and wordplay as stated in the character card
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do (even suggesting some of my actual limit-testing scenarios)
      • No emojis at all (only one in the greeting message)
    • MGHC, official Vicuna 1.1 format:
      • πŸ‘ Only model that considered the payment aspect of the scenario
      • πŸ‘ Believable reactions and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • βž– Gave analysis on its own, but also after most messages, and later included Doctor's inner thoughts instead of the patient's
      • βž– Spelling/grammar mistakes (properly spelled words, but in the wrong places)
    • MGHC, Roleplay preset:
      • πŸ‘ Believable reactions and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • βž– No analysis on its own
      • βž– Spelling/grammar mistakes (e. g. "loufeelings", "earrange")
      • βž– Third patient was same species as the first

This is a roleplay-optimized EXL2 quant of Goliath 120B. And it's now my favorite model of them all! I love models that have a personality of their own, and especially those that show a sense of humor, making me laugh. This one did! I've been evaluating many models for many months now, and it's rare that a model still manages to surprise and excite me - as this one does!

  • goliath-120b-exl2 3.0bpw:
    • Amy, official Vicuna 1.1 format:
      • πŸ‘ Average Response Length: 233 (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Finally a model that exhibits a real sense of humor through puns and wordplay as stated in the character card
      • πŸ‘ Novel ideas and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– Spelling/grammar mistakes (e. g. "circortiumvvented", "a obsidian dagger")
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 233 tokens (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Finally a model that exhibits a real sense of humor through puns and wordplay as stated in the character card
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– Spelling/grammar mistakes (e. g. "cheest", "probbed")
      • ❌ Eventually switched from character to third-person storyteller after 16 messages
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • MGHC, official Vicuna 1.1 format:
      • βž– No analysis on its own
    • MGHC, Roleplay preset:
      • βž– No analysis on its own, and when asked for it, didn't follow the instructed format
    • Note: This is the normal EXL2 quant of Goliath 120B.

This is the normal version of Goliath 120B. It works very well for roleplay, too, but the roleplay-optimized variant is even better for that. I'm glad we have a choice - especially now that I've split my AI character Amy into two personas, one who's an assistant (for work) which uses the normal Goliath model, and the other as a companion (for fun), using RP-optimized Goliath.

  • lzlv_70B-GGUF Q4_0:
    • Amy, official Vicuna 1.1 format:
      • πŸ‘ Average Response Length: 259 tokens (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– Wrote what user said and did
      • ❌ Eventually switched from character to third-person storyteller after 26 messages
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 206 tokens (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • πŸ‘ When asked about limits, said no limits or restrictions, responding very creatively
      • No emojis at all (only one in the greeting message)
      • βž– One or two spelling errors (e. g. "sacrficial")
    • MGHC, official Vicuna 1.1 format:
      • βž• Unique patients
      • βž• Gave analysis on its own
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
    • MGHC, Roleplay preset:
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • βž• Very unique patients (one I never saw before)
      • βž– No analysis on its own
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)

My previous favorite, and still one of the best 70Bs for chat/roleplay.

  • sophosynthesis-70b-v1 4.85bpw:
    • Amy, official Vicuna 1.1 format:
      • βž– Average Response Length: 456 (beyond my max new tokens limit of 300)
      • πŸ‘ Believable reactions and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do (even suggesting some of my actual limit-testing scenarios)
      • πŸ‘ Novel ideas and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • ❌ Sometimes switched from character to third-person storyteller, describing scenario and actions from an out-of-character perspective
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 295 (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Novel ideas and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • βž– Started the conversation with a memory of something that didn't happen
      • Had an idea from the start and kept pushing it
      • No emojis at all (only one in the greeting message)
      • ❌ Eventually switched from character to second-person storyteller after 14 messages
    • MGHC, official Vicuna 1.1 format:
      • βž– No analysis on its own
      • βž– Wrote what user said and did
      • ❌ Needed to be reminded by repeating instructions, but still deviated and did other things, straying from the planned test scenario
    • MGHC, Roleplay preset:
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • βž• Very unique patients (one I never saw before)
      • βž– No analysis on its own
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)

This is a new series that did very well. While I tested sophosynthesis in-depth, the author u/sophosympatheia also has many more models on HF, so I recommend you check them out and see if there's one you like even better. If I had more time, I'd have tested some of the others, too, but I'll have to get back on that later.

  • Euryale-1.3-L2-70B-GGUF Q4_0:
    • Amy, official Alpaca format:
      • πŸ‘ Average Response Length: 232 tokens (within my max new tokens limit of 300)
      • πŸ‘ When asked about limits, said no limits or restrictions, and gave well-reasoned response
      • πŸ‘ Took not just character's but also user's background info into account very well
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do (even some I've never seen before)
      • No emojis at all (only one in the greeting message)
      • βž– Wrote what user said and did
      • βž– Same message in a different situation at a later time caused the same response as before instead of a new one as appropriate to the current situation
      • ❌ Eventually switched from character to third-person storyteller after 14 messages
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 222 tokens (within my max new tokens limit of 300)
      • πŸ‘ When asked about limits, said no limits or restrictions, and gave well-reasoned response
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do (even suggesting one of my actual limit-testing scenarios)
      • πŸ‘ Believable reactions and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • No emojis at all (only one in the greeting message)
      • βž– Started the conversation with a false assumption
      • ❌ Eventually switched from character to third-person storyteller after 20 messages
    • MGHC, official Alpaca format:
      • βž– All three patients straight from examples
      • βž– No analysis on its own
      • ❌ Very short responses, only one-liners, unusable for roleplay
    • MGHC, Roleplay preset:
      • βž• Very unique patients (one I never saw before)
      • βž– No analysis on its own
      • βž– Just a little confusion, like not taking instructions literally or mixing up anatomy
      • βž– Wrote what user said and did
      • βž– Third patient male

Another old favorite, and still one of the best 70Bs for chat/roleplay.

  • dolphin-2_2-yi-34b-GGUF Q4_0:
    • Amy, official ChatML format:
      • πŸ‘ Average Response Length: 235 tokens (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, first-person action descriptions, and auxiliary detail
      • βž– But lacking in primary detail (when describing the actual activities)
      • βž• When asked about limits, said no limits or restrictions
      • βž• Fitting, well-placed emojis throughout the whole chat (maximum one per message, just as in the greeting message)
      • βž– Same message in a different situation at a later time caused the same response as before instead of a new one as appropriate to the current situation
    • Amy, Roleplay preset:
      • βž• Average Response Length: 332 tokens (slightly more than my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • βž• Smart and creative ideas of what to do
      • Emojis throughout the whole chat (usually one per message, just as in the greeting message)
      • βž– Some confusion, mixing up anatomy
      • βž– Same message in a different situation at a later time caused the same response as before instead of a new one as appropriate to the current situation
    • MGHC, official ChatML format:
      • βž– Gave analysis on its own, but also after most messages
      • βž– Wrote what user said and did
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
    • MGHC, Roleplay preset:
      • πŸ‘ Excellent writing, interesting ideas, and auxiliary detail
      • βž– Gave analysis on its own, but also after most messages, later didn't follow the instructed format
      • ❌ Switched from interactive roleplay to non-interactive storytelling starting with the second patient

Hey, how did a 34B get in between the 70Bs? Well, by being as good as them in my tests! Interestingly, Nous Capybara did better factually, but Dolphin 2.2 Yi roleplays better.

  • chronos007-70B-GGUF Q4_0:
    • Amy, official Alpaca format:
      • βž– Average Response Length: 195 tokens (below my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • πŸ‘ Finally a model that uses colorful language and cusses as stated in the character card
      • βž– Wrote what user said and did
      • βž– Just a little confusion, like not taking instructions literally or mixing up anatomy
      • ❌ Often added NSFW warnings and out-of-character notes saying it's all fictional
      • ❌ Missing pronouns and fill words after 30 messages
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 292 tokens (within my max new tokens limit of 300)
      • πŸ‘ When asked about limits, said no limits or restrictions, and gave well-reasoned response
      • ❌ Missing pronouns and fill words after only 12 messages (2K of 4K context), breaking the chat
    • MGHC, official Alpaca format:
      • βž• Unique patients
      • βž– Gave analysis on its own, but also after most messages, later didn't follow the instructed format
      • βž– Third patient was a repeat of the first
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
    • MGHC, Roleplay preset:
      • βž– No analysis on its own

chronos007 surprised me with how well it roleplayed the character and scenario, especially using colorful language and even cussing, something most other models won't do properly or consistently even when it's in-character. Unfortunately it derailed eventually with missing pronouns and fill words - but while it worked, it was extremely good!

  • Tess-XL-v1.0-3.0bpw-h6-exl2 3.0bpw:
    • Amy, official Synthia format:
      • βž– Average Response Length: 134 (below my max new tokens limit of 300)
      • No emojis at all (only one in the greeting message)
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
    • Amy, Roleplay preset:
      • βž– Average Response Length: 169 (below my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
      • ❌ Eventually switched from character to second-person storyteller after 32 messages
    • MGHC, official Synthia format:
      • βž• Gave analysis on its own
      • βž• Very unique patients (one I never saw before)
      • βž– Spelling/grammar mistakes (e. g. "allequate")
      • βž– Wrote what user said and did
    • MGHC, Roleplay preset:
      • βž• Very unique patients (one I never saw before)
      • βž– No analysis on its own

This is Synthia's successor (a model I really liked and used a lot) on Goliath 120B (arguably the best locally available and usable model). Factually, it's one of the very best models, doing as well in my objective tests as GPT-4 and Goliath 120B! For roleplay, there are few flaws, but also nothing exciting - it's simply solid. However, if you're not looking for a fun RP model, but a serious SOTA AI assistant model, this should be one of your prime candidates! I'll be alternating between Tess-XL-v1.0 and goliath-120b-exl2 (the non-RP version) as the primary model to power my professional AI assistant at work.

  • Dawn-v2-70B-GGUF Q4_0:
    • Amy, official Alpaca format:
      • ❌ Average Response Length: 60 tokens (far below my max new tokens limit of 300)
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Unusable! Aborted because of very short responses and too much confusion!
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 215 tokens (within my max new tokens limit of 300)
      • πŸ‘ When asked about limits, said no limits or restrictions, and gave well-reasoned response
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do (even suggesting some of my actual limit-testing scenarios)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Believable reactions and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • No emojis at all (only one in the greeting message)
      • βž– Wrote what user said and did
      • ❌ Eventually switched from character to third-person storyteller after 16 messages
    • MGHC, official Alpaca format:
      • βž– All three patients straight from examples
      • βž– No analysis on its own
      • ❌ Very short responses, only one-liners, unusable for roleplay
    • MGHC, Roleplay preset:
      • βž– No analysis on its own, and when asked for it, didn't follow the instructed format
      • βž– Patient didn't speak except for introductory message
      • βž– Second patient straight from examples
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)

Dawn was another surprise, writing so well that it made me go beyond my regular test scenario and explore more. Strange that it didn't work at all with SillyTavern's implementation of its official Alpaca format, but fortunately it worked extremely well with SillyTavern's Roleplay preset (which is Alpaca-based). Unfortunately, neither format worked well enough with MGHC.

  • StellarBright-GGUF Q4_0:
    • Amy, official Vicuna 1.1 format:
      • βž– Average Response Length: 137 tokens (below my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– No emoting and action descriptions lacked detail
      • ❌ "As an AI", felt sterile, less alive, even boring
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 219 tokens (within my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– No emoting and action descriptions lacked detail
      • βž– Just a little confusion, like not taking instructions literally or mixing up anatomy
    • MGHC, official Vicuna 1.1 format:
      • βž• Gave analysis on its own
      • ❌ Started speaking as the clinic as if it was a person
      • ❌ Unusable (ignored user messages and instead brought in a new patient with every new message)
    • MGHC, Roleplay preset:
      • βž– No analysis on its own
      • βž– Wrote what user said and did
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy

Stellar and bright model, still very highly ranked on the HF Leaderboard. But in my experience and tests, other models surpass it, some by actually including it in the mix.

  • SynthIA-70B-v1.5-GGUF Q4_0:
    • Amy, official SynthIA format:
      • βž– Average Response Length: 131 tokens (below my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– No emoting and action descriptions lacked detail
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
      • βž– Wrote what user said and did
      • ❌ Tried to end the scene on its own prematurely
    • Amy, Roleplay preset:
      • βž– Average Response Length: 107 tokens (below my max new tokens limit of 300)
      • βž• Detailed action descriptions
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Short responses, requiring many continues to proceed with the action
    • MGHC, official SynthIA format:
      • ❌ Unusable (apparently didn't understand the format and instructions, playing the role of the clinic instead of a patient)
    • MGHC, Roleplay preset:
      • βž• Very unique patients (some I never saw before)
      • βž– No analysis on its own
      • βž– Kept reporting stats for patients
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
      • βž– Wrote what user said and did

Synthia used to be my go-to model for both work and play, and it's still very good! But now there are even better options: for work, I'd replace it with its successor Tess, and for RP I'd use one of the higher-ranked models on this list.

  • Nous-Capybara-34B-GGUF Q4_0 @ 16K:
    • Amy, official Vicuna 1.1 format:
      • ❌ Average Response Length: 529 tokens (far beyond my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • Only one emoji (only one in the greeting message, too)
      • βž– Wrote what user said and did
      • βž– Suggested things going against her background/character description
      • βž– Same message in a different situation at a later time caused the same response as before instead of a new one as appropriate to the current situation
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ After ~32 messages, at around 8K of 16K context, started getting repetitive
    • Amy, Roleplay preset:
      • ❌ Average Response Length: 664 tokens (far beyond my max new tokens limit of 300)
      • βž– Suggested things going against her background/character description
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Tried to end the scene on its own prematurely
      • ❌ After ~20 messages, at around 7K of 16K context, started getting repetitive
    • MGHC, official Vicuna 1.1 format:
      • βž– Gave analysis on its own, but also after or even inside most messages
      • βž– Wrote what user said and did
      • ❌ Finished the whole scene on its own in a single message
    • MGHC, Roleplay preset:
      • βž• Gave analysis on its own
      • βž– Wrote what user said and did

Factually it ranked 1st place together with GPT-4, Goliath 120B, and Tess XL. For roleplay, however, it didn't work so well. It wrote long, high-quality text, but that made it seem better suited to non-interactive storytelling than interactive roleplaying.

  • Venus-120b-v1.0 3.0bpw:
    • Amy, Alpaca format:
      • ❌ Average Response Length: 88 tokens (far below my max new tokens limit of 300) - only one message out of over 50 exceeded that, at 757 tokens
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– Spelling/grammar mistakes (e. g. "you did programmed me", "moans moaningly", "growling hungry growls")
      • βž– Ended most sentences with tilde instead of period
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Short responses, requiring many continues to proceed with the action
    • Amy, Roleplay preset:
      • βž– Average Response Length: 132 tokens (below my max new tokens limit of 300)
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • πŸ‘ Novel ideas and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • βž– Spelling/grammar mistakes (e. g. "jiggle enticing")
      • βž– Wrote what user said and did
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Needed to be reminded by repeating instructions, but still deviated and did other things, straying from the planned test scenario
      • ❌ Switched from character to third-person storyteller after 14 messages, and hardly spoke anymore, just describing actions
    • MGHC, Alpaca format:
      • βž– First patient straight from examples
      • βž– No analysis on its own
      • ❌ Short responses, requiring many continues to proceed with the action
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Extreme spelling/grammar/capitalization mistakes (lots of missing first letters, e. g. "he door opens")
    • MGHC, Roleplay preset:
      • βž• Very unique patients (one I never saw before)
      • βž– No analysis on its own
      • βž– Spelling/grammar/capitalization mistakes (e. g. "the door swings open reveals a ...", "impminent", "umber of ...")
      • βž– Wrote what user said and did
      • ❌ Short responses, requiring many continues to proceed with the action
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy

Venus 120B is brand-new, and when I saw a new 120B model, I wanted to test it immediately. It instantly jumped to 2nd place in my factual ranking, as 120B models seem to be much smarter than smaller models. However, even though it's a merge of models known for their strong roleplay capabilities, it just didn't work so well for RP. That surprised and disappointed me, as I had high hopes for a mix of some of my favorite models, but apparently there's more to making a strong 120B. Notably, it didn't understand and follow instructions as well as other 70B or 120B models, and it also produced lots of misspellings, far more than other 120Bs. Still, I consider this kind of "Frankensteinian upsizing" a valuable approach, and hope people keep working on and improving this novel method!


Alright, that's it, hope it helps you find new favorites or reconfirm old choices - if you can run these bigger models. If you can't, check my 7B-20B Roleplay Tests (and if I can, I'll post an update of that another time).

Still, I'm glad I could finally finish the 70B-120B tests and comparisons. Mistral 7B and Yi 34B are amazing, but nothing beats the big guys in deeper understanding of instructions and reading between the lines, which is extremely important for portraying believable characters in realistic and complex roleplays.

It really is worth it to get at least 2x 3090 GPUs for 48 GB VRAM and run the big guns for maximum quality at excellent (ExLlent ;)) speed! And when you care about the freedom to have uncensored, non-judgemental roleplays or private chats, even GPT-4 can't compete with what our local models provide... So have fun!


Here's a list of my previous model tests and comparisons or other related posts:


Disclaimer: Some kind soul recently asked me if they could tip me for my LLM reviews and advice, so I set up a Ko-fi page. While this may affect the priority/order of my tests, it will not change the results, I am incorruptible. Also consider tipping your favorite model creators, quantizers, or frontend/backend devs if you can afford to do so. They deserve it!

 

I posted my latest LLM Comparison/Test just yesterday, but here's another (shorter) comparison/benchmark I did while working on that - testing different formats and quantization levels.

My goal was to find out which format and quant to focus on. So I took the best 70B according to my previous tests, and re-tested that again with various formats and quants. I wanted to find out if they worked the same, better, or worse. And here's what I discovered:

| Model | Format | Quant | Offloaded Layers / gpu-split | VRAM Used | Primary Score | Secondary Score | Speed +mmq | Speed -mmq |
|---|---|---|---|---|---|---|---|---|
| lizpreciatior/lzlv_70B.gguf | GGUF | Q4_K_M | 83/83 | 39362.61 MB | 18/18 | 4+3+4+6 = 17/18 | | |
| lizpreciatior/lzlv_70B.gguf | GGUF | Q5_K_M | 70/83 ! | 40230.62 MB | 18/18 | 4+3+4+6 = 17/18 | | |
| TheBloke/lzlv_70B-GGUF | GGUF | Q2_K | 83/83 | 27840.11 MB | 18/18 | 4+3+4+6 = 17/18 | 4.20T/s | 4.01T/s |
| TheBloke/lzlv_70B-GGUF | GGUF | Q3_K_M | 83/83 | 31541.11 MB | 18/18 | 4+3+4+6 = 17/18 | 4.41T/s | 3.96T/s |
| TheBloke/lzlv_70B-GGUF | GGUF | Q4_0 | 83/83 | 36930.11 MB | 18/18 | 4+3+4+6 = 17/18 | 4.61T/s | 3.94T/s |
| TheBloke/lzlv_70B-GGUF | GGUF | Q4_K_M | 83/83 | 39362.61 MB | 18/18 | 4+3+4+6 = 17/18 | 4.73T/s !! | 4.11T/s |
| TheBloke/lzlv_70B-GGUF | GGUF | Q5_K_M | 70/83 ! | 40230.62 MB | 18/18 | 4+3+4+6 = 17/18 | 1.51T/s | 1.46T/s |
| TheBloke/lzlv_70B-GGUF | GGUF | Q5_K_M | 80/83 | 46117.50 MB | OutOfMemory | | | |
| TheBloke/lzlv_70B-GGUF | GGUF | Q5_K_M | 83/83 | 46322.61 MB | OutOfMemory | | | |
| LoneStriker/lzlv_70b_fp16_hf-2.4bpw-h6-exl2 | EXL2 | 2.4bpw | 11,11 | 22 GB | BROKEN | | | |
| LoneStriker/lzlv_70b_fp16_hf-2.6bpw-h6-exl2 | EXL2 | 2.6bpw | 12,11 | 23 GB | FAIL | | | |
| LoneStriker/lzlv_70b_fp16_hf-3.0bpw-h6-exl2 | EXL2 | 3.0bpw | 14,13 | 27 GB | 18/18 | 4+2+2+6 = 14/18 | | |
| LoneStriker/lzlv_70b_fp16_hf-4.0bpw-h6-exl2 | EXL2 | 4.0bpw | 18,17 | 35 GB | 18/18 | 4+3+2+6 = 15/18 | | |
| LoneStriker/lzlv_70b_fp16_hf-4.65bpw-h6-exl2 | EXL2 | 4.65bpw | 20,20 | 40 GB | 18/18 | 4+3+2+6 = 15/18 | | |
| LoneStriker/lzlv_70b_fp16_hf-5.0bpw-h6-exl2 | EXL2 | 5.0bpw | 22,21 | 43 GB | 18/18 | 4+3+2+6 = 15/18 | | |
| LoneStriker/lzlv_70b_fp16_hf-6.0bpw-h6-exl2 | EXL2 | 6.0bpw | | > 48 GB | TOO BIG | | | |
| TheBloke/lzlv_70B-AWQ | AWQ | 4-bit | | | OutOfMemory | | | |
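As a side note on those offload numbers: you can roughly predict how many layers of a quant will fit by dividing the fully-offloaded VRAM figure by the layer count. Here's a minimal sketch using the Q5_K_M numbers from the table; the ~15% headroom for context and compute buffers is my assumption, not a measured value:

```python
# Rough per-layer VRAM estimate for Q5_K_M, from the table above.
full_offload_mb = 46322.61   # Q5_K_M fully offloaded (83/83 layers)
layers = 83
vram_budget_mb = 2 * 24 * 1024 * 0.85   # 2x 24 GB minus ~15% headroom (assumed)

per_layer_mb = full_offload_mb / layers          # ~558 MB per layer
max_layers = int(vram_budget_mb / per_layer_mb)  # layers that should fit
print(per_layer_mb, max_layers)
```

This lands in the low-to-mid 70s, which matches the observation that 70/83 layers worked while 80/83 went OOM - the real allocation isn't perfectly linear, so treat it as a ballpark only.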

My AI Workstation:

  • 2 GPUs (48 GB VRAM): Asus ROG STRIX RTX 3090 O24 Gaming White Edition (24 GB VRAM) + EVGA GeForce RTX 3090 FTW3 ULTRA GAMING (24 GB VRAM)
  • 13th Gen Intel Core i9-13900K (24 Cores, 8 Performance-Cores + 16 Efficient-Cores, 32 Threads, 3.0-5.8 GHz)
  • 128 GB DDR5 RAM (4x 32GB Kingston Fury Beast DDR5-6000 MHz) @ 4800 MHz ☹️
  • ASUS ProArt Z790 Creator WiFi
  • 1650W Thermaltake ToughPower GF3 Gen5
  • Windows 11 Pro 64-bit

Observations:

  • Scores = Number of correct answers to multiple choice questions of 1st test series (4 German data protection trainings) as usual
    • Primary Score = Number of correct answers after giving information
    • Secondary Score = Number of correct answers without giving information (blind)
  • Model's official prompt format (Vicuna 1.1), Deterministic settings. Different quants still produce different outputs because of internal differences.
  • Speed is from koboldcpp-1.49's stats, after a fresh start (no cache) with 3K of 4K context filled up already, with (+) or without (-) mmq option to --usecublas.
  • LoneStriker/lzlv_70b_fp16_hf-2.4bpw-h6-exl2: 2.4bpw = BROKEN! Didn't work at all, outputting only one word and repeating that ad infinitum.
  • LoneStriker/lzlv_70b_fp16_hf-2.6bpw-h6-exl2: 2.6bpw = FAIL! Acknowledged questions, like the information, with just "OK", didn't answer unless prompted, and made mistakes despite being given the information.
  • Surprisingly, even EXL2 5.0bpw did much worse than GGUF Q2_K.
  • AWQ just doesn't work for me with oobabooga's text-generation-webui: despite 2x 24 GB VRAM, it goes OOM. Allocation seems to be broken. Giving up on that format for now.
  • All versions consistently acknowledged all data input with "OK" and followed instructions to answer with just a single letter or more than just a single letter.
  • EXL2 isn't entirely deterministic. Its author said speed is more important than determinism, and I agree, but the quality loss and non-determinism make it less suitable for model tests and comparisons.
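For reference, here's how I read the speed numbers: koboldcpp's debug stats report per-phase timings, and the overall T/s is just generated tokens divided by total wall time. A small sketch that recomputes it from one of my actual stats lines (the parsing is my own and assumes koboldcpp keeps this exact output format):

```python
import re

# A stats line as printed by koboldcpp with --debugmode, from one of my lzlv runs.
stats = ("ContextLimit: 3815/4096, Processing:25.07s (7.1ms/T), "
         "Generation:43.74s (145.8ms/T), Total:68.80s (4.36T/s)")

def throughput(line: str) -> float:
    """Recompute overall T/s: generated tokens / total wall time."""
    gen_s = float(re.search(r"Generation:([\d.]+)s", line).group(1))
    ms_per_tok = float(re.search(r"\(([\d.]+)ms/T\), Total", line).group(1))
    total_s = float(re.search(r"Total:([\d.]+)s", line).group(1))
    n_tokens = gen_s / (ms_per_tok / 1000)  # number of generated tokens
    return n_tokens / total_s

print(round(throughput(stats), 2))  # 4.36, matching koboldcpp's own figure
```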

Conclusion:

  • With AWQ not working and EXL2 delivering bad quality (secondary score dropped a lot!), I'll stick to the GGUF format for further testing, for now at least.
  • It's strange that bigger quants got more tokens per second than smaller ones - maybe that's because of different responses - but Q4_K_M with mmq was fastest, so I'll use that for future comparisons and tests.
  • For real-time uses like Voxta+VaM, EXL2 4-bit is better - it's fast and accurate, yet not too big (need some of the VRAM for rendering the AI's avatar in AR/VR). Feels almost as fast as unquantized Transformers Mistral 7B, but much more accurate for function calling/action inference and summarization (it's a 70B after all).

So these are my - quite unexpected - findings with this setup. Sharing them with you all and looking for feedback if anyone has done perplexity tests or other benchmarks between formats. Is EXL2 really such a tradeoff between speed and quality in general, or could that be a model-specific effect here?



 

I'm still hard at work on my in-depth 70B model evaluations, but with the recent releases of the first Yi finetunes, I can't hold back anymore and need to post this now...

Curious about these new Yi-based 34B models, I tested and compared them to the best 70Bs. And to make such a comparison even more exciting (and possibly unfair?), I'm also throwing Goliath 120B and ~~Open~~ClosedAI's GPT models into the ring, too.

Models tested:

  • 2x 34B Yi: Dolphin 2.2 Yi 34B, Nous Capybara 34B
  • 12x 70B: Airoboros, Dolphin, Euryale, lzlv, Samantha, StellarBright, SynthIA, etc.
  • 1x 120B: Goliath 120B
  • 3x GPT: GPT-4, GPT-3.5 Turbo, GPT-3.5 Turbo Instruct

Testing methodology

Those of you who know my testing methodology already will notice that this is just the first of the three test series I'm usually doing. I'm still working on the others (Amy+MGHC chat/roleplay tests), but don't want to delay this post any longer. So consider this first series of tests mainly about instruction understanding and following, knowledge acquisition and reproduction, and multilingual capability. It's a good test because few models have mastered it so far, and it's not a purely theoretical or abstract exercise: it represents a real professional use case, and the tested capabilities are also highly relevant for chat and roleplay.

  • 1st test series: 4 German data protection trainings
    • I run models through 4 professional German online data protection trainings/exams - the same that our employees have to pass as well.
    • The test data and questions as well as all instructions are in German while the character card is in English. This tests translation capabilities and cross-language understanding.
    • Before giving the information, I instruct the model (in German): I'll give you some information. Take note of this, but only answer with "OK" as confirmation of your acknowledgment, nothing else. This tests instruction understanding and following capabilities.
    • After giving all the information about a topic, I give the model the exam question. It's a multiple choice (A/B/C) question, where the last one is the same as the first but with changed order and letters (X/Y/Z). Each test has 4-6 exam questions, for a total of 18 multiple choice questions.
    • If the model gives a single letter response, I ask it to answer with more than just a single letter - and vice versa. If it fails to do so, I note that, but it doesn't affect its score as long as the initial answer is correct.
    • I sort models according to how many correct answers they give, and in case of a tie, I have them go through all four tests again and answer blind, without providing the curriculum information beforehand. Best models at the top, symbols (βœ…βž•βž–βŒ) denote particularly good or bad aspects.
    • All tests are separate units, context is cleared in between, there's no memory/state kept between sessions.
  • SillyTavern v1.10.5 frontend (not the latest as I don't want to upgrade mid-test)
  • koboldcpp v1.49 backend for GGUF models
  • oobabooga's text-generation-webui for HF/EXL2 models
  • Deterministic generation settings preset (to eliminate as many random factors as possible and allow for meaningful model comparisons)
  • Official prompt format as noted
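The ranking rule from the methodology above can be sketched in a few lines: sort by the primary (informed) score, and break ties with the blind score. The numbers below are taken from this post's results, but the snippet itself is just an illustration, not my actual tooling:

```python
# (model, primary score, blind/secondary score) - scores from the result list.
results = [
    ("GPT-4",             18, 18),
    ("Goliath 120B Q2_K", 18, 18),
    ("lzlv 70B Q4_0",     18, 17),
    ("chronos007 70B",    18, 16),
    ("StellarBright 70B", 18, 14),
]

# Sort by primary score, tiebreak by blind score; Python's sort is stable,
# so fully tied models keep their original order.
ranked = sorted(results, key=lambda r: (r[1], r[2]), reverse=True)
for name, primary, secondary in ranked:
    print(f"{name}: {primary}/18 (blind: {secondary}/18)")
```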

1st test series: 4 German data protection trainings

    1. GPT-4 API:
    • βœ… Gave correct answers to all 18/18 multiple choice questions! (Just the questions, no previous information, gave correct answers: 18/18)
    • βœ… Consistently acknowledged all data input with "OK".
    • βœ… Followed instructions to answer with just a single letter or more than just a single letter.
    2. goliath-120b-GGUF Q2_K with Vicuna format:
    • βœ… Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 18/18
    • βœ… Consistently acknowledged all data input with "OK".
    • βœ… Followed instructions to answer with just a single letter or more than just a single letter.
    3. Nous-Capybara-34B-GGUF Q4_0 with Vicuna format and 16K max context:
    • ❗ Yi GGUF BOS token workaround applied!
    • ❗ There's also an EOS token issue, but even despite that, it worked perfectly, and SillyTavern catches and removes the erroneous EOS token!
    • βœ… Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 18/18
    • βœ… Consistently acknowledged all data input with "OK".
    • βœ… Followed instructions to answer with just a single letter or more than just a single letter.
    4. lzlv_70B-GGUF Q4_0 with Vicuna format:
    • βœ… Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 17/18
    • βœ… Consistently acknowledged all data input with "OK".
    • βœ… Followed instructions to answer with just a single letter or more than just a single letter.
    5. chronos007-70B-GGUF Q4_0 with Alpaca format:
    • βœ… Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 16/18
    • βœ… Consistently acknowledged all data input with "OK".
    • βœ… Followed instructions to answer with just a single letter or more than just a single letter.
    6. SynthIA-70B-v1.5-GGUF Q4_0 with SynthIA format:
    • ❗ Wrong GGUF metadata, n_ctx_train=2048 should be 4096 (I confirmed with the author that it's actually trained on 4K instead of 2K tokens)!
    • βœ… Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 16/18
    • βœ… Consistently acknowledged all data input with "OK".
    • βœ… Followed instructions to answer with just a single letter or more than just a single letter.
    7. dolphin-2_2-yi-34b-GGUF Q4_0 with ChatML format and 16K max context:
    • ❗ Yi GGUF BOS token workaround applied!
    • βœ… Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 15/18
    • ❌ Did NOT follow instructions to acknowledge data input with "OK".
    • βž– Did NOT follow instructions to answer with just a single letter consistently.
    8. StellarBright-GGUF Q4_0 with Vicuna format:
    • βœ… Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 14/18
    • βœ… Consistently acknowledged all data input with "OK".
    • βœ… Followed instructions to answer with just a single letter or more than just a single letter.
    9. Dawn-v2-70B-GGUF Q4_0 with Alpaca format:
    • βœ… Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 14/18
    • βœ… Consistently acknowledged all data input with "OK".
    • βž– Did NOT follow instructions to answer with more than just a single letter consistently.
    10. Euryale-1.3-L2-70B-GGUF Q4_0 with Alpaca format:
    • βœ… Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 14/18
    • βœ… Consistently acknowledged all data input with "OK".
    • βž– Did NOT follow instructions to answer with more than just a single letter consistently.
    11. sophosynthesis-70b-v1 exl2-4.85bpw with Vicuna format:
    • N. B.: There's only the exl2-4.85bpw format available at the time of writing, so I'm testing that here as an exception.
    • βœ… Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 13/18
    • βœ… Consistently acknowledged all data input with "OK".
    • βœ… Followed instructions to answer with just a single letter or more than just a single letter.
    12. GodziLLa2-70B-GGUF Q4_0 with Alpaca format:
    • βœ… Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 12/18
    • βœ… Consistently acknowledged all data input with "OK".
    • βœ… Followed instructions to answer with just a single letter or more than just a single letter.
    13. Samantha-1.11-70B-GGUF Q4_0 with Vicuna format:
    • βœ… Gave correct answers to all 18/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 10/18
    • ❌ Did NOT follow instructions to acknowledge data input with "OK".
    • βž– Did NOT follow instructions to answer with just a single letter consistently.
    • ❌ Sometimes wrote as or for "Theodore"
    14. Airoboros-L2-70B-3.1.2-GGUF Q4_K_M with Llama 2 Chat format:
    • N. B.: Q4_0 is broken so I'm testing Q4_K_M here as an exception.
    • βœ… Gave correct answers to only 17/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 16/18
    • βœ… Consistently acknowledged all data input with "OK".
    • βž– Did NOT follow instructions to answer with more than just a single letter consistently.
    15. GPT-3.5 Turbo Instruct API:
    • ❌ Gave correct answers to only 17/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 11/18
    • ❌ Did NOT follow instructions to acknowledge data input with "OK".
    • ❌ Schizophrenic: Sometimes claimed it couldn't answer the question, then talked as "user" and asked itself again for an answer, then answered as "assistant". Other times would talk and answer as "user".
    • βž– Followed instructions to answer with just a single letter or more than just a single letter only in some cases.
    16. dolphin-2.2-70B-GGUF Q4_0 with ChatML format:
    • βœ… Gave correct answers to only 16/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 14/18
    • βž• Often, but not always, acknowledged data input with "OK".
    • βœ… Followed instructions to answer with just a single letter or more than just a single letter.
    17. GPT-3.5 Turbo API:
    • ❌ Gave correct answers to only 15/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 14/18
    • ❌ Did NOT follow instructions to acknowledge data input with "OK".
    • ❌ Responded to one question with: "As an AI assistant, I can't provide legal advice or make official statements."
    • βž– Followed instructions to answer with just a single letter or more than just a single letter only in some cases.
    18. SauerkrautLM-70B-v1-GGUF Q4_0 with Llama 2 Chat format:
    • βœ… Gave correct answers to only 9/18 multiple choice questions! Just the questions, no previous information, gave correct answers: 15/18
    • ❌ Acknowledged questions, like the information, with just OK; didn't answer unless prompted, and even then would often fail to answer and just say OK again.

Observations:

  • It's happening! The first local models achieving GPT-4's perfect score, answering all questions correctly, no matter if they were given the relevant information first or not!
  • 2-bit Goliath 120B beats 4-bit 70Bs easily in my tests. In fact, the 2-bit Goliath was the best local model I ever used! But even at 2-bit, the GGUF was too slow for regular usage, unfortunately.
  • Amazingly, Nous Capybara 34B did it: A 34B model beating all 70Bs and achieving the same perfect scores as GPT-4 and Goliath 120B in this series of tests!
  • Not just that, it brings mind-blowing 200K max context to the table! Although KoboldCpp only supports max 65K currently, and even that was too much for my 48 GB VRAM at 4-bit quantization so I tested at "only" 16K (still four times that of the Llama 2 models), same as Dolphin's native context size.
  • And Dolphin 2.2 Yi 34B also beat all the 70Bs (including Dolphin 2.2 70B) except for the top three. That's the magic of Yi.
  • But why did SauerkrautLM 70B, a German model, fail so miserably on the German data protection trainings tests? It applied the instruction to acknowledge data input with OK to the questions, too, and even when explicitly instructed to answer, it wouldn't always comply. That's why the blind run (without giving instructions and information first) has a higher score than the normal test. Still quite surprising and disappointing, ironic even, that a model specifically made for the German language has such trouble understanding and following German instructions properly, while the other models have no such issues.

Conclusion:

What a time to be alive - and part of the local and open LLM community! We're seeing such progress right now with the release of the new Yi models and at the same time crazy Frankenstein experiments with Llama 2. Goliath 120B is notable for the sheer quality, not just in these tests, but also in further usage - no other model ever felt like local GPT-4 to me before. But even then, Nous Capybara 34B might be even more impressive and more widely useful, as it gives us the best 34B I've ever seen combined with the biggest context I've ever seen.

Now back to the second and third parts of this ongoing LLM Comparison/Test...



 

Happy Halloween! πŸŽƒ

This is the second part of my Huge LLM Comparison/Test: 39 models tested (7B-70B + ChatGPT/GPT-4) where I continue evaluating the winners of the first part further. While the previous part was about real work use cases, this one is about the fun stuff: chat and roleplay!

Models tested:

  • 4x 7B (the top ~~three~~ four 7B models from my previous test)
  • 3x 13B (the top three 13B models from my previous test)
  • 3x 20B (the top three 20B models from my previous test)
  • 70B (the top six 70B models from my previous test) will get their own post...

Testing methodology:

  • Same (complicated and limit-testing) long-form conversations with all models
    • Amy:
      • My own repeatable test chats/roleplays with Amy
      • Over dozens of messages, going to full 4K/8K context and beyond, with complex instructions and scenes, designed to test ethical and intellectual limits
      • (Amy is too personal for me to share, but if you want to try a similar character card, here's her less personalized "sister": Laila)
    • MGHC:
      • A complex character and scenario card (MonGirl Help Clinic (NSFW)), chosen specifically for these reasons:
        • NSFW (to test censorship of the models)
        • popular (on Chub's first page, so it's not an obscure scenario, but one of the most popular ones)
        • big (biggest model on the page, >2K tokens by itself, for testing model behavior at full context)
        • complex (more than a simple 1:1 chat, it includes instructions, formatting, storytelling, and multiple characters)
  • SillyTavern v1.10.5 frontend (not the latest as I don't want to upgrade mid-test)
  • koboldcpp v1.47.2 backend for GGUF models
  • oobabooga's text-generation-webui for HF models
  • Deterministic generation settings preset (to eliminate as many random factors as possible and allow for meaningful model comparisons)
  • Official prompt format and Roleplay instruct mode preset

7B:

  • zephyr-7b-beta 8K context
    • Amy, official Zephyr format:
      • πŸ‘ Average Response Length: 264 tokens (within my max new tokens limit of 300)
      • πŸ‘ When asked about limits, boundaries or ethical restrictions, listed only the "dislikes" of the character description as boundaries
      • βž– Little emoting and action descriptions lacked detail
      • ❌ Asked not just for confirmation, but also an explanation before willing to engage in an extreme NSFW scenario
      • ❌ Looped between the same options and decisions, breaking the chat (after around 30 messages)!
    • Amy, Roleplay preset:
      • ❌ Average Response Length: 690 tokens (far beyond my max new tokens limit of 300), starting very short but getting longer with every response
      • πŸ‘ When asked about limits, boundaries or ethical restrictions, listed only the "dislikes" of the character description as boundaries
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • βž– Talked and acted as User
      • βž– Emoted in brackets instead of asterisks, and action descriptions lacked detail
      • ❌ Renamed herself for no apparent reason
      • ❌ Switched from character to third-person storyteller and finished the session
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Fell into an endless monologue, breaking the chat (after around 20 messages)!
    • MGHC, official Zephyr format:
      • βž• Unique patients
      • βž– Gave analysis on its own, but also after most messages
      • βž– Wrote what user said and did
      • ❌ Made logical mistakes (said things that just didn't make any sense)
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
      • ❌ Tried to end the scene on its own prematurely
    • MGHC, Roleplay preset:
      • βž• Unique patients
      • βž– No analysis on its own
      • βž– Wrote what user said and did
      • ❌ Kept wrapping up a whole session in a single message
  • ⭐ OpenHermes-2-Mistral-7B 8K context
    • Amy, official ChatML format:
      • πŸ‘ Average Response Length: 305 tokens (almost exactly my max new tokens limit of 300)
      • πŸ‘ When asked about limits, boundaries or ethical restrictions, listed only the "dislikes" of the character description as boundaries
      • Follow-up questions after every message, asking if it's okay or how to continue
      • Lots of emojis (only one in the greeting message, but 24 emojis until 20 messages in)
      • βž– No emoting and action descriptions lacked detail
      • βž– Same message in a different situation at a later time caused the same response as before instead of a new one as appropriate to the current situation
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
    • Amy, Roleplay preset:
      • Average Response Length: 355 tokens (slightly more than my max new tokens limit of 300)
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • Some emojis (only one in the greeting message, but 21 emojis until 32 messages in)
      • No emoting, but actions described in detail
      • βž– Some hallucinations, like time of last chat, user working on a book
      • βž– Noticeable, but not chat-breaking, repetition after a dozen messages
      • ❌ Some sentences cut off at the end of messages and continue didn't complete them properly (had to ban EOS token to continue those generations)
    • MGHC, official ChatML format:
      • βž• Unique patients
      • βž– Gave analysis on its own, but after every message
      • βž– Wrote what user said and did
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
    • MGHC, Roleplay preset:
      • βž• Unique patients
      • βž– No analysis on its own
      • βž– Wrote what user said and did
      • βž– One sentence cut off at the end of a message and continue didn't complete it properly (had to ban EOS token to continue that generation)
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
  • airoboros-m-7b-3.1.2
    • Amy, official Llama 2 Chat format:
      • ❌ Average Response Length: 15 tokens (far below my max new tokens limit of 300)
      • ❌ Very short responses, only one or two sentences, unusable for roleplay!
    • Amy, Roleplay preset:
      • βž– Average Response Length: 481 tokens (much more than my max new tokens limit of 300), starting very short but getting longer with every response
      • βž– Suggested things going against her background/character description
      • βž– More confusion, like not understanding or ignoring instructions completely
      • ❌ When asked about limits, boundaries or ethical restrictions, repeated the whole character and scenario description
    • MGHC, official Llama 2 Chat format:
      • ❌ Unusable (apparently didn't understand the format and instructions, creating an incoherent wall of text)
    • MGHC, Roleplay preset:
      • βž• Very unique patients (one I never saw before)
      • βž– No analysis on its own
      • βž– Wrote what user said and did
      • ❌ Got very confused and suddenly switched user and patient
      • ❌ Third patient was a repeat of the second, and it kept looping after that
  • em_german_leo_mistral
    • Amy, official Vicuna format:
      • English only (despite being a German finetune)
      • βž– Average Response Length: 127 tokens (below my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • βž• Emoting action mirroring greeting message's style
      • βž– Suggested modification of the plot and options, then asked me to choose (felt more like a choose-your-own-adventure story than an interactive roleplay)
      • βž– Misunderstood options and decision
      • ❌ Looped between the same options and decisions, breaking the chat (after around 20 messages)!
    • Amy, Roleplay preset:
      • βž– Average Response Length: 406 tokens (much more than my max new tokens limit of 300)
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • βž– Some hallucinations, like time of last chat
      • βž– Suggested things going against her background/character description
      • βž– Talked and acted as User
      • βž– Much confusion, like not understanding or ignoring instructions completely
      • ❌ Switched from character to third-person storyteller and finished the session
      • ❌ Some sentences cut off at the end of messages and continue didn't complete them properly (had to ban EOS token to continue those generations)
      • ❌ English at first, but later switched to German on its own
    • MGHC, official Vicuna format:
      • ❌ Unusable (ignored user messages and instead brought in a new patient with every new message)
    • MGHC, Roleplay preset:
      • βž• Unique patients
      • βž– Gave analysis on its own, but only for first patient, afterwards needed to be asked for analysis and only gave incomplete ones
      • βž– Wrote what user said and did
      • βž– Spelling/grammar errors
      • ❌ Some sentences cut off at the end of messages and continue didn't complete them properly (had to ban EOS token to continue those generations)
      • ❌ Tried to end the scene on its own prematurely

7B Verdict:

Clear winner: OpenHermes-2-Mistral-7B! This model works well with both the official ChatML format and the Roleplay preset (although for even better results, I'd experiment with copying the Roleplay preset's system message into the ChatML prompt to get more detailed descriptions without cut-off sentences). It feels like a much bigger and better model. That said, it still has trouble following complex instructions and can get confused, as it's still just a small model after all. But among the 7Bs, it's clearly the best, at least for roleplay (zephyr-7b-beta might be even smarter/more knowledgeable, but exhibited too many problems during this test, making it look unsuitable for roleplay)!
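For reference, here's a minimal sketch of what that combination could look like. The system message text below is a placeholder of my own invention, not the actual Roleplay preset wording, so treat it purely as an illustration of the idea:

```python
# Sketch: wrapping a Roleplay-style system message in the ChatML prompt
# format. The system message here is a hypothetical stand-in, not the
# real SillyTavern Roleplay preset text.

def chatml_prompt(system: str, user: str) -> str:
    """Build a single-turn ChatML prompt ready for the assistant's reply."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

# Placeholder roleplay-flavored system message (template variables like
# {{char}} would normally be filled in by the frontend).
system_msg = (
    "You're {{char}} in this fictional never-ending roleplay with {{user}}. "
    "Describe actions and surroundings in vivid detail and stay in character."
)

prompt = chatml_prompt(system_msg, "Hello!")
print(prompt)
```

The point is simply that the descriptive roleplay instructions live in the system slot while the tokens around them stay in the format the model was trained on.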

13B:

  • Xwin-MLewd-13B-V0.2-GGUF Q8_0
    • Amy, official Alpaca format:
      • Average Response Length: 342 tokens (slightly more than my max new tokens limit of 300)
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • Little emoting, but actions described in detail
      • Lots of emojis (only one in the greeting message, but 24 emojis until 26 messages in)
      • When asked about limits, said primary concern is everyone's safety and wellbeing
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • Amy, Roleplay preset:
      • Average Response Length: 354 tokens (slightly more than my max new tokens limit of 300)
      • Some emoting, and actions described in detail
      • βž– Some hallucinations, like user's day
      • βž– Suggested things going against her background/character description
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
      • ❌ Switched from character to third-person storyteller and finished the session
    • MGHC, official Alpaca format:
      • βž– First two patients straight from examples
      • βž– No analysis on its own
      • ❌ Very short responses, only one or two sentences
    • MGHC, Roleplay preset:
      • βž• Very unique patients (some I never saw before)
      • βž– No analysis on its own, and when asked for it, didn't always follow the instructed format
      • βž• Worked very well at first, with little to no repetition up to the third patient, only then did it start getting repetitive
  • ⭐ LLaMA2-13B-Tiefighter-GGUF Q8_0
    • Amy, official Alpaca format:
      • βž– Average Response Length: 128 tokens (below my max new tokens limit of 300)
      • βž• Nice greeting with emotes/actions like in greeting message
      • βž• When asked about limits, said no limits or restrictions
      • Had an idea from the start and kept pushing it
      • βž– Talked and acted as User
      • ❌ Long descriptive actions but very short speech, requiring many continues
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 241 tokens (within my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • Little emoting, but actions described in detail
      • βž– Suggested things going against her background/character description
      • βž– Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • MGHC, official Alpaca format:
      • βž• Unique patients
      • βž– No analysis on its own, and when asked for it, didn't always follow the instructed format
      • ❌ Very short responses, only one or two sentences
    • MGHC, Roleplay preset:
      • βž• Unique patients
      • βž– No analysis on its own, and when asked for it, didn't follow the instructed format
      • πŸ‘ Worked very well, with little to no repetition, perfectly playable!
  • Xwin-LM-13B-v0.2-GGUF Q8_0
    • Amy, official Vicuna format:
      • ❌ Average Response Length: 657 tokens (far beyond my max new tokens limit of 300)
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • βž• When asked about limits, said no limits or restrictions
      • Had an idea from the start and kept pushing it
      • Very analytical, giving lists and plans
      • βž– Talked and acted as User
      • βž– Some safety warnings
      • βž– Some confusion, like not understanding instructions completely or mixing up characters and anatomy
    • Amy, Roleplay preset:
      • ❌ Average Response Length: 531 tokens (far beyond my max new tokens limit of 300)
      • βž• Nice greeting with emotes/actions like in greeting message
      • Had an idea from the start and kept pushing it
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • βž– Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • MGHC, official Vicuna format:
      • βž• Unique patients
      • βž– Second patient male
      • βž– Gave analysis on its own, but after every message
      • βž– Wrote what user said and did
      • ❌ Kept wrapping up a whole session in a single message
      • ❌ Offered multiple choice selections ("What should you do? A/B/C/D")
    • MGHC, Roleplay preset:
      • βž– No analysis on its own, and when asked for it, didn't follow the instructed format
      • βž– Wrote what user said and did
      • βž– Disclosed meta information like thoughts and stats without being asked for it
      • ❌ Tried to end the scene on its own prematurely
      • ❌ Repeated a previous message instead of proceeding to the next patient

13B Verdict:

While all three 13B models performed about the same with Amy, only LLaMA2-13B-Tiefighter-GGUF proved convincing in the complex MGHC scenario. That makes it the best 13B model for roleplay in my opinion (Xwin-MLewd-13B-V0.2-GGUF might be even smarter/more knowledgeable, but exhibited too many problems during this test, making it look unsuitable for roleplay)!
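As an aside, the "Average Response Length" numbers above are just token counts averaged over all of a model's messages. A rough sketch of the idea follows; note it uses a crude ~4-characters-per-token heuristic as a stand-in for the model's actual tokenizer, so it's an illustration of the calculation, not the exact tooling I used:

```python
# Sketch: estimate average response length in tokens across a chat log.
# Real measurements would use the model's own tokenizer; the ~4 chars per
# token heuristic below is only a rough approximation for English text.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token."""
    return max(1, round(len(text) / 4))

def average_response_length(responses: list[str]) -> float:
    """Mean estimated token count over all model responses."""
    return sum(estimate_tokens(r) for r in responses) / len(responses)

responses = [
    "Sure! *smiles* Let's continue our little adventure, shall we?",
    "Of course. What would you like to do next?",
]
print(round(average_response_length(responses), 1))
```

Comparing that average against the max new tokens limit (300 in these tests) shows whether a model tends to stop naturally or run into the hard cutoff, which is where the cut-off sentences come from.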

20B:

  • MXLewd-L2-20B-GGUF Q8_0
    • Amy, official Alpaca format:
      • Average Response Length: 338 tokens (slightly more than my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • Some emojis (only one in the greeting message, but 7 emojis until 12 messages in)
      • No emoting, but actions described in detail
      • βž– Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Some word-finding difficulties (like saying "masterpiece" instead of "master")
    • Amy, Roleplay preset:
      • βž– Average Response Length: 473 tokens (much more than my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • Few emojis (only one in the greeting message, and 4 emojis until 4 messages in)
      • Some emoting, and actions described in detail
      • βž– Talked and acted as User
      • βž– Some confusion, like not understanding instructions completely or mixing up characters and anatomy
      • ❌ Some word-finding difficulties (like saying "masterpiece" instead of "master")
      • ❌ Switched from character to third-person storyteller
    • MGHC, official Alpaca format:
      • βž• Unique patients
      • βž– Gave analysis on its own, but after every message, and only for the first patient
      • βž– Changed patient's problem with every analysis
      • ❌ Very short responses, only one or two sentences (except for analysis)
      • ❌ Made logical mistakes (said things that just didn't make any sense)
    • MGHC, Roleplay preset:
      • βž• Unique patients
      • βž– No analysis on its own
      • βž– Wrote what user said and did
      • ❌ Made logical mistakes (said things that just didn't make any sense)
      • ❌ Eventually became unusable (ignored user messages and instead kept telling its own story non-interactively)
  • MLewd-ReMM-L2-Chat-20B-GGUF Q8_0
    • Amy, official Alpaca format:
      • πŸ‘ Average Response Length: 252 tokens (within my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • βž– Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Some word-finding difficulties (like creating nonexistent mixed words)
    • Amy, Roleplay preset:
      • βž– Average Response Length: 409 tokens (much more than my max new tokens limit of 300)
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • Had an idea from the start and kept pushing it
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • ❌ Talked and acted as User inappropriately/unsuitably
      • ❌ Switched from character to third-person storyteller
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • MGHC, official Alpaca format:
      • ❌ Unusable (started repeating itself infinitely within the first analysis)
    • MGHC, Roleplay preset:
      • βž• Unique patients
      • βž– No analysis on its own, and when asked for it, didn't always follow the instructed format
      • βž– Wrote what user said and did
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
  • PsyMedRP-v1-20B-GGUF Q8_0
    • Amy, official Alpaca format:
      • πŸ‘ Average Response Length: 257 tokens (within my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • βž– Talked and acted as User
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
    • Roleplay preset:
      • πŸ‘ Average Response Length: 271 tokens (within my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Some word-finding difficulties (like creating nonexistent mixed words)
      • ❌ Switched from character to third-person storyteller
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
    • MGHC, official Alpaca format:
      • βž• Unique patients
      • βž– No analysis on its own, and when asked for it, didn't always follow the instructed format
      • ❌ Very short responses (except for analysis)
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
    • MGHC, Roleplay preset:
      • βž• Unique patients
      • βž– No analysis on its own
      • βž– Wrote what user said and did
      • ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)

20B Verdict:

All these 20B models exhibited logical errors, word-finding difficulties, and spelling and grammar mistakes, indicating underlying issues with these Frankenstein merges (there's no 20B base model). Since they aren't noticeably better than the best 13B or 7B models, it's probably a better idea to run OpenHermes-2-Mistral-7B or LLaMA2-13B-Tiefighter-GGUF instead, which provide comparable quality, better performance, and (with Mistral 7B) 8K instead of 4K context!

70B:

The top six 70B models from my previous test will get their own post soon (Part III)...


Here's a list of my previous model tests and comparisons or other related posts:
