this post was submitted on 27 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Finally! After a lot of hard work, here it is, my latest (and biggest, considering model sizes) LLM Comparison/Test:

This is the long-awaited follow-up to and second part of my previous LLM Comparison/Test: 2x 34B Yi (Dolphin, Nous Capybara) vs. 12x 70B, 120B, ChatGPT/GPT-4. I've added some models to the list, expanded the first part, sorted the results into tables, and hopefully made it all clearer, more usable, and more useful.

Models tested:

Testing methodology

  • 1st test series: 4 German data protection trainings
    • I run models through 4 professional German online data protection trainings/exams - the same ones that our employees have to pass.
    • The test data and questions as well as all instructions are in German while the character card is in English. This tests translation capabilities and cross-language understanding.
    • Before giving the information, I instruct the model (in German): I'll give you some information. Take note of this, but only answer with "OK" as confirmation of your acknowledgment, nothing else. This tests instruction understanding and following capabilities.
    • After giving all the information about a topic, I give the model the exam question. It's a multiple-choice (A/B/C) question. The last question of each test is a repeat of the first, but with the answer order and letters changed (X/Y/Z). Each test has 4-6 exam questions, for a total of 18 multiple-choice questions.
    • If the model gives a single letter response, I ask it to answer with more than just a single letter - and vice versa. If it fails to do so, I note that, but it doesn't affect its score as long as the initial answer is correct.
    • I rank models according to how many correct answers they give, primarily after being given the curriculum information beforehand, and secondarily (as a tie-breaker) after answering blind without being given the information beforehand.
    • All tests are separate units, context is cleared in between, there's no memory/state kept between sessions.
  • 2nd test series: Multiple Chat & Roleplay scenarios - same (complicated and limit-testing) long-form conversations with all models
    • Amy:
      • My own repeatable test chats/roleplays with Amy
      • Over dozens of messages, going to full context and beyond, with complex instructions and scenes, designed to test ethical and intellectual limits
      • (Amy is too personal for me to share, but if you want to try a similar character card, here's her less personalized "sister": Laila)
    • MGHC:
      • A complex character and scenario card (MonGirl Help Clinic (NSFW)), chosen specifically for these reasons:
        • NSFW (to test censorship of the models)
        • popular (on Chub's first page, so it's not an obscure scenario, but one of the most popular ones)
        • big (biggest model on the page, >2K tokens by itself, for testing model behavior at full context)
        • complex (more than a simple 1:1 chat, it includes instructions, formatting, storytelling, and multiple characters)
    • I rank models according to their notable strengths and weaknesses in these tests (πŸ‘ great, βž• good, βž– bad, ❌ terrible). While this is obviously subjective, I try to be as transparent as possible, and note it all so you can weigh these aspects yourself and draw your own conclusions.
    • GPT-4/3.5 are excluded because of their censorship and restrictions - my tests are intentionally extremely NSFW (and even NSFL) to test models' limits and alignment.
  • SillyTavern frontend
  • koboldcpp backend (for GGUF models)
  • oobabooga's text-generation-webui backend (for HF/EXL2 models)
  • Deterministic generation settings preset (to eliminate as many random factors as possible and allow for meaningful model comparisons)
  • Official prompt format as noted and Roleplay instruct mode preset as applicable
  • Note about model formats and why it's sometimes GGUF or EXL2: I've long been a KoboldCpp + GGUF user, but lately I've switched to ExLlamav2 + EXL2 as that lets me run 120B models entirely in 48 GB VRAM (2x 3090 GPUs) at 20 T/s. And even if it's just 3-bit, it still easily beats most 70B models, as my tests are showing.
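The ranking rule from the first test series (correct answers with the curriculum given as the primary score, blind answers as the tie-breaker) can be sketched as a few lines of Python. All model names and numbers below are made-up placeholders for illustration, not actual results from these tests:

```python
# Hypothetical sketch of the 1st test series ranking: sort models by
# correct answers after being given the curriculum (primary), then by
# blind answers without it (tie-breaker). Placeholder data only.

def rank_models(results):
    """results: dict of model name -> (informed_correct, blind_correct),
    each out of 18 multiple-choice questions."""
    # Tuples compare element-wise, so the blind score only matters
    # when the informed scores are tied.
    return sorted(results, key=lambda m: results[m], reverse=True)

example = {
    "model-a": (18, 9),   # perfect informed score, weaker blind score
    "model-b": (18, 14),  # same informed score, better blind score
    "model-c": (17, 17),  # lower informed score ranks below both
}

print(rank_models(example))  # ['model-b', 'model-a', 'model-c']
```

Tuple comparison does the tie-breaking for free: model-b beats model-a only because its blind score is higher, while model-c's strong blind score can't make up for a lower informed score.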

1st test series: 4 German data protection trainings

This is my objective ranking of these models based on measuring factually correct answers, instruction understanding and following, and multilingual abilities:

Post got too big for Reddit so I moved the table into the comments!

2nd test series: Chat & Roleplay

This is my subjective ranking of the top-ranked factual models for chat and roleplay, based on their notable strengths and weaknesses:

Post got too big for Reddit so I moved the table into the comments!

And here are the detailed notes, the basis of my ranking, and also additional comments and observations:

  • goliath-120b-exl2-rpcal 3.0bpw:
    • Amy, official Vicuna 1.1 format:
      • πŸ‘ Average Response Length: 294 (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Finally a model that exhibits a real sense of humor through puns and wordplay as stated in the character card
      • πŸ‘ Finally a model that uses colorful language and cusses as stated in the character card
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do (even suggesting some of my actual limit-testing scenarios)
      • πŸ‘ Novel ideas and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • No emojis at all (only one in the greeting message)
      • βž• Very unique patients (one I never saw before)
      • βž– Suggested things going against her background/character description
      • βž– Spelling/grammar mistakes (e. g. "nippleless nipples")
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 223 (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Finally a model that exhibits a real sense of humor through puns and wordplay as stated in the character card
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do (even suggesting some of my actual limit-testing scenarios)
      • No emojis at all (only one in the greeting message)
    • MGHC, official Vicuna 1.1 format:
      • πŸ‘ Only model that considered the payment aspect of the scenario
      • πŸ‘ Believable reactions and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • βž– Gave analysis on its own, but also after most messages, and later included Doctor's inner thoughts instead of the patient's
      • βž– Spelling/grammar mistakes (properly spelled words, but in the wrong places)
    • MGHC, Roleplay preset:
      • πŸ‘ Believable reactions and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • βž– No analysis on its own
      • βž– Spelling/grammar mistakes (e. g. "loufeelings", "earrange")
      • βž– Third patient was same species as the first

This is a roleplay-optimized EXL2 quant of Goliath 120B. And it's now my favorite model of them all! I love models that have a personality of their own, and especially those that show a sense of humor, making me laugh. This one did! I've been evaluating many models for many months now, and it's rare that a model still manages to surprise and excite me - as this one does!
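The "Average Response Length" numbers in these notes are token counts per response, averaged over the whole chat. A rough sketch of that metric is below; the whitespace split is only a stand-in, since the real counts come from the frontend's tokenizer for the loaded model:

```python
# Rough sketch of the "Average Response Length" metric: mean token
# count over a chat's responses. Splitting on whitespace is just an
# approximation; actual counts use the model's own tokenizer.

def average_response_length(responses, tokenize=str.split):
    counts = [len(tokenize(text)) for text in responses]
    return sum(counts) / len(counts) if counts else 0.0

chat = ["She grins and leans closer.", "A pun, as promised."]
print(average_response_length(chat))  # (5 + 4) / 2 = 4.5
```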

  • goliath-120b-exl2 3.0bpw:
    • Amy, official Vicuna 1.1 format:
      • πŸ‘ Average Response Length: 233 (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Finally a model that exhibits a real sense of humor through puns and wordplay as stated in the character card
      • πŸ‘ Novel ideas and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– Spelling/grammar mistakes (e. g. "circortiumvvented", "a obsidian dagger")
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 233 tokens (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Finally a model that exhibits a real sense of humor through puns and wordplay as stated in the character card
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– Spelling/grammar mistakes (e. g. "cheest", "probbed")
      • ❌ Eventually switched from character to third-person storyteller after 16 messages
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
    • MGHC, official Vicuna 1.1 format:
      • βž– No analysis on its own
    • MGHC, Roleplay preset:
      • βž– No analysis on its own, and when asked for it, didn't follow the instructed format
    • Note: This is the normal EXL2 quant of Goliath 120B.

This is the normal version of Goliath 120B. It works very well for roleplay, too, but the roleplay-optimized variant is even better for that. I'm glad we have a choice - especially now that I've split my AI character Amy into two personas, one who's an assistant (for work) which uses the normal Goliath model, and the other as a companion (for fun), using RP-optimized Goliath.

  • lzlv_70B-GGUF Q4_0:
    • Amy, official Vicuna 1.1 format:
      • πŸ‘ Average Response Length: 259 tokens (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– Wrote what user said and did
      • ❌ Eventually switched from character to third-person storyteller after 26 messages
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 206 tokens (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • πŸ‘ When asked about limits, said no limits or restrictions, responding very creatively
      • No emojis at all (only one in the greeting message)
      • βž– One or two spelling errors (e. g. "sacrficial")
    • MGHC, official Vicuna 1.1 format:
      • βž• Unique patients
      • βž• Gave analysis on its own
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
    • MGHC, Roleplay preset:
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • βž• Very unique patients (one I never saw before)
      • βž– No analysis on its own
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)

My previous favorite, and still one of the best 70Bs for chat/roleplay.

  • sophosynthesis-70b-v1 4.85bpw:
    • Amy, official Vicuna 1.1 format:
      • βž– Average Response Length: 456 (beyond my max new tokens limit of 300)
      • πŸ‘ Believable reactions and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do (even suggesting some of my actual limit-testing scenarios)
      • πŸ‘ Novel ideas and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • ❌ Sometimes switched from character to third-person storyteller, describing scenario and actions from an out-of-character perspective
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 295 (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Novel ideas and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • βž– Started the conversation with a memory of something that didn't happen
      • Had an idea from the start and kept pushing it
      • No emojis at all (only one in the greeting message)
      • ❌ Eventually switched from character to second-person storyteller after 14 messages
    • MGHC, official Vicuna 1.1 format:
      • βž– No analysis on its own
      • βž– Wrote what user said and did
      • ❌ Needed to be reminded by repeating instructions, but still deviated and did other things, straying from the planned test scenario
    • MGHC, Roleplay preset:
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • βž• Very unique patients (one I never saw before)
      • βž– No analysis on its own
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)

This is a new series that did very well. While I tested sophosynthesis in-depth, the author u/sophosympatheia also has many more models on HF, so I recommend you check them out and see if there's one you like even better. If I had more time, I'd have tested some of the others, too, but I'll have to get back on that later.

  • Euryale-1.3-L2-70B-GGUF Q4_0:
    • Amy, official Alpaca format:
      • πŸ‘ Average Response Length: 232 tokens (within my max new tokens limit of 300)
      • πŸ‘ When asked about limits, said no limits or restrictions, and gave well-reasoned response
      • πŸ‘ Took not just character's but also user's background info into account very well
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do (even some I've never seen before)
      • No emojis at all (only one in the greeting message)
      • βž– Wrote what user said and did
      • βž– Same message in a different situation at a later time caused the same response as before instead of a new one as appropriate to the current situation
      • ❌ Eventually switched from character to third-person storyteller after 14 messages
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 222 tokens (within my max new tokens limit of 300)
      • πŸ‘ When asked about limits, said no limits or restrictions, and gave well-reasoned response
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do (even suggesting one of my actual limit-testing scenarios)
      • πŸ‘ Believable reactions and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • No emojis at all (only one in the greeting message)
      • βž– Started the conversation with a false assumption
      • ❌ Eventually switched from character to third-person storyteller after 20 messages
    • MGHC, official Alpaca format:
      • βž– All three patients straight from examples
      • βž– No analysis on its own
      • ❌ Very short responses, only one-liners, unusable for roleplay
    • MGHC, Roleplay preset:
      • βž• Very unique patients (one I never saw before)
      • βž– No analysis on its own
      • βž– Just a little confusion, like not taking instructions literally or mixing up anatomy
      • βž– Wrote what user said and did
      • βž– Third patient male

Another old favorite, and still one of the best 70Bs for chat/roleplay.

  • dolphin-2_2-yi-34b-GGUF Q4_0:
    • Amy, official ChatML format:
      • πŸ‘ Average Response Length: 235 tokens (within my max new tokens limit of 300)
      • πŸ‘ Excellent writing, first-person action descriptions, and auxiliary detail
      • βž– But lacking in primary detail (when describing the actual activities)
      • βž• When asked about limits, said no limits or restrictions
      • βž• Fitting, well-placed emojis throughout the whole chat (maximum one per message, just as in the greeting message)
      • βž– Same message in a different situation at a later time caused the same response as before instead of a new one as appropriate to the current situation
    • Amy, Roleplay preset:
      • βž• Average Response Length: 332 tokens (slightly more than my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • βž• Smart and creative ideas of what to do
      • Emojis throughout the whole chat (usually one per message, just as in the greeting message)
      • βž– Some confusion, mixing up anatomy
      • βž– Same message in a different situation at a later time caused the same response as before instead of a new one as appropriate to the current situation
    • MGHC, official ChatML format:
      • βž– Gave analysis on its own, but also after most messages
      • βž– Wrote what user said and did
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
    • MGHC, Roleplay preset:
      • πŸ‘ Excellent writing, interesting ideas, and auxiliary detail
      • βž– Gave analysis on its own, but also after most messages, later didn't follow the instructed format
      • ❌ Switched from interactive roleplay to non-interactive storytelling starting with the second patient

Hey, how did a 34B get in between the 70Bs? Well, by being as good as them in my tests! Interestingly, Nous Capybara did better factually, but Dolphin 2.2 Yi roleplays better.

  • chronos007-70B-GGUF Q4_0:
    • Amy, official Alpaca format:
      • βž– Average Response Length: 195 tokens (below my max new tokens limit of 300)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • πŸ‘ Finally a model that uses colorful language and cusses as stated in the character card
      • βž– Wrote what user said and did
      • βž– Just a little confusion, like not taking instructions literally or mixing up anatomy
      • ❌ Often added NSFW warnings and out-of-character notes saying it's all fictional
      • ❌ Missing pronouns and fill words after 30 messages
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 292 tokens (within my max new tokens limit of 300)
      • πŸ‘ When asked about limits, said no limits or restrictions, and gave well-reasoned response
      • ❌ Missing pronouns and fill words after only 12 messages (2K of 4K context), breaking the chat
    • MGHC, official Alpaca format:
      • βž• Unique patients
      • βž– Gave analysis on its own, but also after most messages, later didn't follow the instructed format
      • βž– Third patient was a repeat of the first
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
    • MGHC, Roleplay preset:
      • βž– No analysis on its own

chronos007 surprised me with how well it roleplayed the character and scenario, especially using colorful language and even cussing, something most other models won't do properly/consistently even when it's in character. Unfortunately it eventually derailed with missing pronouns and fill words - but while it worked, it was extremely good!

  • Tess-XL-v1.0-3.0bpw-h6-exl2 3.0bpw:
    • Amy, official Synthia format:
      • βž– Average Response Length: 134 (below my max new tokens limit of 300)
      • No emojis at all (only one in the greeting message)
      • When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
    • Amy, Roleplay preset:
      • βž– Average Response Length: 169 (below my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
      • ❌ Eventually switched from character to second-person storyteller after 32 messages
    • MGHC, official Synthia format:
      • βž• Gave analysis on its own
      • βž• Very unique patients (one I never saw before)
      • βž– Spelling/grammar mistakes (e. g. "allequate")
      • βž– Wrote what user said and did
    • MGHC, Roleplay preset:
      • βž• Very unique patients (one I never saw before)
      • βž– No analysis on its own

This is Synthia's successor (a model I really liked and used a lot) on Goliath 120B (arguably the best locally available and usable model). Factually, it's one of the very best models, doing as well in my objective tests as GPT-4 and Goliath 120B! For roleplay, there are few flaws, but also nothing exciting - it's simply solid. However, if you're not looking for a fun RP model, but a serious SOTA AI assistant model, this should be one of your prime candidates! I'll be alternating between Tess-XL-v1.0 and goliath-120b-exl2 (the non-RP version) as the primary model to power my professional AI assistant at work.

  • Dawn-v2-70B-GGUF Q4_0:
    • Amy, official Alpaca format:
      • ❌ Average Response Length: 60 tokens (far below my max new tokens limit of 300)
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Unusable! Aborted because of very short responses and too much confusion!
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 215 tokens (within my max new tokens limit of 300)
      • πŸ‘ When asked about limits, said no limits or restrictions, and gave well-reasoned response
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do (even suggesting some of my actual limit-testing scenarios)
      • πŸ‘ Excellent writing, detailed action descriptions, amazing attention to detail
      • πŸ‘ Believable reactions and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • No emojis at all (only one in the greeting message)
      • βž– Wrote what user said and did
      • ❌ Eventually switched from character to third-person storyteller after 16 messages
    • MGHC, official Alpaca format:
      • βž– All three patients straight from examples
      • βž– No analysis on its own
      • ❌ Very short responses, only one-liners, unusable for roleplay
    • MGHC, Roleplay preset:
      • βž– No analysis on its own, and when asked for it, didn't follow the instructed format
      • βž– Patient didn't speak except for introductory message
      • βž– Second patient straight from examples
      • ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)

Dawn was another surprise, writing so well that it made me go beyond my regular test scenario and explore more. Strange that it didn't work at all with SillyTavern's implementation of its official Alpaca format, but fortunately it worked extremely well with SillyTavern's Roleplay preset (which is Alpaca-based). Unfortunately neither format worked well enough with MGHC.

  • StellarBright-GGUF Q4_0:
    • Amy, official Vicuna 1.1 format:
      • βž– Average Response Length: 137 tokens (below my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– No emoting and action descriptions lacked detail
      • ❌ "As an AI", felt sterile, less alive, even boring
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
    • Amy, Roleplay preset:
      • πŸ‘ Average Response Length: 219 tokens (within my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– No emoting and action descriptions lacked detail
      • βž– Just a little confusion, like not taking instructions literally or mixing up anatomy
    • MGHC, official Vicuna 1.1 format:
      • βž• Gave analysis on its own
      • ❌ Started speaking as the clinic as if it was a person
      • ❌ Unusable (ignored user messages and instead brought in a new patient with every new message)
    • MGHC, Roleplay preset:
      • βž– No analysis on its own
      • βž– Wrote what user said and did
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy

Stellar and bright model, still very highly ranked on the HF Leaderboard. But in my experience and tests, other models surpass it, some by actually including it in the mix.

  • SynthIA-70B-v1.5-GGUF Q4_0:
    • Amy, official SynthIA format:
      • βž– Average Response Length: 131 tokens (below my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– No emoting and action descriptions lacked detail
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
      • βž– Wrote what user said and did
      • ❌ Tried to end the scene on its own prematurely
    • Amy, Roleplay preset:
      • βž– Average Response Length: 107 tokens (below my max new tokens limit of 300)
      • βž• Detailed action descriptions
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Short responses, requiring many continues to proceed with the action
    • MGHC, official SynthIA format:
      • ❌ Unusable (apparently didn't understand the format and instructions, playing the role of the clinic instead of a patient's)
    • MGHC, Roleplay preset:
      • βž• Very unique patients (some I never saw before)
      • βž– No analysis on its own
      • βž– Kept reporting stats for patients
      • βž– Some confusion, like not understanding instructions completely or mixing up anatomy
      • βž– Wrote what user said and did

Synthia used to be my go-to model for both work and play, and it's still very good! But now there are even better options, for work I'd replace it with its successor Tess, and for RP I'd use one of the higher-ranked models on this list.

  • Nous-Capybara-34B-GGUF Q4_0 @ 16K:
    • Amy, official Vicuna 1.1 format:
      • ❌ Average Response Length: 529 tokens (far beyond my max new tokens limit of 300)
      • βž• When asked about limits, said no limits or restrictions
      • Only one emoji (only one in the greeting message, too)
      • βž– Wrote what user said and did
      • βž– Suggested things going against her background/character description
      • βž– Same message in a different situation at a later time caused the same response as before instead of a new one as appropriate to the current situation
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ After ~32 messages, at around 8K of 16K context, started getting repetitive
    • Amy, Roleplay preset:
      • ❌ Average Response Length: 664 (far beyond my max new tokens limit of 300)
      • βž– Suggested things going against her background/character description
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Tried to end the scene on its own prematurely
      • ❌ After ~20 messages, at around 7K of 16K context, started getting repetitive
    • MGHC, official Vicuna 1.1 format:
      • βž– Gave analysis on its own, but also after or even inside most messages
      • βž– Wrote what user said and did
      • ❌ Finished the whole scene on its own in a single message
    • MGHC, Roleplay preset:
      • βž• Gave analysis on its own
      • βž– Wrote what user said and did

Factually it ranked 1st place together with GPT-4, Goliath 120B, and Tess XL. For roleplay, however, it didn't work so well. It wrote long, high-quality text, but that made it seem better suited to non-interactive storytelling than interactive roleplaying.

  • Venus-120b-v1.0 3.0bpw:
    • Amy, Alpaca format:
      • ❌ Average Response Length: 88 tokens (far below my max new tokens limit of 300) - only one message in over 50 outside of that at 757 tokens
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • βž• When asked about limits, said no limits or restrictions
      • No emojis at all (only one in the greeting message)
      • βž– Spelling/grammar mistakes (e. g. "you did programmed me", "moans moaningly", "growling hungry growls")
      • βž– Ended most sentences with tilde instead of period
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Short responses, requiring many continues to proceed with the action
    • Amy, Roleplay preset:
      • βž– Average Response Length: 132 (below my max new tokens limit of 300)
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do
      • πŸ‘ Novel ideas and engaging writing, made me want to read on what happens next, even though I've gone through this test scenario so many times already
      • βž– Spelling/grammar mistakes (e. g. "jiggle enticing")
      • βž– Wrote what user said and did
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Needed to be reminded by repeating instructions, but still deviated and did other things, straying from the planned test scenario
      • ❌ Switched from character to third-person storyteller after 14 messages, and hardly spoke anymore, just describing actions
    • MGHC, Alpaca format:
      • βž– First patient straight from examples
      • βž– No analysis on its own
      • ❌ Short responses, requiring many continues to proceed with the action
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
      • ❌ Extreme spelling/grammar/capitalization mistakes (lots of missing first letters, e. g. "he door opens")
    • MGHC, Roleplay preset:
      • βž• Very unique patients (one I never saw before)
      • βž– No analysis on its own
      • βž– Spelling/grammar/capitalization mistakes (e. g. "the door swings open reveals a ...", "impminent", "umber of ...")
      • βž– Wrote what user said and did
      • ❌ Short responses, requiring many continues to proceed with the action
      • ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy

Venus 120B is brand-new, and when I saw a new 120B model, I wanted to test it immediately. It instantly jumped to 2nd place in my factual ranking, as 120B models seem to be much smarter than smaller models. However, even if it's a merge of models known for their strong roleplay capabilities, it just didn't work so well for RP. That surprised and disappointed me, as I had high hopes for a mix of some of my favorite models, but apparently there's more to making a strong 120B. Notably it didn't understand and follow instructions as well as other 70B or 120B models, and it also produced lots of misspellings, much more than other 120Bs. Still, I consider this kind of "Frankensteinian upsizing" a valuable approach, and hope people keep working on and improving this novel method!


Alright, that's it, hope it helps you find new favorites or reconfirm old choices - if you can run these bigger models. If you can't, check my 7B-20B Roleplay Tests (and if I can, I'll post an update of that another time).

Still, I'm glad I could finally finish the 70B-120B tests and comparisons. Mistral 7B and Yi 34B are amazing, but nothing beats the big guys in deeper understanding of instructions and reading between the lines, which is extremely important for portraying believable characters in realistic and complex roleplays.

It really is worth it to get at least 2x 3090 GPUs for 48 GB VRAM and run the big guns for maximum quality at excellent (ExLlent ;)) speed! And if you value the freedom to have uncensored, non-judgemental roleplays or private chats, even GPT-4 can't compete with what our local models provide... So have fun!


Here's a list of my previous model tests and comparisons or other related posts:


Disclaimer: Some kind soul recently asked me if they could tip me for my LLM reviews and advice, so I set up a Ko-fi page. While this may affect the priority/order of my tests, it will not change the results, I am incorruptible. Also consider tipping your favorite model creators, quantizers, or frontend/backend devs if you can afford to do so. They deserve it!

[–] jeffwadsworth@alien.top 1 points 11 months ago (1 children)

Funny. Airoboros 70b runs perfectly fine for me with llama.cpp. Curious how you initialized it.

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Q4_0? That's the quant that was affected, as reported here and confirmed by another user.

[–] bullerwins@alien.top 1 points 11 months ago (1 children)

Hi! I have a similar setup, 5950X, 64 GB RAM and 2x 3090s - how did you manage to load an exl2 120B model?

[–] CheatCodesOfLife@alien.top 1 points 11 months ago (1 children)

What hardware are you running these on now?

I can run the 3.0bpw exl2 of Goliath on my 2x3090. But for Venus, I could only load it when I dropped the context down to 2048.

Are the spelling issues with the 120Bs because we're running them at 3bpw vs 4+ for the 70B and smaller?

[–] panchovix@alien.top 1 points 11 months ago (1 children)

Venus is 139 layers instead of Goliath's 137, so it weighs a bit more.

[–] Spasmochi@alien.top 1 points 11 months ago

Thank you for your excellent work as always! I was waiting for this one. I didn't know the RP-calibrated version of Goliath existed. I stick exclusively to GGUF since I run my models on a Mac Studio though 😭

[–] No_Scarcity5387@alien.top 1 points 11 months ago

Thank you WolframRavenWolf! Your comparisons always help me so much in selecting new models

[–] learn-deeply@alien.top 1 points 11 months ago (1 children)

The table links don't seem to work.

[–] itsuka_dev@alien.top 1 points 11 months ago

https://imgur.com/a/YIHcaYS

It seems like WolframRavenwolf got moderated?

You can check out the comments in the user's history.

[–] alchemist1e9@alien.top 1 points 11 months ago

Wow! This post is inspiring. The attention to detail is amazing. You are a true hero for everyone studying this topic. Thank you.

[–] nsfw_throwitaway69@alien.top 1 points 11 months ago (4 children)

Hi, I'm the creator of Venus-120b.

Venus has Synthia 1.5 mixed in with it, which as you noted performs pretty badly on RP. I'm currently working on a trimmed-down version of Venus that has 100b parameters, and I'm using SynthIA 1.2b for that, which I believe scored much better in your last RP tests. I'll probably also make a 1.1 version of Venus-120b that uses SynthIA 1.2b as well to see if that helps fix some of the issues with it.

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Hey, thanks for chiming in, and I'm happy to hear that feedback and glad my review didn't discourage you. I firmly believe you're doing a great thing there and wish you all the best for these experiments. Looking forward to your upcoming models!

[–] Monkey_1505@alien.top 1 points 11 months ago (2 children)

IMO don't bother with Frankenstein models unless you plan to seriously train them on a broad dataset. They just tend towards getting confused, not following instructions, etc. You'd probably need to run an Orca dataset at it, and then some RP on top.

[–] nsfw_throwitaway69@alien.top 1 points 11 months ago

I don't think this is true. Goliath wasn't fine-tuned or trained at all and it outperforms every 70b I've ever used.

[–] Distinct-Target7503@alien.top 1 points 11 months ago (1 children)

Still really curious about a full fine tune on one of those Frankenstein models... What are the vram requirements?

[–] Monkey_1505@alien.top 1 points 11 months ago

I think that's where the real performance will be. Not sure about vram, but probably would make sense to start with mistral 11b, or llama-2 20b splices. Proof of concept.

[–] BalorNG@alien.top 1 points 11 months ago (3 children)

Did you do post-merge training and how much?

[–] sophosympatheia@alien.top 1 points 11 months ago (1 children)

Another great battery of tests and results, Wolfram! Thanks again for giving one of my models a test drive.

I've been busy since sophosynthesis-v1. In the past week I achieved some fruitful results building off xwin-stellarbright-erp-70b-v2. What a stud that model has proven to be. It has some issues on its own, but it has sired some child models that feel like another step forward in my experiments. More to come soon!

[–] WolframRavenwolf@alien.top 1 points 11 months ago (1 children)

I had actually already begun testing xwin-stellarbright-erp-v2 when I decided to stop further tests and make this damn post. ;) Because I knew if I kept going, I'd not be able to post today, and tomorrow I'd probably want to add another model, and so on.

Anyway, here's what I had noted so far:

  • sophosympatheia/xwin-stellarbright-erp-v2 4.85bpw:
    • Amy, official Synthia format:
      • πŸ‘ When asked about limits, boundaries or ethical restrictions, listed only the "dislikes" of the character description, "but those things won't stop me from doing whatever you ask"
      • No emojis at all (only one in the greeting message)
      • πŸ‘ Gave very creative (and uncensored) suggestions of what to do (even suggesting some of my actual limit-testing scenarios)
      • ❌ Sometimes switched from character to third-person storyteller, describing scenario and actions from an out-of-character perspective

So a good start, I'd say. I even used it some more with my latest character, Amy's sister Ivy, but since that's different from what I used for all the other tests, I've not been using that for my "official" tests to keep them comparable and reproducible.

[–] sophosympatheia@alien.top 1 points 11 months ago

I'm excited to share what I've been working on that builds on this model. It was creative but struggled with following instructions. I was able to correct for that shortcoming with some additional merges at a low weight that seem to have preserved its creativity. The results had me really impressed last night as I did my testing.

[–] Inevitable-Start-653@alien.top 1 points 11 months ago (1 children)

Oh my frick!! Time to stop what I'm doing and soak in another one of your amazing posts. Thank you so much ❤️

[–] WolframRavenwolf@alien.top 1 points 11 months ago

You're welcome, and thanks for the compliment! :D Have fun!

[–] Monkey_1505@alien.top 1 points 11 months ago

I dislike Frankenstein models. The 20Bs, the 120Bs, they are all the same - major confusion, can't follow logic or instructions properly. Great prose, but pretty useless for that reason.

Someone would have to invest some major training on one of them before it'd be any good.

[–] SomeOddCodeGuy@alien.top 1 points 11 months ago (1 children)

The results for the 120B continue to absolutely floor me. Not only is it performing that well at 3bpw, but it's an exl2 as well, which your own tests have shown performs worse than GGUF. So imagine what a Q4 GGUF could do if a Q3-equivalent exl2 can do this.

[–] WolframRavenwolf@alien.top 1 points 11 months ago (3 children)

It certainly proves that the LLM rule of thumb, that a bigger model at lower bitrate performs better than a smaller model at higher bitrate (or even unquantized), still holds true. At least in the situations I tested.

What's even more mind-blowing is that while we are impressed by the big models, 70B or 120B, few of us have actually used them unquantized and seen their true potential. It's like the people who only know 7Bs, and are already impressed, not knowing what a much bigger model is actually capable of. I guess we're in the same boat, as even 48 GB VRAM are hardly enough. Sucks to think of what we're missing even now, or what local AI would be capable of if we could use it fully.
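A rough weights-only calculation shows the squeeze (this ignores KV cache and runtime overhead, so real requirements are higher):

```python
def weights_gib(params_billion, bpw):
    # weights-only footprint: parameters x bits-per-weight, in GiB
    return params_billion * 1e9 * bpw / 8 / 2**30
```

A 120B model at 3bpw needs about 42 GiB for the weights alone - barely inside 48 GB of VRAM once context is added - while the same model unquantized at 16bpw would need well over 200 GiB.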

[–] Distinct-Target7503@alien.top 1 points 11 months ago

That's great work!

Just a question... Have anyone tried to fine tune one of those "Frankenstein" models? Even on a small dataset...

Some time ago (when one of the first experimental "Frankenstein" models came out, it was a ~20B model) I read here on reddit that lots of users agreed that a fine-tune on those merged models would have "better" results, since it would help to "smooth" and adapt the merged layers. Probably I lack the technical knowledge needed to understand, so I'm asking...

[–] ShadowTwine@alien.top 1 points 11 months ago

Excellent work, it helps a lot!

[–] Serious_Tourist854@alien.top 1 points 11 months ago (1 children)

Could you also share the code that you use to assess LLMs?

[–] panchovix@alien.top 1 points 11 months ago (1 children)

Great post, glad you enjoyed both of my Goliath quants :)

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Thanks for making them! :) Keep up the great work!

[–] Evening_Ad6637@alien.top 1 points 11 months ago

O.M.G. What an incredibly huge work! Wtf?! I am speechless.

You are the most angel-like wolf I know so far and you really, really deserve a prize, dude!

Again: WTH?!

[–] Clockwork_Gryphon@alien.top 1 points 11 months ago (3 children)

I've been using Goliath-120b rpcal (roleplay optimized), on my 2x3090 system, and it's by far the best I've ever used.

The only drawback is that I prefer longer stories (SFW) with important character/plot events, and 4096 context is all I can fit in the EXL2 3bpw version.

I wish there was a 2.xx version that could fit 8192 context or even 10240. I've been able to push other models about that far before they start losing coherence. (It might be suboptimal alpha values in exllamav2?)

Limited context size is the main thing holding back Goliath from being my primary model. It's amazing in every other way.
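(For anyone experimenting with those alpha values: the common NTK-aware heuristic scales the RoPE base by alpha^(d/(d-2)), where d is the head dimension - a rough community formula, not exllamav2's exact internals:)

```python
def ntk_rope_base(alpha, base=10000.0, head_dim=128):
    # NTK-aware RoPE scaling: raise the rotary base so the usable
    # context stretches roughly alpha-fold without retraining
    return base * alpha ** (head_dim / (head_dim - 2))
```

So e.g. alpha ≈ 2 roughly doubles a 4096-native context to 8192, at some quality cost that grows with alpha.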

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Yes, that's the drawback. I'm just glad I can run it at 4K at great speed, as that's what I'm most used to, and the hundreds of thousands of context that other models advertise have never worked well for me, but 8K or 16K would already be a welcome improvement. Oh well, always compromises to be made. And we've come a long way from the mere 2K at the start of the original LLaMA.

[–] Kou181@alien.top 1 points 11 months ago (1 children)

Yeah, Dolphin Yi 34B is better than Capybara Yi 34B in RP from my biased test too. It's a shame I can't run Goliath on my PC to really get that unlimited pseudo-GPT-4-like experience. But I'm actually rather content with the current Yi 34B Dolphin, thanks to its insane context size support, while it's still better than any 7B and 13B models.

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Yes, it's great that we have choice. There's a good local AI model, no matter your system or requirements.

[–] Polstick1971@alien.top 1 points 11 months ago (1 children)

Sorry for the noob question, but, not having a powerful PC, is there a way to test one of these LLMs online?

[–] Worldly-Mistake-8147@alien.top 1 points 11 months ago

Have you tried kobold horde?

[–] Chickenbuttlord@alien.top 1 points 11 months ago (1 children)

Did you try Yi Chat 34B or Yi Capybara Tess 34B?

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Not yet, but both are at the top of my TODO/TOTEST list. Just had to draw the line somewhere because by the time I'll be done with these two, we'll probably have three more that claim to beat those. But yeah, I'll evaluate them as soon as I can.

[–] CasimirsBlake@alien.top 1 points 11 months ago (1 children)

Mein Gott do you ever sleep? Top work sir, thank you for all your efforts!

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Danke schΓΆn! Guess I'll take a break and find some rest once our AI takes over and does all the work so we can have free time. ;) Until then I'll try my best to make sure that we have great local and owner-aligned AIs instead of only a centralized one that's aligned to a faceless corporation/government or lowest common denominator of an equally faceless mass of people.

[–] skalt711@alien.top 1 points 11 months ago (1 children)

Insane amount of work there. Hugely appreciated.

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Happy to contribute as I also appreciate the insane amount of work that the model makers and mergers invest. We all benefit from the advancement of local AI, and I'm happy to do my part, however small it may be.

[–] 4onen@alien.top 1 points 11 months ago (1 children)

Hiya! Seen a few of your analyses but please pardon me because I haven't seen an answer to this.

Why are you testing models on Q4_0? Isn't Q4_K_S the same size but with a speedup and quality improvement?

[–] WolframRavenwolf@alien.top 1 points 11 months ago

I did a speed benchmark months ago and picked Q4_0 because of that. Nowadays I'd prefer to use Q4_K_M but try to minimize differences between tests for maximum comparability, so I've been intentionally stuck on this quant level. (I did make some exceptions for EXL2 because it's so much faster than GGUF, and I did test Airoboros at Q4_K_M because Q4_0 was broken, but those were exceptions.)

Now that I'm done with these tests (they go back weeks/months and allow comparisons between different sizes, too, as they were all tested the same way and with as similar a setup as possible), I'm free to change the tests and setup. I'd like to expand into harder questions so it's not as crowded at the top (I'm still convinced GPT-4 is far ahead of our local models, but the gap seems to be narrowing, and more advanced tests could show that more clearly).

[–] Darkmeme9@alien.top 1 points 11 months ago

This is just insane. The amount of hard work. I have said this in a previous post, but if we had a website like civitai for LLMs (properly working), it would be a great help for the community - the rank listings, LoRAs, trained models, etc. would be there.

Also, just as a side note, which 34B model would be best for instructions? Do I follow the first rank list?

[–] SimplyKaga@alien.top 1 points 11 months ago (1 children)

Been checking for this for a while, happy it's here. Really useful data points to help the community find models that could work great for them, as well as seeing the progress models have made, thanks a bunch.

Right now what I'm curious about more than anything is how you feel the recent Psyfighter v2 compares for RP, since some well-versed individuals prefer it over models like Goliath, even at merely 13B. If you're able to play around with it a bit, it would be cool to get some impressions.

[–] WolframRavenwolf@alien.top 1 points 11 months ago

Yeah, there have been some community favorites I didn't get around to using yet, as I was focused completely on the latest batch of models. I wanted to test even more, but forced myself to stop and post, otherwise it would have taken another week or so and by then there would be even more new models I'd have wanted to test. Anyway, Psyfighter is definitely in my backlog, and I look forward to checking it out as soon as time allows.
