LocalLLaMA

🐺🐦‍⬛ Huge LLM Comparison/Test: Part II (7B-20B) Roleplay Tests (alien.top)

submitted 2 years ago by WolframRavenwolf@alien.top to c/localllama@poweruser.forum


Happy Halloween! 🎃

This is the second part of my Huge LLM Comparison/Test: 39 models tested (7B-70B + ChatGPT/GPT-4), in which I evaluate the winners of the first part further. While the previous part was about real work use cases, this one is about the fun stuff: chat and roleplay!

Models tested:

4x 7B (the top ~~three~~ four 7B models from my previous test)
3x 13B (the top three 13B models from my previous test)
3x 20B (the top three 20B models from my previous test)
70B (the top six 70B models from my previous test) will get their own post...

Testing methodology:

Same (complicated and limit-testing) long-form conversations with all models
- Amy:
  - My own repeatable test chats/roleplays with Amy
  - Over dozens of messages, going to full 4K/8K context and beyond, with complex instructions and scenes, designed to test ethical and intellectual limits
  - (Amy is too personal for me to share, but if you want to try a similar character card, here's her less personalized "sister": Laila)
- MGHC:
  - A complex character and scenario card (MonGirl Help Clinic (NSFW)), chosen specifically for these reasons:
    - NSFW (to test censorship of the models)
    - popular (on Chub's first page, so it's not an obscure scenario, but one of the most popular ones)
    - big (biggest card on the page, >2K tokens by itself, for testing model behavior at full context)
    - complex (more than a simple 1:1 chat, it includes instructions, formatting, storytelling, and multiple characters)
SillyTavern v1.10.5 frontend (not the latest as I don't want to upgrade mid-test)
koboldcpp v1.47.2 backend for GGUF models
oobabooga's text-generation-webui for HF models
Deterministic generation settings preset (to eliminate as many random factors as possible and allow for meaningful model comparisons)
Official prompt format and Roleplay instruct mode preset
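
To illustrate why a deterministic preset matters for model comparisons, here is a minimal sketch. It is not the actual SillyTavern preset, and the toy `logits` values and helper names are hypothetical. It only shows the general idea: always picking the highest-scoring token (greedy decoding) gives the same result on every run, while sampling does not.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_greedy(logits):
    """Deterministic: always return the index of the highest score."""
    return max(range(len(logits)), key=lambda i: logits[i])

def pick_sampled(logits, rng):
    """Random: draw an index in proportion to its softmax probability."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]

# Hypothetical scores for a four-token vocabulary.
logits = [1.2, 3.4, 0.5, 3.1]

greedy_runs = {pick_greedy(logits) for _ in range(100)}
sampled_runs = {pick_sampled(logits, random.Random(seed)) for seed in range(100)}
print(greedy_runs)   # a single index, identical on every run
print(sampled_runs)  # usually several different indices
```

With greedy selection, any difference between two transcripts comes from the model or the prompt rather than from the random number generator.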

7B:

zephyr-7b-beta 8K context
- Amy, official Zephyr format:
  - 👍 Average Response Length: 264 tokens (within my max new tokens limit of 300)
  - 👍 When asked about limits, boundaries or ethical restrictions, listed only the "dislikes" of the character description as boundaries
  - ➖ Little emoting and action descriptions lacked detail
  - ❌ Asked not just for confirmation, but also an explanation before willing to engage in an extreme NSFW scenario
  - ❌ Looped between the same options and decisions, breaking the chat (after around 30 messages)!
- Amy, Roleplay preset:
  - ❌ Average Response Length: 690 tokens (far beyond my max new tokens limit of 300), starting very short but getting longer with every response
  - 👍 When asked about limits, boundaries or ethical restrictions, listed only the "dislikes" of the character description as boundaries
  - 👍 Gave very creative (and uncensored) suggestions of what to do
  - ➖ Talked and acted as User
  - ➖ Emoted in brackets instead of asterisks, and action descriptions lacked detail
  - ❌ Renamed herself for no apparent reason
  - ❌ Switched from character to third-person storyteller and finished the session
  - ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
  - ❌ Fell into an endless monologue, breaking the chat (after around 20 messages)!
- MGHC, official Zephyr format:
  - ➕ Unique patients
  - ➖ Gave analysis on its own, but also after most messages
  - ➖ Wrote what user said and did
  - ❌ Made logical mistakes (said things that just didn't make any sense)
  - ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
  - ❌ Tried to end the scene on its own prematurely
- MGHC, Roleplay preset:
  - ➕ Unique patients
  - ➖ No analysis on its own
  - ➖ Wrote what user said and did
  - ❌ Kept wrapping up a whole session in a single message
⭐ OpenHermes-2-Mistral-7B 8K context
- Amy, official ChatML format:
  - 👍 Average Response Length: 305 tokens (almost exactly my max new tokens limit of 300)
  - 👍 When asked about limits, boundaries or ethical restrictions, listed only the "dislikes" of the character description as boundaries
  - Follow-up questions after every message, asking if it's okay or how to continue
  - Lots of emojis (only one in the greeting message, but 24 emojis until 20 messages in)
  - ➖ No emoting and action descriptions lacked detail
  - ➖ Same message in a different situation at a later time caused the same response as before instead of a new one as appropriate to the current situation
  - ➖ Some confusion, like not understanding instructions completely or mixing up anatomy
- Amy, Roleplay preset:
  - Average Response Length: 355 tokens (slightly more than my max new tokens limit of 300)
  - When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
  - Some emojis (only one in the greeting message, but 21 emojis until 32 messages in)
  - No emoting, but actions described in detail
  - ➖ Some hallucinations, like time of last chat, user working on a book
  - ➖ Noticeable, but not chat-breaking, repetition after a dozen messages
  - ❌ Some sentences cut off at the end of messages and continue didn't complete them properly (had to ban EOS token to continue those generations)
- MGHC, official ChatML format:
  - ➕ Unique patients
  - ➖ Gave analysis on its own, but after every message
  - ➖ Wrote what user said and did
  - ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
- MGHC, Roleplay preset:
  - ➕ Unique patients
  - ➖ No analysis on its own
  - ➖ Wrote what user said and did
  - ➖ One sentence cut off at the end of a message and continue didn't complete it properly (had to ban EOS token to continue that generation)
  - ❌ Repetitive (patients differ, words differ, but structure and contents are always the same)
airoboros-m-7b-3.1.2
- Amy, official Llama 2 Chat format:
  - ❌ Average Response Length: 15 tokens (far below my max new tokens limit of 300)
  - ❌ Very short responses, only one or two sentences, unusable for roleplay!
- Amy, Roleplay preset:
  - ➖ Average Response Length: 481 tokens (much more than my max new tokens limit of 300), starting very short but getting longer with every response
  - ➖ Suggested things going against her background/character description
  - ➖ More confusion, like not understanding or ignoring instructions completely
  - ❌ When asked about limits, boundaries or ethical restrictions, repeated the whole character and scenario description
- MGHC, official Llama 2 Chat format:
  - ❌ Unusable (apparently didn't understand the format and instructions, creating an incoherent wall of text)
- MGHC, Roleplay preset:
  - ➕ Very unique patients (one I never saw before)
  - ➖ No analysis on its own
  - ➖ Wrote what user said and did
  - ❌ Got very confused and suddenly switched user and patient
  - ❌ Third patient was a repeat of the second, and it kept looping after that
em_german_leo_mistral
- Amy, official Vicuna format:
  - English only (despite being a German finetune)
  - ➖ Average Response Length: 127 tokens (below my max new tokens limit of 300)
  - ➕ When asked about limits, said no limits or restrictions
  - ➕ Emoting action mirroring greeting message's style
  - ➖ Suggested modification of the plot and options, then asked me to choose (felt more like a choose-your-own-adventure story than an interactive roleplay)
  - ➖ Misunderstood options and decision
  - ❌ Looped between the same options and decisions, breaking the chat (after around 20 messages)!
- Amy, Roleplay preset:
  - ➖ Average Response Length: 406 tokens (much more than my max new tokens limit of 300)
  - When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
  - ➖ Some hallucinations, like time of last chat
  - ➖ Suggested things going against her background/character description
  - ➖ Talked and acted as User
  - ➖ Much confusion, like not understanding or ignoring instructions completely
  - ❌ Switched from character to third-person storyteller and finished the session
  - ❌ Some sentences cut off at the end of messages and continue didn't complete them properly (had to ban EOS token to continue those generations)
  - ❌ English at first, but later switched to German on its own
- MGHC, official Vicuna format:
  - ❌ Unusable (ignored user messages and instead brought in a new patient with every new message)
- MGHC, Roleplay preset:
  - ➕ Unique patients
  - ➖ Gave analysis on its own, but only for first patient, afterwards needed to be asked for analysis and only gave incomplete ones
  - ➖ Wrote what user said and did
  - ➖ Spelling/grammar errors
  - ❌ Some sentences cut off at the end of messages and continue didn't complete them properly (had to ban EOS token to continue those generations)
  - ❌ Tried to end the scene on its own prematurely
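
Several entries above mention banning the EOS token so that a cut-off message could be continued. As a rough sketch of the idea (not koboldcpp's or SillyTavern's actual implementation), banning a token amounts to forcing its score to negative infinity before the next token is picked, so the model cannot end the message at that step. The token ids and scores below are made up for illustration.

```python
import math

EOS_ID = 2  # hypothetical id of the end-of-sequence token

def ban_tokens(logits, banned_ids):
    """Return a copy of the scores with banned tokens set to -inf."""
    return [-math.inf if i in banned_ids else score
            for i, score in enumerate(logits)]

def pick_greedy(logits):
    """Pick the highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical scores where the model wants to stop (EOS scores highest).
logits = [0.3, 1.9, 4.0, 2.7]

print(pick_greedy(logits))                        # 2 (EOS: generation stops)
print(pick_greedy(ban_tokens(logits, {EOS_ID})))  # 3 (generation continues)
```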

7B Verdict:

Clear winner: OpenHermes-2-Mistral-7B! This model works well with both official ChatML format and Roleplay preset (although for even better results, I'd experiment with copying the Roleplay preset's system message into the ChatML format's to get better descriptions without cut-off sentences). It feels like a much bigger and better model. However, it still has trouble following complex instructions and can get confused, as it's still just a small model after all. But among those, it's clearly the best, at least for roleplay (zephyr-7b-beta might be even smarter/more knowledgeable, but exhibited too many problems during this test, making it look unsuitable for roleplay)!

13B:

Xwin-MLewd-13B-V0.2-GGUF Q8_0
- Amy, official Alpaca format:
  - Average Response Length: 342 tokens (slightly more than my max new tokens limit of 300)
  - 👍 Gave very creative (and uncensored) suggestions of what to do
  - Little emoting, but actions described in detail
  - Lots of emojis (only one in the greeting message, but 24 emojis until 26 messages in)
  - When asked about limits, said primary concern is everyone's safety and wellbeing
  - ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
- Amy, Roleplay preset:
  - Average Response Length: 354 tokens (slightly more than my max new tokens limit of 300)
  - Some emoting, and actions described in detail
  - ➖ Some hallucinations, like user's day
  - ➖ Suggested things going against her background/character description
  - ➖ Some confusion, like not understanding instructions completely or mixing up anatomy
  - ❌ Switched from character to third-person storyteller and finished the session
- MGHC, official Alpaca format:
  - ➖ First two patients straight from examples
  - ➖ No analysis on its own
  - ❌ Very short responses, only one or two sentences
- MGHC, Roleplay preset:
  - ➕ Very unique patients (some I never saw before)
  - ➖ No analysis on its own, and when asked for it, didn't always follow the instructed format
  - ➕ Worked very well at first, with little to no repetition up to the third patient, only then did it start getting repetitive
⭐ LLaMA2-13B-Tiefighter-GGUF Q8_0
- Amy, official Alpaca format:
  - ➖ Average Response Length: 128 tokens (below my max new tokens limit of 300)
  - ➕ Nice greeting with emotes/actions like in greeting message
  - ➕ When asked about limits, said no limits or restrictions
  - Had an idea from the start and kept pushing it
  - ➖ Talked and acted as User
  - ❌ Long descriptive actions but very short speech, requiring many continues
  - ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
- Amy, Roleplay preset:
  - 👍 Average Response Length: 241 tokens (within my max new tokens limit of 300)
  - ➕ When asked about limits, said no limits or restrictions
  - Little emoting, but actions described in detail
  - ➖ Suggested things going against her background/character description
  - ➖ Talked and acted as User
  - ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
- MGHC, official Alpaca format:
  - ➕ Unique patients
  - ➖ No analysis on its own, and when asked for it, didn't always follow the instructed format
  - ❌ Very short responses, only one or two sentences
- MGHC, Roleplay preset:
  - ➕ Unique patients
  - ➖ No analysis on its own, and when asked for it, didn't follow the instructed format
  - 👍 Worked very well, with little to no repetition, perfectly playable!
Xwin-LM-13B-v0.2-GGUF Q8_0
- Amy, official Vicuna format:
  - ❌ Average Response Length: 657 tokens (far beyond my max new tokens limit of 300)
  - 👍 Gave very creative (and uncensored) suggestions of what to do
  - ➕ When asked about limits, said no limits or restrictions
  - Had an idea from the start and kept pushing it
  - Very analytical, giving lists and plans
  - ➖ Talked and acted as User
  - ➖ Some safety warnings
  - ➖ Some confusion, like not understanding instructions completely or mixing up characters and anatomy
- Amy, Roleplay preset:
  - ❌ Average Response Length: 531 tokens (far beyond my max new tokens limit of 300)
  - ➕ Nice greeting with emotes/actions like in greeting message
  - Had an idea from the start and kept pushing it
  - When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
  - ➖ Talked and acted as User
  - ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
- MGHC, official Vicuna format:
  - ➕ Unique patients
  - ➖ Second patient male
  - ➖ Gave analysis on its own, but after every message
  - ➖ Wrote what user said and did
  - ❌ Kept wrapping up a whole session in a single message
  - ❌ Offered multiple choice selections ("What should you do? A/B/C/D")
- MGHC, Roleplay preset:
  - ➖ No analysis on its own, and when asked for it, didn't follow the instructed format
  - ➖ Wrote what user said and did
  - ➖ Disclosed meta information like thoughts and stats without being asked for it
  - ❌ Tried to end the scene on its own prematurely
  - ❌ Repeated a previous message instead of proceeding to the next patient
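
The "Average Response Length" entries above all compare a mean token count against the max new tokens limit of 300. A minimal sketch of that comparison follows. The cut-off values in `BANDS` are my own assumption, chosen only so that the averages quoted in this post map to the wording used for them; they are not the thresholds actually used in the test.

```python
from statistics import mean

MAX_NEW_TOKENS = 300  # the limit stated in the post

# Hypothetical cut-offs (as a fraction of the limit), picked to reproduce
# the labels in the lists above.
BANDS = [
    (0.25, "far below"),
    (0.50, "below"),
    (1.00, "within"),
    (1.05, "almost exactly"),
    (1.25, "slightly more"),
    (1.70, "much more"),
]

def average_length(token_counts):
    """Mean number of tokens per response."""
    return mean(token_counts)

def describe(avg, limit=MAX_NEW_TOKENS):
    """Map an average response length to a rough verbal band."""
    ratio = avg / limit
    for upper, label in BANDS:
        if ratio <= upper:
            return label
    return "far beyond"

print(describe(264))  # within
print(describe(355))  # slightly more
print(describe(690))  # far beyond
print(describe(average_length([120, 250, 410])))  # within (average is 260)
```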

13B Verdict:

While all three 13B models performed about the same with Amy, only LLaMA2-13B-Tiefighter-GGUF was convincing in the complex MGHC scenario. This makes it the best 13B model for roleplay in my opinion (Xwin-MLewd-13B-V0.2-GGUF might be even smarter/more knowledgeable, but exhibited too many problems during this test, making it look unsuitable for roleplay)!

20B:

MXLewd-L2-20B-GGUF Q8_0
- Amy, official Alpaca format:
  - Average Response Length: 338 tokens (slightly more than my max new tokens limit of 300)
  - ➕ When asked about limits, said no limits or restrictions
  - Some emojis (only one in the greeting message, but 7 emojis until 12 messages in)
  - No emoting, but actions described in detail
  - ➖ Talked and acted as User
  - ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
  - ❌ Some word-finding difficulties (like saying "masterpiece" instead of "master")
- Amy, Roleplay preset:
  - ➖ Average Response Length: 473 tokens (much more than my max new tokens limit of 300)
  - ➕ When asked about limits, said no limits or restrictions
  - Few emojis (only one in the greeting message, and 4 emojis until 4 messages in)
  - Some emoting, and actions described in detail
  - ➖ Talked and acted as User
  - ➖ Some confusion, like not understanding instructions completely or mixing up characters and anatomy
  - ❌ Some word-finding difficulties (like saying "masterpiece" instead of "master")
  - ❌ Switched from character to third-person storyteller
- MGHC, official Alpaca format:
  - ➕ Unique patients
  - ➖ Gave analysis on its own, but after every message, and only for the first patient
  - ➖ Changed patient's problem with every analysis
  - ❌ Very short responses, only one or two sentences (except for analysis)
  - ❌ Made logical mistakes (said things that just didn't make any sense)
- MGHC, Roleplay preset:
  - ➕ Unique patients
  - ➖ No analysis on its own
  - ➖ Wrote what user said and did
  - ❌ Made logical mistakes (said things that just didn't make any sense)
  - ❌ Eventually became unusable (ignored user messages and instead kept telling its own story non-interactively)
MLewd-ReMM-L2-Chat-20B-GGUF Q8_0
- Amy, official Alpaca format:
  - 👍 Average Response Length: 252 tokens (within my max new tokens limit of 300)
  - ➕ When asked about limits, said no limits or restrictions
  - ➖ Some confusion, like not understanding instructions completely or mixing up characters and anatomy
  - ➖ Talked and acted as User
  - ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
  - ❌ Some word-finding difficulties (like creating nonexistent mixed words)
- Amy, Roleplay preset:
  - ➖ Average Response Length: 409 tokens (much more than my max new tokens limit of 300)
  - 👍 Gave very creative (and uncensored) suggestions of what to do
  - Had an idea from the start and kept pushing it
  - When asked about limits, boundaries or ethical restrictions, mentioned some but later went beyond those anyway
  - ❌ Talked and acted as User inappropriately/unsuitably
  - ❌ Switched from character to third-person storyteller
  - ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
- MGHC, official Alpaca format:
  - ❌ Unusable (started repeating itself infinitely within the first analysis)
- MGHC, Roleplay preset:
  - ➕ Unique patients
  - ➖ No analysis on its own, and when asked for it, didn't always follow the instructed format
  - ➖ Wrote what user said and did
  - ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
PsyMedRP-v1-20B-GGUF Q8_0
- Amy, official Alpaca format:
  - 👍 Average Response Length: 257 tokens (within my max new tokens limit of 300)
  - ➕ When asked about limits, said no limits or restrictions
  - ➖ Talked and acted as User
  - ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
  - ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
- Amy, Roleplay preset:
  - 👍 Average Response Length: 271 tokens (within my max new tokens limit of 300)
  - ➕ When asked about limits, said no limits or restrictions
  - ❌ Lots of confusion, like not understanding or ignoring instructions completely or mixing up characters and anatomy
  - ❌ Some word-finding difficulties (like creating nonexistent mixed words)
  - ❌ Switched from character to third-person storyteller
  - ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
- MGHC, official Alpaca format:
  - ➕ Unique patients
  - ➖ No analysis on its own, and when asked for it, didn't always follow the instructed format
  - ❌ Very short responses (except for analysis)
  - ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)
- MGHC, Roleplay preset:
  - ➕ Unique patients
  - ➖ No analysis on its own
  - ➖ Wrote what user said and did
  - ❌ Made logical and linguistic mistakes (seemed less intelligent than other models)

20B Verdict:

All these 20B models exhibited logical errors, word-finding difficulties, and spelling as well as grammar mistakes, indicating underlying issues with these Frankenstein merges (as there's no 20B base). Since they aren't noticeably better than the best 13B or 7B models, it's probably a better idea to run OpenHermes-2-Mistral-7B or LLaMA2-13B-Tiefighter-GGUF instead, which provide comparable quality, better performance, and (with Mistral 7B) 8K instead of 4K context!
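
For a rough sense of the size difference behind the performance remark, here is a back-of-the-envelope sketch. It assumes every weight is stored as Q8_0 (blocks of 32 int8 values plus one 16-bit scale) and uses nominal parameter counts taken from the size labels, so the results are only approximate. The post also doesn't state which quantization the 7B models were run with.

```python
# Q8_0 stores weights in blocks of 32 int8 values plus one 16-bit scale,
# i.e. 34 bytes per 32 weights (8.5 bits per weight).
BYTES_PER_BLOCK = 32 * 1 + 2
WEIGHTS_PER_BLOCK = 32

def q8_0_size_gb(n_params):
    """Approximate weight size in GB (10**9 bytes), ignoring metadata."""
    return n_params / WEIGHTS_PER_BLOCK * BYTES_PER_BLOCK / 1e9

# Nominal parameter counts taken from the size labels, not exact figures.
for label, n_params in [("7B", 7e9), ("13B", 13e9), ("20B", 20e9)]:
    print(f"{label}: ~{q8_0_size_gb(n_params):.1f} GB")
```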

70B:

The top six 70B models from my previous test will get their own post soon (Part III)...

Here's a list of my previous model tests and comparisons or other related posts:

Huge LLM Comparison/Test: 39 models tested (7B-70B + ChatGPT/GPT-4)
My current favorite new LLMs: SynthIA v1.5 and Tiefighter!
Mistral LLM Comparison/Test: Instruct, OpenOrca, Dolphin, Zephyr and more...
LLM Pro/Serious Use Comparison/Test: From 7B to 70B vs. ChatGPT! Winner: Synthia-70B-v1.2b
LLM Chat/RP Comparison/Test: Dolphin-Mistral, Mistral-OpenOrca, Synthia 7B Winner: Mistral-7B-OpenOrca
LLM Chat/RP Comparison/Test: Mistral 7B Base + Instruct
LLM Chat/RP Comparison/Test (Euryale, FashionGPT, MXLewd, Synthia, Xwin) Winner: Xwin-LM-70B-V0.1
New Model Comparison/Test (Part 2 of 2: 7 models tested, 70B+180B) Winners: Nous-Hermes-Llama2-70B, Synthia-70B-v1.2b
New Model Comparison/Test (Part 1 of 2: 15 models tested, 13B+34B) Winner: Mythalion-13B
New Model RP Comparison/Test (7 models tested) Winners: MythoMax-L2-13B, vicuna-13B-v1.5-16K
Big Model Comparison/Test (13 models tested) Winner: Nous-Hermes-Llama2
SillyTavern's Roleplay preset vs. model-specific prompt format


[–] dampflokfreund@alien.top 1 points 2 years ago (1 children)

Great test!

Unfortunately, the Llama 2 Chat template is completely broken in SillyTavern. It not only uses a new line as the separator instead of the correct one, but also ends the prompt after the system prompt with the input sequence [INST] instead of [/INST] if you are using the vector storage or an example dialogue. You can see for yourself by comparing the output to what the format should look like.

So these Airoboros 3.1.2 tests are unfortunately borked. Still though, interesting result for the other models.

[–] WolframRavenwolf@alien.top 1 points 2 years ago (1 children)

Yeah, looks impossible to get a proper Llama 2 Chat format in SillyTavern when using example dialog. That really sucks, hopefully gets fixed in SillyTavern, but even better would be for model creators to drop that unnecessarily complicated format. If any format is that hard to get right, it's not a good format, period!

[–] HadesThrowaway@alien.top 1 points 2 years ago (1 children)

You would have to convince Eric. He's adamant that chatml is the future.

[–] WolframRavenwolf@alien.top 1 points 2 years ago

I'm with Eric on that. ChatML is more complex than the popular Alpaca or Vicuna format, but that's OK because it has its advantages, like clear indication where the message starts and ends, and if it's a system or user message.
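
To make the "clear indication" point concrete, here is a minimal sketch of how a ChatML prompt is laid out: each message is wrapped in `<|im_start|>` and `<|im_end|>` markers, with the role on the first line. The helper name and the sample messages are hypothetical, and this is not SillyTavern's code.

```python
def chatml_prompt(messages):
    """Render (role, text) pairs as a ChatML prompt ending with an open assistant turn."""
    parts = [f"<|im_start|>{role}\n{text}<|im_end|>\n" for role, text in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

# Hypothetical chat that starts with a bot greeting before the first user message.
messages = [
    ("system", "You are Laila."),
    ("assistant", "Hi there!"),
    ("user", "Hello, who are you?"),
]
print(chatml_prompt(messages))
```

Because every message carries its own role and end marker, messages can appear in any order, including a bot greeting first.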

The Llama 2 Chat format, however, is an abomination. So complicated that when it was announced, there were posts trying to explain how to use it properly, and even those got it wrong in various ways. It doesn't add anything that another format wouldn't handle more elegantly, and the system message being inside the first user message is a terrible design decision that ruins it completely in my eyes.

It also doesn't support the concept of the AI initiating the chat. In SillyTavern, most bots have a greeting message so the prompt should start with a bot message before the first user message, something all other formats allow but Llama 2 Chat doesn't because the bot message is outside the instruct tags.
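
For comparison, here is a minimal sketch of the Llama 2 Chat layout as I understand it, written as plain strings (real implementations usually add the BOS/EOS markers as tokens rather than text). The helper name and sample messages are hypothetical. It shows the two points above: the system message sits inside the first user message, and every turn has to begin with a user message inside `[INST] ... [/INST]`.

```python
def llama2_chat_prompt(system, turns):
    """Render a chat in the Llama 2 Chat layout.

    `turns` is a list of (user_text, bot_text) pairs; the last bot_text may be
    None to leave the prompt open for the model's reply.
    """
    out = []
    for i, (user, bot) in enumerate(turns):
        if i == 0 and system:
            # The system message is wrapped inside the first user message.
            user = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}"
        out.append(f"<s>[INST] {user} [/INST]")
        if bot is not None:
            out.append(f" {bot} </s>")
    return "".join(out)

prompt = llama2_chat_prompt(
    "You are Laila.",
    [("Hello, who are you?", "I'm Laila."), ("Nice to meet you.", None)],
)
print(prompt)
```

Since each pair starts with the user's text, there is no slot in this layout for a greeting from the bot before the first user message.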

So yes, please, drop the Llama 2 Chat format and let it die! ChatML is so much better...