this post was submitted on 15 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Local LLMs are wonderful, and we all know that, but something that's always bothered me is that nobody in the scene seems to want to standardize or even investigate the flaws of the current sampling methods. I've found that the preset can make a model significantly worse or absolutely golden depending on the settings.

It might not seem obvious, or it might seem like the default for whatever backend you're using is already the 'best you can get', but let's challenge that assumption. There is more to language model settings than just 'prompt engineering', and your sampler settings can have a dramatic impact on the output.

For starters, there are no 'universally accepted' default settings; the defaults that exist will depend on the model backend you are using. There is also no standard for presets in general, so I'll be defining the sampler settings that are most relevant:

- Temperature

A common claim about Temperature that you'll often hear is that it makes the model 'more random'; it may appear that way, but it is actually doing something a little more nuanced.

A graph I made to demonstrate how temperature operates

What Temperature actually controls is the scaling of the scores (the raw logits) before they become probabilities. So 0.5 temperature is not 'twice as confident'; as the graph shows, 0.75 temp is actually much closer to that interpretation in this context.

Every time the model generates a token, it assigns a score to every token in its vocabulary (tens of thousands of them), and Temperature simply reduces (lowered temp) or boosts (raised temp) the scoring of the extremely low probability tokens.

In addition to this, when Temperature is applied in the sampler order also matters; I'll get into that later.
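For intuition, here is a minimal sketch of that scaling in Python/NumPy. The function name and toy numbers are mine for illustration, not any backend's actual code:

```python
import numpy as np

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Divide the raw scores (logits) by the temperature, then softmax them.

    Lower temperature sharpens the distribution (the tail shrinks);
    higher temperature flattens it (the tail gains probability).
    """
    scaled = logits / temperature
    exps = np.exp(scaled - np.max(scaled))  # subtract max for numerical stability
    return exps / exps.sum()

# Toy scores for three candidate tokens
logits = np.array([2.0, 1.0, 0.1])
print(apply_temperature(logits, 1.0))   # baseline: ~[0.66, 0.24, 0.10]
print(apply_temperature(logits, 0.5))   # sharper: the top token dominates
print(apply_temperature(logits, 1.5))   # flatter: low-probability tokens gain ground
```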

- Top P

This is the most popular sampling method, which OpenAI uses for their API. However, I personally believe that it is flawed in some aspects.

Unsure of where this graph came from, but it's accurate.

With Top P, you keep as many of the most probable tokens as are needed for their probabilities to add up to the target cumulative sum (e.g. 0.9).

But sometimes, when the model's confidence is high for only a few options (but is divided amongst those choices), this leads to a bunch of low probability options being considered anyway. I hypothesize this is a small part of why models like GPT-4, as intelligent as they are, are still prone to hallucination: they are considering extra choices just to meet an arbitrary cumulative sum.

Top K does something even more simplistic: it only ever considers the specified number of top tokens, so Top K = 5 means only the top 5 tokens are considered, always. I'd suggest just leaving it off entirely unless you're debugging.
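To make the contrast concrete, here are rough sketches of both filters (Python/NumPy again, with made-up numbers; real backends work on sorted candidate lists and differ in the details):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, top_p: float) -> np.ndarray:
    """Keep the most probable tokens until their cumulative probability reaches top_p."""
    order = np.argsort(probs)[::-1]                   # most probable first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    kept = np.zeros_like(probs)
    kept[order[:cutoff]] = probs[order[:cutoff]]
    return kept / kept.sum()                          # renormalize the survivors

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k most probable tokens, no matter how the probability is spread."""
    kept = np.zeros_like(probs)
    top = np.argsort(probs)[::-1][:k]
    kept[top] = probs[top]
    return kept / kept.sum()

probs = np.array([0.40, 0.30, 0.15, 0.10, 0.05])      # hypothetical distribution
print(top_p_filter(probs, 0.90))   # keeps four tokens: 0.40+0.30+0.15 < 0.90, so the 0.10 token is pulled in too
print(top_k_filter(probs, 2))      # keeps only the 0.40 and 0.30 tokens, always
```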

So, I created my own sampler, which addresses the design problems you see in both of these popular, widely standardized sampling methods: Min P.

https://preview.redd.it/fl1jtv4qmg0c1.png?width=1002&format=png&auto=webp&s=1fbee0f73cd8c4160a569d88b5f14e2c3c3e9ef2

What Min P is doing is simple: we are setting a minimum value that a token must reach to be considered at all. The value changes depending on how confident the highest probability token is.

So if your Min P is set to 0.1, that means it will only allow for tokens that are at least 1/10th as probable as the best possible option. If it's set to 0.05, then it will allow tokens at least 1/20th as probable as the top token, and so on...
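Expressed as code, the rule is tiny. This is just an illustrative sketch of the idea described above, not the actual implementation in llama.cpp or koboldcpp:

```python
import numpy as np

def min_p_filter(probs: np.ndarray, min_p: float) -> np.ndarray:
    """Keep only tokens at least (min_p * probability of the top token)."""
    threshold = min_p * probs.max()           # the bar scales with the model's confidence
    kept = np.where(probs >= threshold, probs, 0.0)
    return kept / kept.sum()

# Toy numbers: top token at 60%, so with min_p = 0.1 the cutoff is 6%
probs = np.array([0.60, 0.25, 0.08, 0.05, 0.02])
print(min_p_filter(probs, 0.1))   # 0.60, 0.25 and 0.08 survive; 0.05 and 0.02 are dropped
```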

"Does it actually improve the model when compared to Top P?" Yes. And especially at higher temperatures.

Both of these hallucinate to some degree, of course, but there's a clear winner in terms of 'not going crazy'...

No other samplers were used. I ensured that Temperature came last in the sampler order as well (so that the measurements were consistent for both).

You might think, "but doesn't this limit the creativity then, since we are setting a minimum that blocks out more uncertain choices?" Nope. In fact, it allows for more diverse choices in a way that Top P typically won't.

Let's say you have a Top P of 0.80, and your top two tokens are:

  1. 81%
  2. 19%

Top P would completely ignore the 2nd token, despite it being a pretty reasonable choice. This makes responses more deterministic than they need to be.

This means it's possible for Top P to either consider too many tokens or too few tokens depending on the context; Min P strikes a balance by setting a minimum based on how confident the top choice is.

So, in contexts where the top token is 6%, a Min P of 0.1 will only consider tokens that are at least 0.6% probable. But if the top token is 95%, it will only consider tokens at least 9.5% probable.

0.05 - 0.1 seems to be a reasonable range to tinker with, but you can go higher without the output becoming too deterministic, with the added plus of not including tail-end 'nonsense' probabilities.

- Repetition Penalty

This penalty is more of a bandaid fix than a good solution to preventing repetition; however, Mistral 7B models especially struggle without it. I call it a bandaid fix because it will penalize repeated tokens even if they make sense (things like formatting asterisks and numbers are hit hard by this), and it introduces subtle biases into how tokens are chosen as a result.

I recommend that if you use this, you do not set it higher than 1.20 and treat that as the effective 'maximum'.
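For reference, here is a sketch of the classic CTRL-style repetition penalty that backends commonly use (details such as the penalty window vary by backend); notice that it has no way of knowing whether a repeat was actually appropriate:

```python
import numpy as np

def repetition_penalty(logits: np.ndarray, recent_tokens: list[int], penalty: float) -> np.ndarray:
    """Penalize every token that already appeared in the recent context window.

    Positive scores are divided by the penalty, negative scores are multiplied,
    so a previously seen token always becomes less likely - even when repeating
    it (a number, a formatting asterisk) would have been the right choice.
    """
    out = logits.copy()
    for tok in set(recent_tokens):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```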

Here is a preset that I made for general purpose tasks.

https://preview.redd.it/dplbhjp6tg0c1.png?width=1024&format=png&auto=webp&s=46cb60d46382ad3736f170998a855743ee98d197

I hope this post helps you figure out things like, "why is it constantly repeating", or "why is it going on unhinged rants unrelated to my prompt", and so on.

There's a lot more I could write about, and I'm also going to write a proper research paper on this. I mainly wanted to share this now because I thought the topic was severely overlooked.

Anyways, I hope this also answers questions like, "why does this preset work better for me?" or "what do these settings even do?". I've been talking to someone who does model finetuning who asked about potentially standardizing settings + model prompt formats in the future and getting in talks with other devs to make that happen.

I have excluded the more 'experimental' samplers from this writeup, as I personally see no benefit when using them. These include Tail Free Sampling, Typical P / Locally Typical Sampling, and Top A (which is a non-linear version of Min P, but seems to perform worse in my subjective opinion). Mirostat is interesting, but it seems less predictable and can perform worse in certain contexts (as it is not a 'context-free' sampling method).

top 42 comments
[–] extopico@alien.top 1 points 11 months ago

Which koboldcpp version allows you to set the sampler order? The latest main branch does not have this available on Linux.

[–] drifter_VR@alien.top 1 points 11 months ago (1 children)

Just tried Min-P with the last versions of sillytavern and koboldcpp and... the outputs were pretty chaotic... not sure if Koboldcpp is supporting Min-P yet

[–] Haiart@alien.top 1 points 11 months ago (1 children)

It's working perfectly fine for me in KoboldCPP.

Check whether you forgot to disable any other sampling methods: you have to neutralize everything (Top-p at 1, Top-K at 0, Top-A at 0, Typ. at 1, TFS at 1, Seed at -1 and Mirostat Mode OFF) so that ONLY Min-p is enabled (and, if you NEED it, you can activate Repetition Penalty at 1.05~1.20 at maximum; I personally use RpRng. 2048 and RpSlp. 0.9, but don't bother with these unless you enable Repetition Penalty).

Also, with Min-p you should be using a higher Temperature: start with Temperature at 1.5 and Min-p at 0.05, then you can finetune these two numbers at will. Read the post to understand why.

[–] drifter_VR@alien.top 1 points 11 months ago (2 children)

Well I tried the settings given by OP with temp=1.0, will try with higher temps, thanks.

[–] Haiart@alien.top 1 points 11 months ago

Great. Also, remember to always keep an eye on the KoboldCPP GitHub for updates; I noticed that when you said two days ago that you were using 1.48, they already had version 1.50 there.

[–] nixudos@alien.top 1 points 11 months ago

I'm having a lot of fun with it on the following settings for story writing.
I feel like there is loads of great potential in min_P, once I get it dialed in!

https://preview.redd.it/9in73daoix1c1.png?width=619&format=png&auto=webp&s=3f51101d0a40c02ef46de163a707164d28a68f7f

[–] CardAnarchist@alien.top 1 points 1 year ago (1 children)

Hi thanks a lot for this, I haven't seen a good guide to these settings until now.

As someone who always runs Mistral 7B models, I have two questions:

  1. For a general default for all mistral models would you recommend a Repetition Penalty setting of 1.20?

  2. I run Mistral models at 8192 context. What should I set the Repetition Penalty Range at?

Thanks again for the great info and of course for making Min P!

[–] Broadband-@alien.top 1 points 1 year ago (1 children)

I've experimented with turning repetition penalty off completely and haven't noticed much of a change so far.

[–] CardAnarchist@alien.top 1 points 1 year ago

I set up exactly as OP's example showed but with 1.20 Repetition Penalty. The output was... quite bad, worse than I was getting before tampering with all the settings.

I changed Repetition Penalty Range to match my context (8192) and that improved the output, but it was still pretty bad.

I tried a Repetition Penalty of 1.0 and that was much better, but it tended to repeat after a bit (a common Mistral problem).

I tried 1.1 Repetition Penalty and it was close but still a bit too dumb / random.

1.05 Repetition Penalty seems to be a nice sweet spot for me atm. I do think the output is now better than what I had previously.

Strange that you don't see much difference with the Repetition Penalty setting. It massively alters my outputs (when set up like OP's).

I'm using OpenChat 3.5 7B for reference.

Thanks! V informative, will keep for reference👍🏼

[–] ProperShape5918@alien.top 1 points 1 year ago

Needed to use a language model just to read this.

[–] ReMeDyIII@alien.top 1 points 1 year ago (1 children)

I find it comical that it took this long to get a proper dissection of what these settings mean, and to no surprise it spiked to 387 upvotes in 13 hours.

[–] Excessive_Etcetra@alien.top 1 points 1 year ago

I could have used this four months ago, lol. Thank you OP for finally making it make sense.

[–] FPham@alien.top 1 points 1 year ago

The proof is in the pudding: blind tests, just like ooba did a while ago with the older samplers.

Language is way too complex to approach it from the math side and assert "this should work better". In theory yes, but we need blind tests.

[–] berzerkerCrush@alien.top 1 points 1 year ago

That's a high quality post!

[–] Super_Pole_Jitsu@alien.top 1 points 1 year ago

This is absolutely golden, and is probably the reason for the absolutely shit performance I got on my local models. You should definitely write a paper about this!

[–] nsfw_throwitaway69@alien.top 1 points 1 year ago

min P seems similar to tail free sampling. I think the difference is that TFS tries to identify the "tail" by computing the derivative of the token probability function.

[–] sophosympatheia@alien.top 1 points 1 year ago

Awesome post! Thanks for investing the time into this, u/kindacognizant.

I have been playing around with your suggested Min-P settings and they kick butt. It feels close to mirostat subjectively, certainly no worse, and you made some convincing arguments for the Min-P approach. I like the simplicity of it too. I think I'll be using Min-P primarily from now on.

[–] _Andersinn@alien.top 1 points 11 months ago

Thank you - I used to think I was the only one who had no idea how any of this works.

[–] dnsod_si666@alien.top 1 points 11 months ago

This may be a dumb question, but why do we use any sampling modifications at all? Is that not defeating the purpose of the model training to learn those probabilities?

[–] Dead_Internet_Theory@alien.top 1 points 11 months ago (1 children)

OP, this post is fantastic.

I wonder, is this a case of the community doing free R&D for OpenAI, or do they truly have a good reason for using naive sampling?

Also the graph comes from here, a bunch of other graphs there too.

[–] kindacognizant@alien.top 1 points 11 months ago

I posted that GitHub issue. The original Top K vs Top P graph wasn't made by me and I can't find the original source, but I made the Min P one and the others.

[–] psi-love@alien.top 1 points 11 months ago

Really nice explanation, thank you!

So if I only want min_p sampling at 0.05 to work with llama.cpp, for example, what values should the other sampling parameters like top_k (0?), top_p (1.0?) and temperature (1.0?) be set to so they have no influence?

[–] nggakmakasih@alien.top 1 points 11 months ago

Is this available in Text Generation Inference (Hugging face TGI)?

[–] silenceimpaired@alien.top 1 points 11 months ago (1 children)

So helpful… but Yi and llamacpp_hf just fall apart for me… complete gibberish on Oobabooga. Exl hf… fine. Llama.cpp fine… Min-P is there and I can apparently use it, but temperature last is missing :/

[–] kindacognizant@alien.top 1 points 11 months ago (1 children)

Temperature last is the assumed default of llama.cpp which means it is working.

Unfortunately the HF loader seems to have a bug with Min P in Ooba.

[–] silenceimpaired@alien.top 1 points 11 months ago (1 children)

Well then… Thanks! I’ll use llama.cpp and be happy. Glad to hear llamacpp_hf is crazy and not me. Which tool do you prefer outside of Oobabooga?

[–] kindacognizant@alien.top 1 points 11 months ago

Koboldcpp! Single exe, runs with very little dependency bloat, and is still blazing fast as long as you can offload the whole model.

[–] Monkey_1505@alien.top 0 points 1 year ago (1 children)

I use Tail Free Sampling all the time, exclusively and I never touch anything else.

[–] kindacognizant@alien.top 0 points 1 year ago (1 children)

What frontends do you use?

[–] Monkey_1505@alien.top 1 points 1 year ago (1 children)
[–] empire539@alien.top 1 points 11 months ago (1 children)

SillyTavern has Min-P support, but I'm not sure if it works with all backends yet. In 1.10.9's changelog, Min-P was hidden behind a feature flag for KoboldCPP 1.48 or Horde.

[–] drifter_VR@alien.top 1 points 11 months ago (1 children)

Just tried Min-P with the last versions of sillytavern and koboldcpp and... the outputs were pretty chaotic...

[–] _Erilaz@alien.top 1 points 11 months ago (1 children)

What settings did you use? I didn't run into this issue; in fact, min-P sampling helps tame the chaotic nature of some models without putting them on rails.

[–] drifter_VR@alien.top 1 points 11 months ago (1 children)

I used the settings given by OP with temp=1 and min-P=0.1

[–] _Erilaz@alien.top 1 points 11 months ago (1 children)
[–] drifter_VR@alien.top 1 points 11 months ago (1 children)

I mostly use 34B models now, but I must admit those models are already a bit chaotic by nature haha

[–] _Erilaz@alien.top 1 points 11 months ago

Which 34B? CodeLLaMAs or Yi-34B?

[–] Blacksmith_Strange@alien.top 0 points 1 year ago (2 children)

What settings would you recommend for GPT-4 turbo?

[–] SkillDistinct4940@alien.top 1 points 11 months ago (1 children)

I'm using OpenAI GPT models. I'm struggling to get a consistent, identical response for an app I'm trying to make, which requires the LLM to respond deterministically, giving the same output each time for the prompt we feed into it. I'm currently getting mixed results with the defaults.

What top p and temperature settings should I provide it?

Would giving just temperature 0 be the right thing?

Do I need to give top p too?

[–] Aphid_red@alien.top 1 points 11 months ago

If you want determinism, use a seed. The actual sampler settings shouldn't matter. This way you can get the same output for the same prompt every time (and thus, for example, cache common prompts).

If you want to also 'measure' things about the model such as its perplexity, or the ability to see how well it can predict an existing text, use top k=1, temperature = 1.0, disable all other samplers, and correct it whenever it predicts the wrong next token. (Don't let the model generate more than one token at a time).
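A small illustrative sketch of both suggestions (generic Python, not tied to any particular API or backend): fixing the seed makes the random draw reproducible, and greedy picking (equivalent to top k = 1) removes randomness from token selection entirely.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # fixed seed -> identical draws for identical prompts

def pick_token(probs: np.ndarray, greedy: bool = False) -> int:
    """Greedy argmax (top k = 1) or a seeded random draw from the distribution."""
    if greedy:
        return int(np.argmax(probs))
    return int(rng.choice(len(probs), p=probs))
```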