this post was submitted on 25 Nov 2023

LocalLLaMA: community to discuss Llama, the family of large language models created by Meta AI.

https://huggingface.co/kalomaze/MiniSymposium-Demo

MiniSymposium is an experimental model based on Mistral 7b. I created it to test these goals:

  1. Demonstrate the untapped potential of using a small, focused dataset of handwritten examples instead of training on a large amount of synthetic GPT outputs, by lowering the learning rate and doing many passes over the small dataset
  2. Create a dataset that allows the model to explore different possible answers from multiple perspectives before reaching a final conclusion ('Socratic prompting'?)
  3. Develop a model that performs well across various pseudo-markdown prompt formats, rather than overfitting to a specific kind of format such as ChatML, which should naturally benefit other general purpose use cases
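
Two illustrative prompt templates of the kind being contrasted in goal 3 (the exact formats in the MiniSymposium dataset aren't shown here, so both examples are assumptions):

```python
# Hypothetical templates for illustration only; not taken from the actual dataset.

# A ChatML-style template (the kind of single fixed format to avoid overfitting to):
chatml_prompt = (
    "<|im_start|>user\n"
    "Explain why the sky is blue.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# One of many possible pseudo-markdown templates:
markdown_prompt = (
    "### Question\n"
    "Explain why the sky is blue.\n\n"
    "### Answer\n"
)
```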

The current trend in QLoRA/LoRA-based finetuning (and finetuning in general for local LLMs) is to use large synthetic datasets, typically GPT-generated, trained with relatively high learning rates.

However, I believe there is a lot of potential in small, hand-written datasets, even for general-purpose instruction following, as long as you train for many epochs at a learning rate low enough to avoid overfitting.

This approach, I hypothesize, helps the model learn the deeper patterns of instruction following, including the small details. It should help the model avoid shallow data biases (like "As an AI made by OpenAI" and other GPT-isms) that are irrelevant to deeper instruction-following patterns, especially in long-context and multi-turn scenarios.

My initial configuration for this QLoRA model used a constant learning rate of 1e-6 (0.000001), which resulted in obvious, massive overfitting after about 100 epochs. The model started reproducing the original dataset almost verbatim and generalized poorly across different prompt formats, with obvious hallucinations and, for some reason, Chinese-language outputs.

However, turning the learning rate down to a tenth of that (1e-7, or 0.0000001) significantly improved the model on the exact same small dataset. I trained for roughly 10 hours on my RTX 3060 to reach 600 epochs; I think it's still a little undertrained, but I encourage people to try the demo model in the meantime.
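
For illustration, here is a minimal sketch of this kind of QLoRA setup using the Hugging Face transformers/peft stack. The hyperparameters, target modules, and data handling below are assumptions, not the actual MiniSymposium recipe:

```python
# A minimal sketch: constant low LR, many epochs over a tiny handwritten dataset.
# Everything marked "assumed" or "stand-in" is an assumption for illustration.
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, Trainer, TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(
    base,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=32, lora_alpha=32, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
))

# Stand-in for the small handwritten dataset (a single toy example here).
example = tokenizer("### Question\nWhy is the sky blue?\n\n### Answer\nRayleigh scattering.")
example["labels"] = example["input_ids"].copy()
train_dataset = Dataset.from_list([dict(example)])

args = TrainingArguments(
    output_dir="minisymposium-qlora",
    per_device_train_batch_size=1,
    num_train_epochs=600,          # many passes over the tiny dataset
    learning_rate=1e-7,            # the constant low LR from the post
    lr_scheduler_type="constant",
    logging_steps=10,
    save_strategy="epoch",
    save_total_limit=2,
    bf16=True,
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```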

https://preview.redd.it/54imvd09ee2c1.png?width=1561&format=png&auto=webp&s=a0e603f5f5a960189b0d225ab5581f2a0339d12d

https://preview.redd.it/al6gmpuaee2c1.png?width=1132&format=png&auto=webp&s=5704aa41e87a5555664405d2f0178287bd7bde35

https://preview.redd.it/7fs90ictee2c1.png?width=1140&format=png&auto=webp&s=7f94c1d76493673d83e0d066efe9f43e21205fe7

It's designed to be very adaptable to different prompt formats and playing roles, and I've gotten some fun and sometimes surprisingly good outputs so far.

A few samples of the training data are formatted like this to help avoid blatant overconfidence in its outputs, serving as a sort of self-correction mechanism:

https://preview.redd.it/vlmyw1smfe2c1.png?width=2448&format=png&auto=webp&s=4c2cfea77188b9529c2c0c1c1fe29af9d152f0bf

Let me know how this model goes. There are lots of model merges that are all sort of doing the same thing, so I figured a more experimental approach would be appreciated. I think there is still more optimization to be done on the LR/epoch balance, and I'll probably add some more examples of specific tasks like summarization to the dataset so that it's not *too* small (but still lightweight enough to generalize well).

top 20 comments
[–] opi098514@alien.top 1 points 11 months ago

I’ll fuck around with it when I get home.

[–] Revolutionalredstone@alien.top 1 points 11 months ago (5 children)

Multiple passes at lower learning rates aren't supposed to produce different results.

(Assuming your mini-batching etc. is all set up correctly.) Nonetheless, I love exploration and can't wait to learn more, thanks for sharing dude!

[–] kindacognizant@alien.top 1 points 11 months ago

> Multiple passes at lower learning rates aren't supposed to produce different results.

Oh, I was wrong on this, then, my bad.

So would my interpretation be correct that this is essentially still causing overfitting, just significantly slower, and that a higher LR would also work? The problem is that, at first, the average loss tanked to near zero in the span of about a single epoch, which overfit, but this LR didn't have the same effect.

[–] kindacognizant@alien.top 1 points 11 months ago

GPT-4 is claiming this comment's claim is wrong, but I can't trust it blindly, of course; I'll look into my initial claim to verify.

[–] vasileer@alien.top 1 points 11 months ago (3 children)

> Multiple passes at lower learning rates aren't supposed to produce different results.
>
> Overfitting is not a technical challenge, it's a mathematical property which undeniably exists whenever the training data is smaller than the full problem domain and, simultaneously, the learning rate (importantly, multiplied by the number of epochs!) would result in a higher specialization ratio on the learned vs. unobserved data than would be expected based on the ratio of the learned to unobserved size.
>
> Basically, if you learn 1-digit addition but half your training examples have 1 as the left number and none of them have 5 as the left number, then your model will likely treat 5 and 1 the same (since it's so overtrained on examples with 1s).

GPT-4:

The statement contains several inaccuracies:

  1. Multiple passes at lower learning rates: It's not entirely true that multiple passes with lower learning rates will produce identical results. Different learning rates can lead to different convergence properties, and multiple passes with lower learning rates can help in fine-tuning the model and potentially avoid overfitting by making smaller, more precise updates to the weights.
  2. Overfitting as a mathematical property: Overfitting is indeed more of an empirical observation than a strict mathematical property. It is a phenomenon where a model learns the training data too well, including its noise and outliers, which harms its performance on unseen data. It's not strictly due to the size of the training data but rather the model's capacity to learn from it relative to its complexity.
  3. Learning rate multiplied by the number of epochs: The learning rate and the number of epochs are both factors in a model's training process, but their product is not a direct measure of specialization. Instead, it's the learning rate's influence on weight updates over time (across epochs) that can affect specialization. Moreover, a model's capacity and the regularization techniques applied also significantly influence overfitting.
  4. Example of learning 1 digit addition: The example given is somewhat simplistic and does not fully capture the complexities of overfitting. Overfitting would mean the model performs well on the training data (numbers with 1) but poorly on unseen data (numbers with 5). However, the example also suggests a sampling bias in the training data, which is a separate issue from overfitting. Sampling bias can lead to a model that doesn't generalize well because it hasn't been exposed to a representative range of the problem domain.

Overall, while the intention of the statement is to describe overfitting and the effects of learning rates, it conflates different concepts and could benefit from clearer differentiation between them.

[–] kindacognizant@alien.top 1 points 11 months ago

I am inclined to believe GPT-4 since it consistently claims this across both the API and your comment... but I'm not sure.

[–] ganzzahl@alien.top 1 points 11 months ago

Much simpler than GPT-4 – the person above seems to be referring to gradient accumulation (since they mentioned minibatches), where you add up gradients until you reach the target batch size, then apply them. This is perfectly equivalent to training on a larger batch.

Actually training on small batches with a low learning rate, however, and applying the gradients immediately, is definitely not equivalent to a bigger batch with a bigger learning rate, especially if you're in a particularly unstable part of parameter space, where large learning rates might overshoot. On the other hand, the tiny batches would tend to make the direction your model moves somewhat random, which might be good, might be bad.

Whether or not this actually does what OP wants it to is really just an empirical question. If they did it, and it worked better than bigger batches with the same data, then I guess it helped (in this case with this model and this data), haha
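
To make the distinction concrete, here's a tiny PyTorch sketch (toy model and random data, purely illustrative) contrasting the two update styles described above:

```python
# Accumulating gradients over several minibatches before stepping behaves like
# one larger batch; stepping after every minibatch does not, because each step
# moves the point where the next gradient is taken.
import torch

model = torch.nn.Linear(8, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(4)]

# (a) Gradient accumulation: one update computed from all four minibatches,
#     equivalent to a single batch of 16.
opt.zero_grad()
for x, y in batches:
    (loss_fn(model(x), y) / len(batches)).backward()  # scale to match the big batch
opt.step()

# (b) Immediate updates: four separate steps, each evaluated at a new point in
#     parameter space, so the trajectory can differ from (a).
for x, y in batches:
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
```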

[–] Revolutionalredstone@alien.top 1 points 11 months ago

Good info thanks for that!

[–] keepthepace@alien.top 1 points 11 months ago (2 children)

> Multiple passes at lower learning rates aren't supposed to produce different results.

Oh yes it is. The whole point of gradient descent is to slowly explore the dimensions of the gradient. With smaller steps you have a totally different trajectory than with bigger steps. And every pass makes you move.

If you choose too small a learning rate, you often will indeed just move more slowly along the same path, but too big a learning rate makes you skip entire paths.

OP seems to have been in that case with their first attempt.

[–] kindacognizant@alien.top 1 points 11 months ago (1 children)

So you're saying my intuition isn't wrong, necessarily, that slow training to learn the small subtle details could work as long as the dataset wasn't *too* limited in scope?

[–] StaplerGiraffe@alien.top 1 points 11 months ago (1 children)

You are correct. A small learning rate allows fine adjustments to the parameters and thereby learning subtle features. However, learning subtle features is useless initially, since you need to learn the coarse features first. That's why learning-rate schedulers go from a large learning rate to a small one. The tricky bit is doing the minimal amount of training at the large learning rate. That is where various optimizers come in, which try to automate these kinds of things.

You could try to do this by hand by saving checkpoints periodically and finding the point where you go from undertrained to overtrained. Then pick a checkpoint that is slightly undertrained and start training from there with a lower learning rate.
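
A rough sketch of that by-hand procedure, with a toy stand-in model, hypothetical checkpoint names, and the actual dataset pass elided:

```python
# Checkpoint periodically at the higher learning rate, pick a slightly
# undertrained checkpoint by inspection, then resume from it at a lower LR.
import torch

model = torch.nn.Linear(8, 1)  # stand-in for the real model

def train_and_checkpoint(model, lr, n_epochs, tag):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(n_epochs):
        # ... one full pass over the dataset with opt.step() calls goes here ...
        torch.save(model.state_dict(), f"{tag}-epoch{epoch}.pt")

# Phase 1: coarse training at a larger learning rate, saving every epoch.
train_and_checkpoint(model, lr=1e-5, n_epochs=50, tag="coarse")

# Phase 2: reload the checkpoint judged slightly undertrained (e.g. epoch 30),
# then continue training from there at the lower learning rate.
model.load_state_dict(torch.load("coarse-epoch30.pt"))
train_and_checkpoint(model, lr=1e-7, n_epochs=500, tag="fine")
```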

[–] kindacognizant@alien.top 1 points 11 months ago (1 children)

Considering there's an implementation of the cosine scheduler with warmup steps, is there any implementation of a scheduler that starts slow, then rapidly accelerates, and finally stabilizes to learn the subtle features (like a sigmoidal function?), to avoid starting too high in the first place?

https://preview.redd.it/qb1z0n7oci2c1.png?width=1200&format=png&auto=webp&s=15dbab7b3a18ab918defbbbe2ab6816aaa46b489
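
For what it's worth, here's a sketch of what such a sigmoid-shaped ramp might look like using PyTorch's LambdaLR; this isn't an existing named scheduler in the transformers library, and the shape parameters are arbitrary:

```python
# LR multiplier follows a sigmoid: near zero at first, steepest around
# `midpoint`, then flattening out at the peak learning rate.
import math
import torch

model = torch.nn.Linear(8, 1)                        # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=1e-6)   # peak learning rate

total_steps = 1000
midpoint = 0.25 * total_steps     # step where the ramp-up is steepest
steepness = 20.0 / total_steps

def sigmoid_ramp(step):
    # Multiplier rises from near 0 to 1 as training progresses.
    return 1.0 / (1.0 + math.exp(-steepness * (step - midpoint)))

scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=sigmoid_ramp)

for step in range(total_steps):
    # ... forward/backward on a batch would go here ...
    opt.step()        # placeholder step so the scheduler ordering is correct
    scheduler.step()
```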

[–] StaplerGiraffe@alien.top 1 points 11 months ago

Honestly, no idea. I have more theoretical than practical understanding. But my idea of the warmup phase is that it arranges the initially random weights of a network into something you can optimize on. When finetuning you don't start from randomness, you start from a trained checkpoint, so I expect the warmup phase is pointless (at least for SGD; no idea if it helps adaptive optimizers). So I believe you should go from a high learning rate to a low learning rate, unless somebody knows better.

Oh, and when training LoRAs, remember that changing alpha also changes the effective learning rate by the same factor, if I remember right. So many tests about the optimal alpha are probably invalid, because people didn't adjust the learning rate.
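
For intuition, a small numeric sketch of that scaling relationship, assuming the standard LoRA formulation where the adapter's contribution to the weights is scaled by alpha / r (illustrative only, not tied to any particular training framework):

```python
# The adapter's contribution to the weights is (alpha / r) * B @ A, so doubling
# alpha doubles the size of the effective update, roughly like doubling the
# learning rate would (exact optimizer dynamics differ). This is why alpha
# comparisons without matching LR adjustments can be misleading.
import torch

r, d = 8, 64
A = torch.randn(r, d) * 0.01
B = torch.randn(d, r) * 0.01

def lora_delta(alpha):
    # The effective weight update contributed by the adapter.
    return (alpha / r) * (B @ A)

print(lora_delta(16).norm() / lora_delta(8).norm())   # ~2.0
```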

[–] Revolutionalredstone@alien.top 1 points 11 months ago

Good to know thank you

[–] involviert@alien.top 1 points 11 months ago (1 children)

I think that's not the whole story. The smaller increments can lead to "course changes" that would not have happened otherwise. They might let things slip into other local minima and all that. It's not just several small steps instead of one big one: the straight line that is the big step becomes a curve, capable of bringing you to an entirely different place. The whole dataset can have its impact before some giant leaps jump off in a single direction. As a layman, maybe I've got this wrong, but I really don't see how you can categorically dismiss the possibility of creating a much more robust and effective architecture instead of essentially jumping to conclusions and then somewhat fixing them up.

[–] Revolutionalredstone@alien.top 1 points 11 months ago

That makes sense cheers

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago (1 children)

I like the idea of trying to get a tiny 13-sample dataset to work. Can you upload the adapter files or the full fp16 model? With what you have uploaded currently, only llama.cpp and its derivatives can be used for inference. If you uploaded the adapter files, someone could merge them with the base model and, for example, run it in exllama.

[–] sergeant113@alien.top 1 points 11 months ago (1 children)

I think the idea has merit. How do you propose for others to contribute to the training dataset?

[–] FullOf_Bad_Ideas@alien.top 1 points 11 months ago (1 children)

They are really small; one person can make them relatively quickly, so I don't think there are huge gains to be had by splitting the work. You can always push the dataset to Hugging Face and make it public, allowing others to add their own samples.

[–] kindacognizant@alien.top 1 points 11 months ago