Train Smarter, Not Harder? - MiniSymposium 7b (alien.top)

submitted 10 months ago by kindacognizant@alien.top to c/localllama@poweruser.forum


https://huggingface.co/kalomaze/MiniSymposium-Demo

MiniSymposium is an experimental model that I created based on Mistral 7b. I created it with these goals in mind:

Demonstrate the untapped potential of using a small, focused dataset of handwritten examples instead of training on a large amount of synthetic GPT outputs, by lowering the learning rate and doing many passes over the small dataset
Create a dataset that allows the model to explore different possible answers from multiple perspectives before reaching a final conclusion ('Socratic prompting'?)
Develop a model that performs well across various pseudo-markdown prompt formats, rather than overfitting to a specific kind of format such as ChatML, which should naturally benefit other general purpose use cases
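As a rough illustration of the third goal, the sketch below renders one conversation in several pseudo-markdown layouts, so that a dataset could mix formats rather than commit to one. The template names and layouts here are hypothetical, chosen only for the example; they are not taken from the MiniSymposium dataset.

```python
# Hypothetical templates: these names and layouts are illustrative assumptions,
# not the formats used in the MiniSymposium dataset.
TEMPLATES = {
    "hash_headers": "### {role}:\n{text}\n\n",
    "bold_names": "**{role}:** {text}\n\n",
    "plain_names": "{role}: {text}\n",
}


def render(turns, style):
    """Render a list of (role, text) turns in the chosen template style."""
    template = TEMPLATES[style]
    return "".join(template.format(role=role, text=text) for role, text in turns)


def render_all(turns):
    """Return the same conversation rendered once per template."""
    return {style: render(turns, style) for style in TEMPLATES}


if __name__ == "__main__":
    conversation = [
        ("Instruction", "Name one primary color."),
        ("Response", "Red."),
    ]
    for style, text in render_all(conversation).items():
        print(f"--- {style} ---")
        print(text)
```

Rendering each sample in more than one layout is only one possible way to discourage a model from latching onto a single format.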

The current trend in QLoRA/LoRA-based finetuning (and finetuning in general for local LLMs) is to use large synthetic datasets. These are typically GPT-generated and trained on with relatively high learning rates.

However, I believe there is a lot of potential in using small, hand-written datasets instead, even for general-purpose instruction following, as long as you train for many epochs at a learning rate low enough to avoid overfitting.

This approach, I hypothesize, helps the model learn the deeper patterns of instruction following, including the small details. This should help to avoid shallow data biases (like "As an AI made by OpenAI" and other GPT-isms) that are irrelevant to deeper instruction following patterns, especially in long context and multiturn scenarios.

My initial configuration for this QLoRA model used a constant learning rate of 1e-6 (0.000001), which resulted in obvious, massive overfitting after about 100 epochs. The model started reproducing the original dataset almost verbatim and generalized poorly across different prompt formats, with obvious hallucinations and, for some reason, Chinese-language outputs.
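One rough way to put a number on "reproducing the dataset almost verbatim" is to measure how much of a generated output appears as a single unbroken run inside any training sample. The sketch below does this with the standard library; the function names and the idea of treating a high score as a warning sign are assumptions for illustration, not something used in the original training run.

```python
from difflib import SequenceMatcher


def longest_shared_run(output, sample):
    """Length in characters of the longest substring shared by both strings."""
    matcher = SequenceMatcher(None, output, sample, autojunk=False)
    match = matcher.find_longest_match(0, len(output), 0, len(sample))
    return match.size


def memorization_score(output, training_samples):
    """Largest fraction of the output copied verbatim from any single sample."""
    if not output or not training_samples:
        return 0.0
    longest = max(longest_shared_run(output, s) for s in training_samples)
    return longest / len(output)
```

A score close to 1.0 on prompts that are not in the dataset would suggest the model is reciting training text rather than generalizing.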

However, turning the learning rate down to a tenth of that (1e-7, which is 0.0000001) significantly improved the model with the exact same small dataset. I trained for about 10 hours on my RTX 3060, up to 600 epochs; I think it's still a little undertrained, but I encourage people to try the demo model out in the meantime.
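To compare the two runs, it can help to look at the learning rate summed over every optimizer step, which is only a crude proxy for how far the weights can move. The sketch below assumes a 13-sample dataset (the figure quoted in the comments), a batch size of 1, and no gradient accumulation; none of these settings are confirmed by the post.

```python
import math


def optimizer_steps(num_samples, epochs, batch_size=1):
    """Number of optimizer steps for a run with no gradient accumulation."""
    return epochs * math.ceil(num_samples / batch_size)


def cumulative_lr(learning_rate, num_samples, epochs, batch_size=1):
    """Constant learning rate summed over every step: a crude 'update budget'."""
    return learning_rate * optimizer_steps(num_samples, epochs, batch_size)


if __name__ == "__main__":
    NUM_SAMPLES = 13  # assumption: dataset size quoted in the comments
    first = cumulative_lr(1e-6, NUM_SAMPLES, epochs=100)
    second = cumulative_lr(1e-7, NUM_SAMPLES, epochs=600)
    print(f"1e-6 for 100 epochs: {first:.2e}")
    print(f"1e-7 for 600 epochs: {second:.2e}")
    print(f"ratio second/first: {second / first:.2f}")
```

By this measure the 600-epoch run at 1e-7 spends about 0.6 times the budget of the 100-epoch run at 1e-6, whatever the dataset size, so six times as many passes still add up to a smaller total.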

https://preview.redd.it/54imvd09ee2c1.png?width=1561&format=png&auto=webp&s=a0e603f5f5a960189b0d225ab5581f2a0339d12d

https://preview.redd.it/al6gmpuaee2c1.png?width=1132&format=png&auto=webp&s=5704aa41e87a5555664405d2f0178287bd7bde35

https://preview.redd.it/7fs90ictee2c1.png?width=1140&format=png&auto=webp&s=7f94c1d76493673d83e0d066efe9f43e21205fe7

It's designed to be very adaptable to different prompt formats and playing roles, and I've gotten some fun and sometimes surprisingly good outputs so far.

A few samples of the training data are formatted like this to help avoid blatant overconfidence in its outputs, to serve as a sort of self-correction mechanism:

https://preview.redd.it/vlmyw1smfe2c1.png?width=2448&format=png&auto=webp&s=4c2cfea77188b9529c2c0c1c1fe29af9d152f0bf

Let me know how this model goes. There are lots of merges of models that are all sort of doing the same thing, so I figured a more experimental approach would be appreciated. I think there is still more optimization to be done on the LR/epoch balance, and I'll probably add some more examples of specific tasks like summarization to the dataset so that it's not *too* small (but still lightweight enough to generalize well).


[–] FullOf_Bad_Ideas@alien.top 1 points 10 months ago (3 children)

I like the idea of trying to get a tiny 13-sample dataset to work. Can you upload the adapter files or a full fp16 model? With what you have uploaded currently, only llama.cpp and its derivatives can be used for inference. If you uploaded the adapter files, someone could merge them with the base model and, for example, run it in exllama.
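For context on what merging an adapter involves: in standard LoRA, each adapted weight matrix W is replaced by W + (alpha / r) * B @ A, where A and B are the low-rank adapter matrices, r is the rank, and alpha is the scaling hyperparameter. The sketch below shows that arithmetic on small matrices stored as plain Python lists; the helper names are hypothetical, and in practice a library such as PEFT would do this on the real model tensors.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a) as a new matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]
```

Once merged, the result is an ordinary weight matrix of the original shape, which is why a merged model can be loaded by backends that know nothing about adapters.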

[–] sergeant113@alien.top 1 points 10 months ago (1 children)

I think the idea has merit. How do you propose that others contribute to the training dataset?

[–] FullOf_Bad_Ideas@alien.top 1 points 10 months ago (1 children)

These datasets are really small and one person can make them relatively quickly, so I don't think there are huge gains to be had by splitting the work. You can always push the dataset to Hugging Face and make it public, allowing others to add their samples.

[–] kindacognizant@alien.top 1 points 10 months ago

https://huggingface.co/datasets/kalomaze/MiniSymposium-Demo-Dataset

Feel free to submit examples to the community tab
