I’m not aware of any way to accomplish what you’re describing besides those you’ve ruled out (federated learning and mixtures of experts). Naively averaging weights of models trained on disjoint datasets won’t work for LLMs or 1+ hidden layer DNNs (though it will for logistic or linear models). This sounds to me like an open research question.
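For concreteness, here is a minimal sketch of what “naively averaging weights” means in this discussion, assuming two PyTorch models with identical architectures (the model variables and the merged model in the usage comment are placeholders):

```python
import copy
import torch

def average_state_dicts(model_a, model_b):
    """Return a state_dict whose tensors are the elementwise mean of the two
    models' parameters; both models must share the exact same architecture."""
    avg = copy.deepcopy(model_a.state_dict())
    sd_b = model_b.state_dict()
    for name in avg:
        if avg[name].is_floating_point():
            avg[name] = (avg[name] + sd_b[name]) / 2.0
    return avg

# usage sketch: load the averaged weights into a third model of the same architecture
# merged_model.load_state_dict(average_state_dicts(model_a, model_b))
```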
> Naively averaging weights of models trained on disjoint datasets won’t work for LLMs or 1+ hidden layer DNNs
Why would simply aggregating the weights like this categorically fail to produce a reasonable model? Assuming of course that the datasets are all “the same” in some meaningful sense (e.g., equally representative of the same underlying X→Y mappings).
Here’s a simple intuition as to why averaging the weights of a 1+ hidden layer NN won’t work: pick a hidden layer in your model and apply a permutation matrix to its weights (along the input axis) and the inverse permutation to the previous layer’s weights (along the output axis). Obviously the model is unchanged from an input/output perspective. Repeat that for N different permutations whose average is uniform, e.g. the N cyclic shifts (where N is the input dimension of the hidden layer you picked). You now have N models that are identical from an input/output perspective. But if you average those N weight sets, every unit feeding that layer ends up with the same averaged weights, so the layer collapses to a single effective unit. The averaged model is completely broken even though it’s the average of N models that are “identical” from an input/output perspective. QED.
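A small numerical illustration of that argument, as a sketch in NumPy with made-up layer sizes: each permuted copy of a two-layer MLP computes exactly the same function, yet averaging the permuted weight sets collapses the hidden layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3

# Tiny 2-layer MLP: y = W2 @ relu(W1 @ x)
W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))
relu = lambda z: np.maximum(z, 0)
forward = lambda A, B, x: B @ relu(A @ x)

x = rng.normal(size=d_in)

# N = d_hidden cyclic permutations of the hidden units.
perms = [np.roll(np.arange(d_hidden), k) for k in range(d_hidden)]

# Each permuted copy is functionally identical to the original model...
for p in perms:
    assert np.allclose(forward(W1[p], W2[:, p], x), forward(W1, W2, x))

# ...but averaging the N permuted weight sets makes every hidden unit's
# incoming weights identical, so the hidden layer collapses.
W1_avg = np.mean([W1[p] for p in perms], axis=0)
W2_avg = np.mean([W2[:, p] for p in perms], axis=0)
print(np.allclose(W1_avg, W1_avg[0]))                    # True: all rows identical
print(forward(W1_avg, W2_avg, x), forward(W1, W2, x))    # outputs no longer match
```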
Interesting. I love a good thought experiment :)
But what about the idea of bagging, i.e. aggregating multiple models that have each been trained on different examples and thus learned different things? Why is that not subject to a similar criticism?
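For contrast, a hedged sketch of bagging as it is usually meant, i.e. aggregating predictions rather than parameters, so individual hidden units never need to be aligned across models (the dataset and model settings here are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Train several small MLPs on different bootstrap resamples of the data.
rng = np.random.default_rng(0)
models = []
for seed in range(5):
    idx = rng.choice(len(X), size=len(X), replace=True)
    m = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    m.fit(X[idx], y[idx])
    models.append(m)

# Bagging aggregates *predictions*, not weights, so the permutation
# symmetry of the hidden layers never comes into play.
avg_proba = np.mean([m.predict_proba(X) for m in models], axis=0)
ensemble_pred = avg_proba.argmax(axis=1)
print("ensemble accuracy:", (ensemble_pred == y).mean())
```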
Would it be possible to create a system where every model's training uses a specific set seed and records its exact state, and then to share this information along with the dataset it was trained on so that the training can be reproduced? This could help manage the randomness in training.
Using a set seed means we can make sure that the way the model starts and how it learns during training is the same every time. Essentially, if we restart the training from a certain point with this seed, the model should learn in the same way it did before. Also, by saving and sharing details like the model's structure, which training stage it's in, and the training step, along with the seed, we're essentially taking a 'snapshot' of where the model is at that moment.
Others could use this snapshot to pick up the training right where it was left off, under the same conditions. For merging different models, this technique could help line up how they learn, making it easier and more predictable to combine their training.
Am I thinking right about this or am I missing something? This is just theoretical thinking and I am not an expert on the subject.
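A minimal sketch of the "snapshot" idea described above, assuming PyTorch (the model, optimizer, and file name in the usage comments are placeholders; seeding makes a single training run reproducible, though it does not by itself solve the merging problem):

```python
import torch

def save_snapshot(path, model, optimizer, step, seed):
    """Bundle everything needed to resume training deterministically."""
    torch.save({
        "seed": seed,
        "step": step,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
        "rng_state": torch.get_rng_state(),
    }, path)

def load_snapshot(path, model, optimizer):
    """Restore weights, optimizer state, and RNG state, and return the step."""
    ckpt = torch.load(path)
    model.load_state_dict(ckpt["model_state"])
    optimizer.load_state_dict(ckpt["optimizer_state"])
    torch.set_rng_state(ckpt["rng_state"])
    return ckpt["step"]

# usage sketch:
# torch.manual_seed(42)                      # fix initialization and shuffling
# model = MyModel(); optimizer = torch.optim.AdamW(model.parameters())
# ... train for a while ...
# save_snapshot("snapshot.pt", model, optimizer, step=1000, seed=42)
# step = load_snapshot("snapshot.pt", model, optimizer)   # resume elsewhere
```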
You could use set seeds and checkpoints to serially continue training a model, passing it between different contributors. But I don’t know how you could “merge” different models that are trained independently. I think the challenge here is in the merging, not necessarily in the deterministic part.
What does "checkpoint" mean here? A service?
SingularityNET was working on something like that.
Not sure if that is possible, but even if it is, it would be very inefficient, since each model would have to learn the basics from scratch: grammar, sentence positioning, and so on. You could potentially make a central model that is pre-trained on a large corpus and fine-tuned for a specific task, but then it is just a standard GPT-like model, which, when aggregated as a mixture of experts or an ensemble, can potentially do what you are describing.
Combining multiple independently trained models into one huge model is not possible, because each model learns something different for its task and because general LLM training is inherently stochastic (which is desirable when aggregating information). Unless you are purely looking to "retrieve information", what you describe is not possible with the current standard training regime.
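A hedged sketch of the "aggregated as a mixture of experts or an ensemble" idea from the previous comment: independently fine-tuned experts are combined at the output level by averaging their predicted distributions, instead of merging their weights (the expert modules here are toy stand-ins, not a specific GPT implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnsembleOfExperts(nn.Module):
    """Combine independently fine-tuned experts by averaging their output
    distributions instead of trying to merge their weights."""
    def __init__(self, experts):
        super().__init__()
        self.experts = nn.ModuleList(experts)

    def forward(self, x):
        # Each expert maps the same input to logits over the same classes/vocab.
        probs = [F.softmax(expert(x), dim=-1) for expert in self.experts]
        return torch.stack(probs).mean(dim=0)

# Toy stand-ins for "pre-trained then fine-tuned" experts.
experts = [nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 10))
           for _ in range(3)]
ensemble = EnsembleOfExperts(experts)
x = torch.randn(4, 16)
print(ensemble(x).shape)   # (4, 10): averaged class probabilities
```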