Machine Learning

[Discussion] Let's speak theory. Exploring the Potential of Collaborative Training? (alien.top)

submitted 2 years ago by paryska99@alien.top to c/machinelearning@academy.garden

Hi, you wonderful people!

Here's a thought that came to my mind: Since training LLMs involves a degree of randomness, is there potentially a way to create an architecture for LLMs (or other AI) that would be somewhat deterministic in its training instead?

What I mean is, could a theoretical architecture exist where everyone could train their own separate checkpoints on different datasets, which, after combining, would result in a checkpoint with combined learning from all these different smaller checkpoints?

What this would allow us to do is let thousands of people create their own checkpoints, which when combined would result in something greater than the individual parts themselves. And since the training process is what takes the longest in developing LLMs (or any AI), this approach would allow almost everyone to contribute their share of processing power towards creating something together.

If viable, this could have huge potential implications for Open Source Software.

I'm looking forward to hearing what all of you smart people have to say about it!

[–] synthphreak@alien.top 1 points 2 years ago (1 children)

Naively averaging weights of models trained on disjoint datasets won’t work for LLMs or 1+ hidden layer DNNs

Why would simply aggregating the weights like this categorically fail to produce a reasonable model? Assuming of course that the datasets are all “the same” in some meaningful sense (e.g., equally representative of the same underlying X→Y mappings).

[–] ohmygad45@alien.top 1 points 2 years ago (1 children)

Here’s a simple intuition for why averaging the weights of an NN with one or more hidden layers won’t work. Pick a hidden layer in your model and apply a permutation matrix to its weights (along the input axis) and the inverse permutation matrix to the previous layer (along the output axis). The model is unchanged from an input/output perspective, because you have only relabeled the units in between. Now do this with N different permutations, where N is the input dimension of the hidden layer you picked, chosen so that every unit lands in every position exactly once (the N cyclic shifts will do). You now have N models that are identical from an input/output perspective. If you average those model weights, every permuted unit ends up with the same weights, so they all compute the same value and the layer collapses to what is effectively a single unit. This averaged-weights model is clearly broken, even though it’s the average of N “identical” (from an input/output perspective) models. QED.
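
The thought experiment above can be checked numerically. What follows is only a minimal NumPy sketch, not anything from the thread: the tiny two-layer tanh network, its sizes, and the names `forward` and `permuted_copy` are all assumptions made for illustration. It builds N cyclically permuted copies of one random network, confirms each copy gives the same outputs as the original, then averages their weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy network: d_in -> n_hidden (tanh) -> d_out.
d_in, n_hidden, d_out = 4, 6, 3
W1 = rng.normal(size=(n_hidden, d_in))   # rows index hidden units
b1 = rng.normal(size=n_hidden)
W2 = rng.normal(size=(d_out, n_hidden))  # columns index hidden units


def forward(x, W1, b1, W2):
    """Forward pass of the toy network."""
    h = np.tanh(x @ W1.T + b1)
    return h @ W2.T


def permuted_copy(shift):
    """Relabel hidden units by a cyclic shift; the function is unchanged."""
    perm = np.roll(np.arange(n_hidden), shift)
    return W1[perm], b1[perm], W2[:, perm]


x = rng.normal(size=(5, d_in))
reference = forward(x, W1, b1, W2)

# N copies, one per cyclic shift, so each unit visits each position once.
copies = [permuted_copy(s) for s in range(n_hidden)]
all_equivalent = all(np.allclose(forward(x, *c), reference) for c in copies)

# Naively average the weights of the N equivalent models.
W1_avg = np.mean([c[0] for c in copies], axis=0)
b1_avg = np.mean([c[1] for c in copies], axis=0)
W2_avg = np.mean([c[2] for c in copies], axis=0)

# After averaging, every hidden unit has identical weights and activations.
hidden_avg = np.tanh(x @ W1_avg.T + b1_avg)
units_collapsed = np.allclose(hidden_avg, hidden_avg[:, :1])

# The averaged model no longer matches the models it was averaged from.
output_changed = not np.allclose(forward(x, W1_avg, b1_avg, W2_avg), reference)

print(all_equivalent, units_collapsed, output_changed)
```

Cyclic shifts are used here because they are the simplest set of N permutations in which every unit occupies every position exactly once; averaging over all N! permutations would collapse the layer in the same way.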

[–] synthphreak@alien.top 1 points 2 years ago

Interesting. I love a good thought experiment :)

But what about the idea of bagging? As in aggregating multiple models together that have all been trained on different examples, and thus learned different things. Why is that not subject to similar criticism?