Naively averaging the weights of models trained on disjoint datasets won’t work for LLMs or DNNs with one or more hidden layers
Why would simply averaging the weights like this categorically fail to produce a reasonable model, assuming of course that the datasets are all “the same” in some meaningful sense (e.g., equally representative of the same underlying X→Y mapping)?
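Concretely, here's a toy sketch of the operation I have in mind (PyTorch; the tiny one-hidden-layer MLP, the synthetic data, and helper names like `make_mlp` are purely illustrative, not anyone's actual setup):

```python
import torch
import torch.nn as nn

def make_mlp():
    # Toy one-hidden-layer network (sizes are arbitrary)
    return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

def train(model, X, y, epochs=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return model

# Two disjoint halves of data drawn from the same underlying X -> Y mapping
torch.manual_seed(0)
X = torch.randn(2000, 10)
y = torch.sin(X @ torch.randn(10, 1))
model_a = train(make_mlp(), X[:1000], y[:1000])
model_b = train(make_mlp(), X[1000:], y[1000:])

# "Naive averaging": mean of corresponding parameters, key by key
avg = make_mlp()
sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
avg.load_state_dict({k: 0.5 * (sd_a[k] + sd_b[k]) for k in sd_a})

with torch.no_grad():
    for name, m in [("model_a", model_a), ("model_b", model_b), ("averaged", avg)]:
        print(name, nn.MSELoss()(m(X), y).item())
```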
Interesting. I love a good thought experiment :)
But what about the idea of bagging? That is, aggregating multiple models that have each been trained on different examples, and thus learned different things. Why is that not subject to the same criticism?
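For contrast, a rough sketch of what I mean by bagging (using scikit-learn's `BaggingRegressor` on the same kind of toy data; the base estimator and numbers are arbitrary). The key difference is that the ensemble only ever averages predictions, never weights:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = np.sin(X @ rng.normal(size=10))

# Each base MLP trains on a different bootstrap sample of the data,
# but aggregation happens in output space (mean of predictions).
bag = BaggingRegressor(
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=500),
    n_estimators=5,
    max_samples=0.5,
)
bag.fit(X, y)
print("bagged R^2:", bag.score(X, y))
```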