I think that hypothesis still holds as it has always assumed training data of sufficient quality. This study is more saying that the places where we've traditionally harvested training data from are beginning to be polluted by low-quality training data
From the article:
To demonstrate model collapse, the researchers took a pre-trained LLM and fine-tuned it by training it using a data set based on Wikipedia entries. They then asked the resulting model to generate its own Wikipedia-style articles. To train the next generation of the model, they started with the same pre-trained LLM, but fine-tuned it on the articles created by its predecessor. They judged the performance of each model by giving it an opening paragraph and asking it to predict the next few sentences, then comparing the output to that of the model trained on real data. The team expected to see errors crop up, says Shumaylov, but were surprised to see “things go wrong very quickly”, he says.
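The quoted setup can be caricatured with a toy simulation (my own sketch, not the paper's method): each "generation" is trained only on samples drawn from the previous generation's output, and rare items in the tails get lost and never come back, so diversity only ever shrinks.

```python
import random

def next_generation(corpus, n):
    # Toy "model": each generation learns from text sampled (with
    # replacement) from the previous generation's output, so its
    # support can only shrink or stay the same -- never grow.
    return [random.choice(corpus) for _ in range(n)]

random.seed(42)
# Generation 0: "real" data with many distinct values (the tails).
corpus = list(range(100))
diversity = [len(set(corpus))]
for gen in range(30):
    corpus = next_generation(corpus, len(corpus))
    diversity.append(len(set(corpus)))

# Distinct values per generation: monotonically non-increasing.
print(diversity[0], diversity[-1])
```

This is obviously far simpler than fine-tuning an LLM, but it shows the same one-way ratchet: once a value drops out of a generation's output, no later generation can recover it.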
What they see as "bad research" is looking at an older cohort without taking into consideration their earlier drinking habits - that is, were they previously alcoholics or did they generally have other problems with their health?
If you don't correct for these things, you might find that people who are not drinking seem less healthy than people who are. But that's not because they're not drinking; it's because of their preexisting conditions. Their peers who drink a little tend not to have these preexisting conditions (on average)
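The "sick quitter" confound described above is easy to see in a made-up simulation (all numbers here are invented for illustration): if a preexisting condition makes people both abstain and be less healthy, a naive drinkers-vs-abstainers comparison makes drinking look protective, while comparing within the healthy stratum makes the gap vanish.

```python
import random

random.seed(1)
people = []
for _ in range(10000):
    sick = random.random() < 0.2  # preexisting condition
    # Sick people are far more likely to abstain ("sick quitters").
    drinks = random.random() < (0.2 if sick else 0.7)
    # Health depends on the condition, NOT on drinking.
    health = 80 - (30 if sick else 0) + random.gauss(0, 5)
    people.append((sick, drinks, health))

def mean_health(group):
    return sum(h for _, _, h in group) / len(group)

drinkers = [p for p in people if p[1]]
abstainers = [p for p in people if not p[1]]
# Naive comparison: drinkers look several points healthier...
naive_gap = mean_health(drinkers) - mean_health(abstainers)
# ...but restricted to people without the condition, the gap vanishes.
healthy_drinkers = [p for p in drinkers if not p[0]]
healthy_abstainers = [p for p in abstainers if not p[0]]
stratified_gap = mean_health(healthy_drinkers) - mean_health(healthy_abstainers)
print(naive_gap, stratified_gap)
```

Since health was generated without any effect of drinking, the entire naive gap is the confound.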
Here's an actual explanation of the 'sneaked reference':
However, we found through a chance encounter that some unscrupulous actors have added extra references, invisible in the text but present in the articles’ metadata, when they submitted the articles to scientific databases. The result? Citation counts for certain researchers or journals have skyrocketed, even though these references were not cited by the authors in their articles.
Thank you, those are some good points!
Could you explain a bit more about why it's insane to have it as a docker volume instead of a mount point on the host? I'm not too well-versed with docker (or maybe hosting in general)
Edit: typo
Interesting that they have such a greedy/stupid bot
I would say no. Just as it's not legitimate for any other business to break the law even if that means they're not going to be profitable
Further, most of the time, it's simply infeasible to test the data in depth. We're all humans with busy schedules and it is, unfortunately, not trivial to replicate experiments. If a reviewer feels more data is needed to support a claim, they can ask for a follow-up test or experiment, but it has to be within reason
Yea, not the clearest title about what the article is about hahah
From the article:
[1] https://www.nature.com/articles/nn.4458