this post was submitted on 12 Apr 2024
1001 points (98.5% liked)
Technology
59596 readers
4977 users here now
This is a most excellent place for technology news and articles.
Our Rules
- Follow the lemmy.world rules.
- Only tech related content.
- Be excellent to each another!
- Mod approved content bots can post up to 10 articles per day.
- Threads asking for personal tech support may be deleted.
- Politics threads may be removed.
- No memes allowed as posts, OK to post as comments.
- Only approved bots from the list below, to ask if your bot can be added please contact us.
- Check for duplicates before posting, duplicates may be removed
Approved Bots
founded 1 year ago
MODERATORS
you are viewing a single comment's thread
view the rest of the comments
view the rest of the comments
No. There's only model collapse (the term for this in academia) if literally all the content is synthetic.
In fact, a mix of synthetic and human generated performs better than either/or.
Which makes sense, as the collapse is a result of distribution edges eroding, so keeping human content prevents that, but then the synthetic content is increasingly biased towards excellence in more modern models, so the overall data set has an improved median oven the human only set. Best of both worlds.
Using Gab as an example, you can see from other comments that in spite of these instructions the model answers are more nuanced and correct than Gab posts. So if you only had Gab posts you'd have answers from morons, and the synthetic data is better. But if you only had synthetic data, it wouldn't know what morons look like to avoid those answers and develop nuance around them.