this post was submitted on 28 Feb 2024
Most people here don’t understand what this is saying.
Until recently we had “pure” human-generated data, verifiably so, since LLMs and image generators didn’t exist yet. Any bot-generated data was easy to filter out because it lacked sophistication.
ChatGPT and SD3 enter the chat and generate data nearly indistinguishable from human output, but with a few errors here and there. These errors, while few, are spectacular and make no sense given the training data.
Two years later, the internet is saturated with generated content. The old datasets are like gold now, since none of the new data is verifiably human.
This matters once you’ve played with local machine learning and understand how these machines “think”. If you feed an AI-generated set to an AI as training data, it learns the mistakes along with the data. With every generation, it’s as if mutations accumulate until eventually the model just produces garbage.
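To make the "mutations every generation" idea concrete, here is a toy sketch, not a real training pipeline. It assumes the "model" is just a Gaussian fitted to a handful of samples, and each new generation is fitted only to samples drawn from the previous one. The function name `simulate_collapse` and its parameters are made up for illustration.

```python
import random
import statistics


def simulate_collapse(generations=300, sample_size=10, seed=0):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit.

    Returns the standard deviation of each generation's model,
    starting with the original "human" distribution (std 1.0).
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: stand-in for the "human" data source
    history = [std]
    for _ in range(generations):
        # The new model only ever sees output from the previous model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # Refit from a small sample: an imperfect copy of an imperfect copy.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for gen in (0, 10, 50, 100, 300):
        print(f"generation {gen:3d}: std = {history[gen]:.3e}")
```

In this toy setup the spread of the fitted distribution tends to shrink toward zero over many generations, so the "model" ends up producing almost the same value every time. Real models are far more complicated, but the feedback loop has the same shape.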
Models trained on generated sets slowly but surely fail without a human touch. Now scale this concept to the net, fraction by fraction. When 50% of your dataset is machine-generated, 50% of what your new model learns from is already degraded, so the model begins to deteriorate. Do this long enough and that 50% becomes 60 to 70 and beyond.
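The drift from 50% upward can be sketched as plain arithmetic. The numbers here are assumptions picked for illustration, not measurements: the corpus starts as 50 human and 50 machine-generated units, and every round adds 20 human and 80 machine-generated units. `synthetic_share` is a hypothetical helper.

```python
def synthetic_share(rounds, human=50, synthetic=50,
                    new_human=20, new_synthetic=80):
    """Track the machine-generated share of a growing corpus.

    All quantities are illustrative units of content, not real data.
    Returns the share before any growth, then after each round.
    """
    shares = [synthetic / (human + synthetic)]
    for _ in range(rounds):
        # New content is mostly machine-generated (assumed 80 of every 100).
        human += new_human
        synthetic += new_synthetic
        shares.append(synthetic / (human + synthetic))
    return shares


if __name__ == "__main__":
    for round_number, share in enumerate(synthetic_share(3)):
        print(f"round {round_number}: {share:.1%} machine-generated")
```

Under those assumed numbers the share goes 50%, 65%, 70%, 72.5%, creeping toward the 80% share of the new content. Change the inputs and the curve changes, but as long as new content is mostly generated, the mix only moves one way.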
Human creativity and thought have yet to be replicated. These models have no human ability to be discerning, and no sleep in which to recover from errors. They simply learn imperfectly and generate new, less perfect data in a digestible form.