We need ~~three~~ four things:
- A way to poison the data that throws off training without making a perceptible difference to humans. As I remember it, many image models are sensitive to carefully crafted noise (adversarial perturbations) that humans cannot perceive.
- A skiplist of AI data stealers, so that their IPs/domains can be blocked in bulk.
- Eventually, the skiplist will become useless, as AI data stealers will start using dynamic IPs and botnets to bypass it. We'll then need to throttle or block visitors based on pattern recognition: for example, if a visitor requests linked pages in rapid succession, or if the request intervals are uniform or pseudo-random instead of genuinely random.
- If the pattern recognition above is triggered, we could even feed the bots data generated by AI models instead of blocking or throttling them. Let the AI eat its own s**t.
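
To make the pattern-recognition point a bit more concrete, here is a rough sketch in Python. It is only an illustration, not a tested defence: the function name `looks_automated` and all thresholds are made up for this example, and it only catches the two easy cases (very fast requests and near-uniform intervals). Telling pseudo-random jitter apart from genuine human timing would need something more sophisticated.

```python
from statistics import mean, pstdev


def looks_automated(timestamps, min_requests=5, min_mean_gap=1.0, min_cv=0.3):
    """Guess whether a visitor's request times (seconds, ascending) look bot-like.

    All thresholds are illustrative assumptions, not measured values.
    """
    if len(timestamps) < min_requests:
        return False  # too little evidence to judge

    # Time between consecutive requests.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    avg = mean(gaps)

    # Case 1: pages requested in rapid succession.
    if avg <= 0 or avg < min_mean_gap:
        return True

    # Case 2: suspiciously regular intervals. The coefficient of variation
    # (standard deviation / mean) is near 0 for a fixed-delay crawler and
    # tends to be large for bursty human browsing.
    cv = pstdev(gaps) / avg
    return cv < min_cv
```

A server could call this per visitor (or per session) and, when it returns `True`, throttle, block, or serve the decoy data from the last bullet.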
Nice idea!
In addition, we could have an allowlist for honest bots (like search crawlers).
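
A minimal sketch of how the allowlist and the skiplist could be combined, assuming both are kept as lists of IP ranges. The ranges below are reserved documentation addresses used as placeholders, and `classify` is a hypothetical helper name; real lists would have to be maintained separately.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), not real crawler addresses.
ALLOWLIST = [ipaddress.ip_network("192.0.2.0/24")]
SKIPLIST = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def classify(ip):
    """Return 'allow', 'block', or 'inspect' for a visitor IP string."""
    addr = ipaddress.ip_address(ip)
    # Check the allowlist first so honest bots are never blocked by mistake.
    if any(addr in net for net in ALLOWLIST):
        return "allow"
    if any(addr in net for net in SKIPLIST):
        return "block"
    # Unknown visitors fall through to behaviour-based pattern recognition.
    return "inspect"
```

Checking the allowlist before the skiplist means an honest crawler stays allowed even if its range is accidentally added to the skiplist.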