this post was submitted on 10 Jan 2024
1133 points (96.5% liked)

[–] jacksilver@lemmy.world -1 points 10 months ago

I am familiar with how LLMs work and are trained. I've been using transformers for years.

The core question I'd ask is, if the copyrighted material isn't essential to the model, why don't they just train the models without that data? If it is core to the model, then can you really say they aren't derivative of that content?

I'm not saying that the models don't do something more, just that the "more" is built upon copyrighted material. In any other commercial situation, you'd have to license or get approval for the underlying content if you were packaging it up. When sampling music, for example, the output can differ greatly from the original song, but because you are building off someone else's work you must compensate them.

It's why "content laundering" is a great term. The models intermix so much data that it's hard to know whether the content originated from copyrighted material, just as money laundering makes it difficult to determine whether the money comes from illicit sources.