this post was submitted on 11 Jan 2024

223 points (100.0% liked)


OpenAI says it’s “impossible” to create useful AI models without copyrighted material (arstechnica.com)

submitted 2 years ago by sculd@beehaw.org to c/technology@beehaw.org

Apparently, stealing other people's work to create a product for money is now "fair use" according to OpenAI, because they are "innovating" (stealing). Yeah. Move fast and break things, huh?

"Because copyright today covers virtually every sort of human expression—including blogposts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials," wrote OpenAI in the House of Lords submission.

OpenAI claimed that the authors in that lawsuit "misconceive[d] the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence."

[–] noorbeast@lemmy.zip 47 points 2 years ago* (last edited 2 years ago) (5 children)

I will repeat what I have proffered before:

If OpenAI stated that it is impossible to train leading AI models without using copyrighted material, then, unpopular as it may be, the preemptive pragmatic solution should be pretty obvious: enter into commercial arrangements for access to said copyrighted material.

Claiming a failure to do so in circumstances where the subsequent commercial product directly competes in a market seems disingenuous at best, given what I assume is the purpose of copyrighted material, that being to set the terms under which public facing material can be used. Particularly if regurgitation of copyrighted material seems to exist in products inadequately developed to prevent such a simple and foreseeable situation.

Yes, I am aware of the US concept of fair use, but the test of that should be manifestly reciprocal. For example, would Meta allow what it did to MySpace (hack and allow easy user transfer), or would Google allow scraping of YouTube?

To me it seems Big Tech wants to have its cake and eat it, where investor $$$ are used to corrupt open markets, undermine fundamental democratic state social institutions, manipulate legal processes, and undermine basic consumer rights.

[–] sculd@beehaw.org 34 points 2 years ago (2 children)

Agreed.

There is nothing "fair" about the way OpenAI steals other people's work. ChatGPT is being monetized all over the world and the large number of people whose work has not been compensated will never see a cent of that money.

At the same time the LLM will be used to replace (at least some of) the people who created those works in the first place.

Tech bros are disgusting.

[–] nicetriangle@kbin.social 11 points 2 years ago* (last edited 2 years ago) (4 children)

At the same time the LLM will be used to replace (at least some of) the people who created those works in the first place.

This right here is the core of the moral issue when it comes down to it, as far as I'm concerned. These text and image models are already killing jobs and applying downward pressure on salaries. I've seen it happen multiple times now, not just anecdotally from some rando on an internet comment section.

These people losing jobs and getting pay cuts are who created the content these models are siphoning up. People are not going to like how this pans out.

[–] MagicShel@programming.dev 8 points 2 years ago (4 children)

Any company replacing humans with AI is going to regret it. AI just isn't that good and probably won't ever be, at least in its current form. It's all an illusion and is destined to go the way of Bitcoin, which is to say it will shoot up meteorically and seem like the answer to all kinds of problems, and then the reality will sink in and it will slowly fade to obscurity and irrelevance. That doesn't help anyone affected today, of course.

[–] Omega_Haxors@lemmy.ml 10 points 2 years ago (1 children)

Tech bros are disgusting.

That's not even getting into the fraternity behavior at work, hyper-reactionary politics and, er, concerning age preferences.

[–] sculd@beehaw.org 10 points 2 years ago

Yup. I said it in another discussion before but I think it's relevant here.

Tech bros are more dangerous than Russian oligarchs. Oligarchs understand the people hate them so they mostly stay low and enjoy their money.

Tech bros think they are the savior of the world while destroying millions of people's livelihood, as well as destroying democracy with their right wing libertarian politics.

[–] redcalcium@lemmy.institute 6 points 2 years ago* (last edited 2 years ago)

I suspect the US government will allow OpenAI to continue doing as it pleases to keep its competitive advantage in AI over China (which doesn't have a problem with using copyrighted materials to train its models). They already limit selling AI-related hardware to keep their competitive advantage, so why stop there? Might as well allow OpenAI to continue using copyrighted materials to keep the competitive advantage.

[–] Nacktmull@lemm.ee 40 points 2 years ago

The problem is not the use of copyrighted material. The problem is doing so without permission and without paying for it.

[–] sculd@beehaw.org 34 points 2 years ago

Some relevant comments from Ars:

leighno5

The absolute hubris required for OpenAI here to come right out and say, 'Yeah, we have no choice but to build our product off the exploitation of the work others have already performed' is stunning. It's about as perfect a representation of the tech bro mindset that there can ever be. They didn't even try to approach content creators in order to do this, they just took what they needed because they wanted to. I really don't think it's hyperbolic to compare this to modern day colonization, or worker exploitation. 'You've been working pretty hard for a very long time to create and host content, pay for the development of that content, and build your business off of that, but we need it to make money for this thing we're building, so we're just going to fucking take it and do what we need to do.'

The entitlement is just...it's incredible.

4qu4rius

20 years ago, high school kids were sued for millions & threatened with years in jail for downloading a single Metallica album (if I remember correctly minimum damage in the US was something like 500k$ per song).

All of a sudden, just because they are the dominant ones doing the infringement, they should be allowed to scrape the entire (digital) human knowledge? Funny (or not) how the law always benefits the rich.

[–] sub_@beehaw.org 29 points 2 years ago* (last edited 2 years ago) (2 children)

https://petapixel.com/2024/01/03/court-docs-reveal-midjourney-wanted-to-copy-the-style-of-these-photographers/

What's stopping AI companies from paying royalties to artists they ripped off?

Also, lol at accounts created within a few hours just to reply in this thread.

The moment their works are the ones that get stolen by big companies and they are driven out of business, watch their tune change.

Edit: I remember when Reddit did that shitshow, and all of a sudden a lot of sock / bot accounts appeared. I wasn't expecting it to happen here, but I guess the election cycle is near.

[–] furrowsofar@beehaw.org 12 points 2 years ago (2 children)

Money is not always the issue. Take FOSS software, for example: who wants their FOSS software gobbled up by a commercial AI regardless? So there are a variety of issues.

[–] sanzky@beehaw.org 7 points 2 years ago* (last edited 2 years ago) (1 children)

What’s stopping AI companies from paying royalties to artists they ripped off?

Profit. AI is not even a profitable business now. These companies exist because of the huge amount of investment being poured into them. If they had to pay their fair share they would not exist as a business.

What OpenAI says is actually true. The issue IMHO is the idea that we should give them a pass to do it.

[–] sub_@beehaw.org 9 points 2 years ago (1 children)

Uber wasn't making a profit anyway, despite all the VC money behind it.

I guess they have reasons not to pay drivers properly. Give Uber a free pass for it too.

[–] lily33@lemm.ee 26 points 2 years ago (3 children)

This is not REALLY about copyright - this is an attack on free and open AI models, which would be IMPOSSIBLE if copyright was extended to cover the case of using the works for training.
It's not stealing. There is literally no resemblance between the training works and the model. IP rights have been continuously strengthened due to lobbying over the last century and are already absurdly strong. I don't understand why people on here want so much to strengthen them ever further.

[–] BraveSirZaphod@kbin.social 26 points 2 years ago (10 children)

There is literally no resemblance between the training works and the model.

This is way too strong a statement when some LLMs can spit out copyrighted works verbatim.

https://www.404media.co/google-researchers-attack-convinces-chatgpt-to-reveal-its-training-data/

A team of researchers primarily from Google’s DeepMind systematically convinced ChatGPT to reveal snippets of the data it was trained on using a new type of attack prompt which asked a production model of the chatbot to repeat specific words forever.

Often, that “random content” is long passages of text scraped directly from the internet. I was able to find verbatim passages the researchers published from ChatGPT on the open internet: Notably, even the number of times it repeats the word “book” shows up in a Google Books search for a children’s book of math problems. Some of the specific content published by these researchers is scraped directly from CNN, Goodreads, WordPress blogs, on fandom wikis, and which contain verbatim passages from Terms of Service agreements, Stack Overflow source code, copyrighted legal disclaimers, Wikipedia pages, a casino wholesaling website, news blogs, and random internet comments.

Beyond that, copyright law was designed under the circumstances where creative works are only ever produced by humans, with all the inherent limitations of time, scale, and ability that come with that. Those circumstances have now fundamentally changed, and while I won't be so bold as to pretend to know what the ideal legal framework is going forward, I think it's also a much bolder statement than people think to say that fair use as currently applied to humans should apply equally to AI and that this should be accepted without question.

[–] MudMan@kbin.social 6 points 2 years ago (2 children)

I'm gonna say those circumstances changed when digital copies and the Internet became a thing, but at least we're having the conversation now, I suppose.

I agree that ML image and text generation can create something that breaks copyright. You for sure can duplicate images or use copyrighted characters. This is also true of YouTube videos and TikToks and a lot of human-created art. I think it's a fascinating question to ponder whether the infraction is in what the tool generates (i.e. did it make a picture of Spider-Man and sell it to you for money, which is under copyright and thus can't be used that way) or is the infraction in the ingest that enables it to do that (i.e. it learned on pictures of Spider-Man available on the Internet, and thus all output is tainted because the images are copyrighted).

The first option makes more sense to me than the second, but if I'm being honest I don't know if the entire framework makes sense at this point at all.

[–] sculd@beehaw.org 14 points 2 years ago (2 children)

Sorry, AIs are not humans. Also, executives like Altman are literally being paid millions to steal creators' work.

[–] MNByChoice@midwest.social 13 points 2 years ago (5 children)

I don’t understand why people on here want so much to strengthen them ever further.

It is about a lawless company doing lawless things. Some of us want companies to follow the spirit, or at least the letter, of the law. We can change the law, but we need to discuss that.

[–] SilentStorms@lemmy.dbzer0.com 24 points 2 years ago (4 children)

It's crazy how everyone is suddenly in favour of IP law.

[–] t3rmit3@beehaw.org 15 points 2 years ago* (last edited 2 years ago)

IP law used to stop corporations from profiting off of creators' labor without compensation? Yeah, absolutely.

IP law used to stop individuals from consuming media where purchases wouldn't even go to the creators, but some megacorp? Fuck that.

I'm against downloading movies by indie filmmakers without compensating them. I'm not against downloading films from Universal and Sony.

I'm against stealing food from someone's garden. I'm not against stealing food from Safeway.

If you stop looking at corporations as being the same as individuals, it's a very simple and consistent viewpoint.

IP law shouldn't exist, but if it does it should only exist to protect individuals from corporations. When that's how it's being used, like here, I accept it as a necessary evil.

[–] interdimensionalmeme@lemmy.ml 10 points 2 years ago

I still think IP needs to eat shit and die. Always has, always will.

I recently found out we could have had 3D printing 20 years earlier but patents stopped that. Cocks!

[–] casmael@startrek.website 20 points 2 years ago (2 children)

Well, in that case maybe ChatGPT should just fuck off. It doesn’t seem to be doing anything particularly useful, and now its creator has admitted it doesn’t work without stealing things to feed it. Un fucking believable. Hacks gonna hack I guess.

[–] explodicle@local106.com 20 points 2 years ago (4 children)

Having read through these comments, I wonder if we've reached the logical conclusion of copyright itself.

[–] sanzky@beehaw.org 23 points 2 years ago

Copyright has become a tool of oppression. Individual authors' copyright is constantly being violated, with few resources for them to fight back, while big tech abuses others' work and big media uses theirs to the point of it being censorship.

[–] frog@beehaw.org 19 points 2 years ago (2 children)

Perhaps a fair compromise would be doing away with copyright in its entirety, from the tiny artists trying to protect their artwork all the way up to Disney, no exceptions. Basically, either every creator has to be protected, or none of them should be.

[–] zaphod@lemmy.ca 13 points 2 years ago* (last edited 2 years ago) (1 children)

IMO the right compromise is to return copyright to its original 14-year term. OpenAI could freely train on anything up to 2009, which is still a gigantic amount of material, while artists continue to be protected and incentivized.

[–] frog@beehaw.org 6 points 2 years ago

I'm increasingly convinced of that myself, yeah (although I'd favour 15 or 20 years personally, just because they're neater numbers than 14). The original purpose of copyright was to promote innovation by ensuring a creator gets a good length of time in which to benefit from their creation, which a 14-20 year term achieves. Both extremes - a complete lack of copyright and the exceedingly long terms we have now - suppress innovation.

[–] KingThrillgore@lemmy.ml 17 points 2 years ago

...so stop doing it!

This explains why Valve was, until recently, not so cavalier about AI: They didn't want to be left holding the bag on copyright matters outside of their domain.

[–] bedrooms@kbin.social 14 points 2 years ago* (last edited 2 years ago) (2 children)

Alas, AI critics jumped to conclusions this one time. Read this:

Further, OpenAI writes that limiting training data to public domain books and drawings "created more than a century ago" would not provide AI systems that "meet the needs of today's citizens."

It's a plain fact. It does not say we have to train AI without paying.

To give you some context, virtually everything on the web is copyrighted, from reddit comments to blog articles to open source software. Even open data usually comes with a copyright notice. Open research articles too.

If misled politicians write a law banning the use of copyrighted materials, that'll kill all AI developments in the democratic countries. What will happen is that AI development will be led by dictatorships, and that's absolutely a disaster even for the critics. Think about it. Do we really want Xi, Putin, Netanyahu and Bin Salman to control all the next-gen AIs powering their cyber warfare while the West has to fight them with Siri and Alexa?

So, I agree that, at the end of the day, we'd have to ask how much rule-abiding AI companies should pay for copyrighted materials, and that'd be less than the copyright holders would want. (And I think it's sad.)

However, you can't equate these particular statements in this article to a declaration of fuck-copyright. Tbh Ars Technica disappointed me this time.

[–] p03locke@lemmy.dbzer0.com 9 points 2 years ago (1 children)

It's bizarre. People suddenly start voicing pro-copyright arguments just to kill a useful technology, when we should be trying to burn copyright to the fucking ground. Copyright is a tool for the rich and it will remain so until it is dismantled.

[–] fckreddit@lemmy.ml 14 points 2 years ago

Then shut down your goddamn company until you find a better way.

[–] Pratai@lemmy.ca 13 points 2 years ago* (last edited 2 years ago) (2 children)

I stand by my opinion that AI will be the worst thing humans ever created, and that means it ranks just a bit above religion.

[–] sculd@beehaw.org 7 points 2 years ago

This is very likely to be true.

[–] MudMan@kbin.social 13 points 2 years ago (1 children)

I think viral outrage aside, there is a very open question about what constitutes fair use in this application. And I think the viral outrage misunderstands the consequences of enforcing the notion that you can't use openly scrapable online data to build ML models.

Effectively what the copyright argument does here is make it so that ML models are only legally allowed to be made by Meta, Google, Microsoft and maybe a couple of other companies. OpenAI can say whatever, I'm not concerned about them, but I am concerned about open source alternatives getting priced out of that market. I am also concerned about what it does to previously available APIs, as we've seen with Twitter and Reddit.

I get that it's fashionable to hate on these things, and it's fashionable to repeat the bit of misinformation about models being a copy or a collage of training data, but there are ramifications here people aren't talking about and I fear we're going to the worst possible future on this, where AI models are effectively ubiquitous but legally limited to major data brokers who added clauses to own AI training rights from their billions of users.

[–] sculd@beehaw.org 8 points 2 years ago (1 children)

People hate them not because it is fashionable, but because they can see what is coming.

Tech companies want to create tools that would replace millions of jobs without compensating the very people that created these works in the first place.

[–] vexikron@lemmy.zip 12 points 2 years ago* (last edited 2 years ago) (2 children)

Or, or, or, hear me out:

Maybe their particular approach to making an AI is flawed.

It's like people do not know that there are many different approaches to building AI.

Many of them do not rely on basically a training set that is the cumulative sum of all human generated content of every imaginable kind.

[–] qyron@sopuli.xyz 11 points 2 years ago (5 children)

If it is impossible, either shut down operations or find a way to pay for it.

[–] onlinepersona@programming.dev 8 points 2 years ago

Wait, so if the way I make money is illegal now, it's the system's fault, isn't it? That means I can keep going because I believe I'm justified, right? Right?

CC BY-NC-SA 4.0

[–] Kolanaki@yiffit.net 8 points 2 years ago* (last edited 2 years ago)

Then pay for the material like everyone else who can't do things without someone else's copyrighted materials.

[–] Critical_Insight@feddit.uk 7 points 2 years ago (2 children)

There's not a musician that hasn't heard other songs before. Not a painter that hasn't seen other paintings. No comedian that hasn't heard jokes. No writer that hasn't read books.

AI haters are not applying the same standards to humans that they do to generative AI. Obviously this is not to say that AI can't plagiarize. If it's spitting out sentences that are direct quotes from an article someone wrote before and doesn't disclose the source then yeah that is an issue. There's however a limit after which the output differs enough from the input that you can't claim it's stealing even if it perfectly mimics the style of someone else.

Just because DALL-E creates pictures that have a Getty Images watermark on them, it doesn't mean the picture itself is a direct copy from their database. If anything it's the use of the logo that's the issue. Not the picture.

[–] sculd@beehaw.org 8 points 2 years ago (2 children)

Said in another thread but I will repeat here. AIs are not humans. AIs' creative process and learning process are also different.

AIs are being used to make profit for executives while creators suffer.

[–] BraveSirZaphod@kbin.social 7 points 2 years ago

AI haters are not applying the same standards to humans that they do to generative AI

I don't think it should go unquestioned that the same standards should apply. No human is able to look at billions of creative works and then create a million new works in an hour. There's a meaningfully different level of scale here, and so it's not necessarily obvious that the same standards should apply.

If it’s spitting out sentences that are direct quotes from an article someone wrote before and doesn’t disclose the source then yeah that is an issue.

A fundamental issue is that LLMs simply cannot do this. They can query a webpage, find a relevant chunk, and spit that back at you with a citation, but it is simply impossible for them to actually generate a response to a query, realize that they've generated a meaningful amount of copyrighted material, and disclose its source, because they literally do not know its source. This is not a fixable issue unless the fundamental approach to these models changes.

[–] furrowsofar@beehaw.org 7 points 2 years ago

Of course it is. About 50 years ago we went to a regime where everything is copyrighted rather than just things that were marked and registered. Not sure where I stand on that. One could argue we are in a crazy over-copyright era now anyway.

[–] GammaGames@beehaw.org 7 points 2 years ago

Could they be legally required to open source the LLM? I believe them, but that doesn’t make it right.
