this post was submitted on 13 Sep 2023

58 points (100.0% liked)

AI Lie: Machines Don’t Learn Like Humans (And Don’t Have the Right To) (www.tomshardware.com)

submitted 2 years ago by RickRussell_CA@beehaw.org to c/technology@beehaw.org

Avram Piltch is the editor in chief of Tom's Hardware, and he's written a thoroughly researched article breaking down the promises and failures of LLM AIs.

[–] RickRussell_CA@beehaw.org 23 points 2 years ago* (last edited 2 years ago) (2 children)

Two things:

1. Many of these LLMs -- perhaps all of them -- have been trained on datasets that include books that were absolutely NOT released into the public domain.
2. Ethically, we would ask any author who parrots the work of others to provide citations to original references. That rarely happens with AI language models, and if they do provide citations, they often do it wrong.

[–] lily33@lemm.ee 19 points 2 years ago (2 children)

I'm sick and tired of this "parrots the works of others" narrative. Here's a challenge for you: go to https://huggingface.co/chat/, input some prompt (for example, "Write a three paragraphs scene about Jason and Carol playing hide and seek with some other kids. Jason gets injured, and Carol has to help him."). And when you get the response, try to find the author that it "parroted". You won't be able to - because it wouldn't just reproduce someone else's already made scene. It'll mesh maaany things from all over the training data in such a way that none of them will be even remotely recognizable.

[–] RickRussell_CA@beehaw.org 14 points 2 years ago (2 children)

And yet, we know that the work is mechanically derivative.

[–] keegomatic@kbin.social 12 points 2 years ago* (last edited 2 years ago) (1 children)

So is your comment. And mine. What do you think our brains do? Magic?

edit: This may sound inflammatory but I mean no offense

[–] RickRussell_CA@beehaw.org 3 points 2 years ago

No, I get it. I'm not really arguing that what separates humans from machines is "libertarian free will" or some such.

But we can properly argue that LLM output is derivative because we know it's derivative, because we designed it. As humans, we have the privilege of recognizing transformative human creativity in our laws as a separate entity from derivative algorithmic output.

[–] lily33@lemm.ee 4 points 2 years ago* (last edited 2 years ago) (1 children)

From Wikipedia, "a derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work".

You can probably call the output of an LLM 'derived', in the same way that if I counted the number of 'Q's in Harry Potter, the result would be derived from Rowling's work.

But it's not 'derivative'.

Technically it's possible for an LLM to output a derivative work if you prompt it to do so. But most of its outputs aren't.

[–] RickRussell_CA@beehaw.org 4 points 2 years ago (1 children)

"a derivative work is an expressive creation that includes major copyrightable elements of a first, previously created original work"

What was fed into the algorithm? A human decided which major copyrighted elements of previously created original work would seed the algorithm. That's how we know it's derivative.

If I take somebody's copyrighted artwork, and apply Photoshop filters that change the color of every single pixel, have I made an expressive creation that does not include copyrightable elements of a previously created original work? The courts have said "no", and I think the burden is on AI proponents to show how they fed copyrighted work into a mechanical algorithm, and produced a new expressive creation free of copyrightable elements.

[–] lily33@lemm.ee 4 points 2 years ago* (last edited 2 years ago)

I think the test for "free of copyrightable elements" is pretty simple - can you look at the new creation and recognize any copyrightable elements in it? The process by which it was created doesn't matter. Maybe I made this post entirely by copy-pasting phrases from other people, who knows (well, I didn't, only because it would be too much work), but it does not infringe either way...

[–] state_electrician@discuss.tchncs.de 0 points 2 years ago (1 children)

Well, I think that these models learn in a way similar to humans, in that it's basically impossible to tell where parts of the model came from. And as such the copyright claims are ridiculous. We need less copyright, not more. But, on the other hand, LLMs are not humans; they are tools created by and owned by corporations, and I hate to see them profiting off of other people's work without proper compensation.

I am fine with public domain models being trained on anything and being used for noncommercial purposes without being taken down by copyright claims.

[–] RickRussell_CA@beehaw.org 1 points 2 years ago

"it’s basically impossible to tell where parts of the model came from"

AIs are deterministic.

1. Train the AI on data without the copyrighted work.
2. Train the same AI on data with the copyrighted work.
3. Ask the two instances the same question.
4. The difference is the contribution of the copyrighted work.

There may be larger questions of precisely how an AI produces one answer when trained with a copyrighted work, and another answer when not trained with the copyrighted work. But we know why the answers are different, and we can show precisely what contribution the copyrighted work makes to the response to any prompt, just by running the AI twice.
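As a rough illustration of the train-twice comparison described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the model is a toy bigram counter, not an LLM, and `train_bigram`, `generate`, `base_corpus`, and `held_out_work` are hypothetical names. It is deterministic only because it has no random initialization and decodes greedily; a real LLM would need fixed seeds for initialization, data ordering, and sampling before the two runs could be compared this way, and retraining would be far more expensive.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count word-to-next-word transitions over a list of documents."""
    counts = defaultdict(Counter)
    for doc in corpus:
        words = doc.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(model, prompt, max_words=5):
    """Greedy decoding: always append the most frequent next word.

    Ties are broken alphabetically so the output is fully deterministic.
    """
    words = prompt.lower().split()
    for _ in range(max_words):
        followers = model.get(words[-1])
        if not followers:
            break  # the last word was never followed by anything in training
        best = min(followers, key=lambda w: (-followers[w], w))
        words.append(best)
    return " ".join(words)


# Hypothetical training data: a base corpus, plus one work we hold out.
base_corpus = ["the cat sat on the mat", "the dog sat on the rug"]
held_out_work = "the cat chased the red ball"

# Steps 1 and 2: train once without the work and once with it
# (the work is repeated so that it outweighs the base corpus).
model_without = train_bigram(base_corpus)
model_with = train_bigram(base_corpus + [held_out_work, held_out_work])

# Step 3: ask both instances the same question.
prompt = "the cat"
out_without = generate(model_without, prompt)
out_with = generate(model_with, prompt)

# Step 4: compare the answers.
print("without:", out_without)
print("with:   ", out_with)
print("changed:", out_without != out_with)
```

In this toy case the two outputs differ, and the difference can only come from the held-out text, since nothing else changed between the runs. Whether that difference counts as "the contribution of the copyrighted work" in any legal sense is a separate question the sketch does not answer.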

[–] RandoCalrandian@kbin.social 4 points 2 years ago (2 children)

Is there a meaningful difference between reproducing the work and giving a summary? Because I’ll absolutely be using an AI that I set up and trained myself to filter all the editorial garbage out of the news and surface what is meaningful to me, stripped of all advertising, sponsorships, and detectable bias.

[–] RickRussell_CA@beehaw.org 7 points 2 years ago (1 children)

When you figure out how to train an AI without bias, let us know.

[–] RandoCalrandian@kbin.social 6 points 2 years ago (3 children)

You’re confusing ai with chatgpt, but to answer your question: if it’s my own bias, why would I care that it’s in my personal ai? That’s kind of the point: using my personal lens (bias) to determine what info I would be interested in being alerted of

[–] RickRussell_CA@beehaw.org 7 points 2 years ago

The bias is in the AI design and the training dataset.

[–] RaleighEnt@kbin.social 6 points 2 years ago

oooh I dunno man having an AI feed you shit based on what fits your personal biases is basically what social media already does and I do not think that's something we need more of.

[–] Ilandar@aussie.zone 5 points 2 years ago

"You’re confusing ai with chatgpt"

?????????

[–] Tarte@kbin.social 5 points 2 years ago* (last edited 2 years ago)

I have yet to find an LLM that can summarize a text without errors. I already mentioned this in another post a few days back, but Google’s new search preview is driving me mad with all the hidden factual errors. They make me click only to realize that the LLM told me what I wanted to find, not what is there (wrong names, wrong dates, etc.).

I greatly prefer the old excerpt summaries over the new imaginary ones (they’re currently A/B testing).