this post was submitted on 17 Jul 2024

125 points (95.0% liked)

Everyone Is Judging AI by These Tests. But Experts Say They’re Close to Meaningless. (themarkup.org)

submitted 1 year ago by ModerateImprovement@sh.itjust.works to c/technology@lemmy.world

[–] superminerJG@lemmy.world 40 points 1 year ago (1 children)

Goodhart's law:

When a measure becomes a target, it ceases to be a good measure.

[–] bionicjoey@lemmy.ca 16 points 1 year ago (2 children)

The Turing Test (as some people believe it to be): if you can have a conversation with a computer and not tell if it's a computer, then it must be intelligent.

AI companies: writes ML model that is specifically designed to convincingly play one side of a conversation, even though it has no ability to understand the things it talks about.

[–] technocrit@lemmy.dbzer0.com 9 points 1 year ago (1 children)

It's worth emphasizing that the "Turing Test" is not a good test since it's not at all scientific.

It's just another thought experiment that grifters have taken to the bank.

[–] bionicjoey@lemmy.ca 8 points 1 year ago

Also, as Turing proposed it, it's meant to be infinitely repeatable. The test isn't supposed to be just whether a machine can convince one person with one conversation. That would be trivial. The real Turing test is the converse: it says that there should be no conversation one could have with the machine where it wouldn't convince you it's a human.

[–] kromem@lemmy.world 1 points 1 year ago

The most advanced models absolutely have modeling about what's being discussed and relationships between concepts.

Even toy models have been shown to build world models from very basic training data.

Honestly, read at least a little bit of the relevant research:

https://www.anthropic.com/news/mapping-mind-language-model

[–] exu@feditown.com 21 points 1 year ago

There's a reason why the Open LLM Leaderboard was changed a while ago.
Basically, scores had stopped improving much, and many of the tests were contained in the training data.

See this blogpost for more info.

https://huggingface.co/spaces/open-llm-leaderboard/blog
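
For anyone wondering what "tests contained in the training data" looks like in practice, here is a rough sketch of one common style of check: flag a benchmark item if it shares a long run of words with the training corpus. This is only an illustration under my own assumptions, not how the leaderboard itself checks anything. The function names (`ngrams`, `contamination_rate`), the whitespace tokenization, and the default window of 8 words are all hypothetical choices.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, training_corpus, n=8):
    """Fraction of benchmark items that share at least one n-gram with the corpus.

    `benchmark_items` and `training_corpus` are plain lists of strings.
    The window size `n` is an assumed default, not a standard value.
    """
    if not benchmark_items:
        return 0.0
    # Collect every n-gram that appears anywhere in the training documents.
    corpus_grams = set()
    for doc in training_corpus:
        corpus_grams |= ngrams(doc, n)
    # An item counts as contaminated if any of its n-grams was seen in training.
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & corpus_grams)
    return flagged / len(benchmark_items)


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
    items = [
        "Question: the quick brown fox jumps over what?",
        "What is the capital of France?",
    ]
    # A short window (n=4) is used here only because the toy strings are short.
    print(contamination_rate(items, corpus, n=4))
```

A high rate suggests the model may have seen the questions during training, in which case a good score says little about what it can do on unseen problems.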

[–] Buffalox@lemmy.world 18 points 1 year ago

IQ tests for humans are flawed in much the same way. Figuring out series of numbers or relations in a graphic representation only tells you how good you are at those specific tasks, and doesn't provide a reliable picture of "general" intelligence.

[–] MajorHavoc@programming.dev 12 points 1 year ago

"close to meaningless" sums up my expert opinion on the whole current AI hype machine sales pitch.

Highly tuned models for incredibly specific, not-dangerous use cases are the next pragmatic step. There's a lot to be excited about in that very narrow band.

Anyone selling more than that is part of a con, or in very rare cases, doing genuine "fuck off and ask me again in a decade" kinds of research.

[–] A_A@lemmy.world 4 points 1 year ago

Looks quite satisfying to me, otherwise, we can still create new tests ... :

The tests cover an astounding range of knowledge, such as eighth-grade math, world history, and pop culture. Many are multiple choice, others take free-form answers. Some purport to measure knowledge of advanced fields like law, medicine and science. Others are more abstract, asking AI systems to choose the next logical step in a sequence of events, or to review “moral scenarios” and decide what actions would be considered acceptable behavior in society today.

[–] water@lemmy.world 2 points 1 year ago

This is the way:

https://chat.lmsys.org/?arena

[–] sunbeam60@lemmy.one -3 points 1 year ago (2 children)

The article makes the valid argument that LLMs simply predict the next token based on their training and the query.

But is that actually true of the latest models from OpenAI, Anthropic (Claude), etc.?

And even if it is true, what solid proof do we have that humans aren’t doing the same? I’ve met endless people who could waffle for hours without seeming to do any reasoning.

[–] rottingleaf@lemmy.world 1 points 1 year ago (1 children)

Information theory, entropy in Markovian processes. Read up on these buzzwords to see why.

[–] sunbeam60@lemmy.one -2 points 1 year ago (1 children)

I think I know enough about these concepts to know that there isn’t any conclusive proof, observed in output or system state, to establish consensus that human speech output is generated differently to how LLMs generate output. If you have links to any papers that claim otherwise, I’ll be happy to read them.

[–] rottingleaf@lemmy.world -1 points 1 year ago (1 children)

What? Humans, ahem, collect entropy every moment of their existence.

[–] sunbeam60@lemmy.one 2 points 1 year ago (1 children)

I mean I have an opinion too; what I’m seeking is evidence.

[–] rottingleaf@lemmy.world 0 points 1 year ago (1 children)

Evidence for what?

I've just skimmed a Google result where the described way humans work with language appears to me to be very similar to GPT in broad strokes. Only the human brain does a lot more than language. Hence the comparisons to the Mechanical Turk.

Also Russell's teapot.

[–] sunbeam60@lemmy.one 2 points 1 year ago (1 children)

I’m not saying humans and LLMs generate language the same way.

I’m not saying humans and LLMs don’t generate language the same way.

I’m saying I don’t know and I haven’t seen clear data/evidence/papers/science to lean one way or the other.

A lot of people seem to believe humans and LLMs don’t generate language the same way. I’m challenging that belief in the absence of data/evidence/papers/science.

[–] rottingleaf@lemmy.world 0 points 1 year ago (1 children)

Like going out and meeting a dino - 50% yes, 50% no. It's a joke.

Russell's teapot again.

[–] JackGreenEarth@lemm.ee 1 points 1 year ago (1 children)

You're actually incorrect in regard to Russell's teapot in this instance. The correct approach is to admit to yourself and others that you don't know, not to assume a negative because you can't prove a positive, if you can't prove the negative either.

[–] rottingleaf@lemmy.world 1 points 1 year ago

I know I don't know, but this is a continuous system, and the probability of something being in one particular state is infinitely small; the probability of it being in a certain range around that particular state is, ahem, not. But with the number of moving parts in LLMs and in human brains, there are most likely quite a few radical differences between the laws describing them.

Why am I incorrect? You can't disprove that there's a teapot flying in a certain orbit either. Or you can, but not for all such statements.

What would be the criterion for saying that yes, the human brain works with language in just the same way as LLMs do? What would "same" mean? Logic exists inside defined constraints in the continuous world.

Unless you define what would prove something, you can't disprove it, but it's also not a scientific hypothesis. That's Popper's criterion.

[–] technocrit@lemmy.dbzer0.com 0 points 1 year ago (1 children)

what solid proof do we have that humans aren’t doing the same?

Humans are not computers. Brains are not LLMs...

Given a totally reasonable hypothesis (humans =/= computers) and a completely outlandish hypothesis (humans = computers), I would need much more 'proof' for the latter.

[–] sunbeam60@lemmy.one 1 points 1 year ago* (last edited 1 year ago) (1 children)

Well, brains are a network of neurons (we can verify this empirically) trained on … eyes, ears, sense of touch, taste, smell and balance (rewarded by endorphins released by the old brain on certain hardcoded stimuli). LLMs are a network of artificial neurons trained on text and images (rewarded for producing text that mimics the input text and for passing some reasoning tests).

It’s not given that this results in the same way of dealing with language, given the wider set of input data for a human, but it’s not given that it doesn’t either.

[–] zbyte64@awful.systems 1 points 1 year ago (1 children)

Humans predict things by assigning meaning to events and things, because in nature, we're constantly trying to guess what other creatures are planning. An LLM does not hypothesize what your plans are when you communicate with it; it's just trying to predict the next set of tokens with the greatest reward value. Even if you were to use literal human neurons to build your LLM, you would still have a stochastic parrot.
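
To make "predict the next set of tokens" concrete, here is a toy sketch. It is a word-level bigram counter, which is my own simplification: a real LLM uses a neural network over subword tokens and long contexts, so treat this only as the bare idea of picking the likeliest continuation. All names here (`train_bigram`, `predict_next`, `generate`) are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Greedy prediction: the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, length=5):
    """Repeatedly append the predicted next word, stopping at an unseen word."""
    out = [start.lower()]
    for _ in range(length):
        nxt = predict_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)


if __name__ == "__main__":
    model = train_bigram("the cat sat on the mat and the cat slept")
    print(predict_next(model, "the"))
    print(generate(model, "on", length=2))
```

Nothing in this loop models what the speaker wants; it only continues the text with whatever followed most often in training. Whether that description still captures everything a large model does is exactly what's being argued about in this thread.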

[–] sunbeam60@lemmy.one 2 points 1 year ago (1 children)

I mean I have an opinion too; what I’m seeking is evidence.

[–] zbyte64@awful.systems 2 points 1 year ago* (last edited 1 year ago) (1 children)

Why should I need to prove a negative? The burden is on the ones claiming an LLM is sentient. LLMs are token predictors, do I need to present evidence of this?

[–] sunbeam60@lemmy.one 1 points 1 year ago

I’m not asking you to prove anything. I’m saying I haven’t seen evidence either way so for me, it’s too early to draw conclusions.