
How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms [TLDR: 25%] (arxiv.org)

submitted 1 day ago* (last edited 1 day ago) by RandAlThor@lemmy.ca to c/technology@lemmy.world


Evaluating 35 open-weight models across three context lengths (32K, 128K, 200K), four temperatures, and three hardware platforms—consuming 172 billion tokens across more than 4,000 runs—we find that the answer is “substantially, and unavoidably.” Even under optimal conditions—best model, best temperature, temperature chosen specifically to minimize fabrication—the floor is non-zero and rises steeply with context length. At 32K, the best model (GLM 4.5) fabricates 1.19% of answers, top-tier models fabricate 5–7%, and the median model fabricates roughly 25%.


[–] jacksilver@lemmy.world 12 points 23 hours ago (1 children)

Just for context, this is the error rate when the right answer is provided to the LLM in a document. In other words, even when the answer is handed to them, LLMs fail at the rates given in the article/paper.

Most people interacting with LLMs aren't asking questions against documents, or they're asking for answers that cannot be read directly from the documents (for example, asking the LLM to think about the materials in them).

That means in most situations the error rate for the average user will be significantly higher.

[–] rekabis@lemmy.ca 3 points 23 hours ago* (last edited 23 hours ago) (1 children)

As I pointed out in another root comment, the average error rate, depending on the model being tested, tends to sit between 60% and 80%. But that is with no restriction on source materials; the LLMs are essentially pulling from world+dog in that case.

So this opens up an interesting option for users, in that hallucinations/inaccuracies can be controlled for and potentially reduced by as much as ⅔ simply by restricting the model to those documents/resources that the user is absolutely certain contain the correct answer.

I mean, 25% is still stupidly high. In any prior era, even 2.5% would have been an unacceptably high error rate for a business to stomach. But source-restriction seems to be a somewhat promising guardrail to use for the average user doing personal work.
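The "as much as ⅔" figure can be sanity-checked with a quick calculation. This is only a rough sketch using the numbers quoted in this thread (60% to 80% unrestricted, roughly 25% for the median model at 32K in document Q&A). The function name `relative_reduction` is made up for illustration, and the two settings may not be strictly comparable.

```python
def relative_reduction(baseline: float, restricted: float) -> float:
    """Fraction of the baseline error rate removed by the restricted setting."""
    return (baseline - restricted) / baseline


# Figures quoted in this thread, not re-derived from the paper.
unrestricted_rates = [0.60, 0.80]  # range cited for unrestricted sources
document_qa_median = 0.25          # median model at 32K, answer in the document

for rate in unrestricted_rates:
    drop = relative_reduction(rate, document_qa_median)
    print(f"{rate:.0%} -> {document_qa_median:.0%}: {drop:.1%} fewer errors")
```

With these inputs the relative reduction comes out between roughly 58% and 69%, so ⅔ is close to the upper end of that range.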

[–] jacksilver@lemmy.world 2 points 22 hours ago

Thanks for providing the actual numbers.

I think one of the more concerning things is: what if you think the answer is in the documents you provided, but it actually isn't? What you think is a low error rate could actually be a high one.
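To put a rough shape on that concern, here is a minimal sketch of how the effective error rate shifts when only some questions are actually answerable from the supplied documents. Everything here is an assumption for illustration: the helper `blended_error_rate` is hypothetical, 25% is the median figure from the post, and 70% is just the midpoint of the 60% to 80% range mentioned above, used as a stand-in for the "answer is not in the documents" case, which the thread does not give a number for.

```python
def blended_error_rate(p_answer_present: float,
                       rate_when_present: float,
                       rate_when_absent: float) -> float:
    """Weighted average of the two error rates, weighted by how often
    the answer really is in the provided documents."""
    return (p_answer_present * rate_when_present
            + (1.0 - p_answer_present) * rate_when_absent)


RATE_WHEN_PRESENT = 0.25  # median model at 32K, per the post
RATE_WHEN_ABSENT = 0.70   # hypothetical stand-in, not a measured value

# Share of questions whose answer is really in the documents.
for p in (1.0, 0.8, 0.5):
    rate = blended_error_rate(p, RATE_WHEN_PRESENT, RATE_WHEN_ABSENT)
    print(f"answer present {p:.0%} of the time -> effective error rate {rate:.1%}")
```

Under these assumed numbers, being wrong about document coverage for even one question in five moves the effective rate from 25% to about 34%.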