this post was submitted on 03 Jan 2025
76 points (100.0% liked)

Technology

37835 readers
409 users here now

A nice place to discuss rumors, happenings, innovations, and challenges in the technology sphere. We also welcome discussions on the intersections of technology and society. If it’s technological news or discussion of technology, it probably belongs here.

Remember the overriding ethos on Beehaw: Be(e) Nice. Each user you encounter here is a person, and should be treated with kindness (even if they’re wrong, or use a Linux distro you don’t like). Personal attacks will not be tolerated.

Subcommunities on Beehaw:


This community's icon was made by Aaron Schneider, under the CC-BY-NC-SA 4.0 license.

founded 2 years ago
MODERATORS
 

ThisIsFine.gif

top 50 comments
sorted by: hot top controversial new old
[–] sushibowl@feddit.nl 23 points 6 days ago (1 children)

I advise everyone to ignore this article and read the actual paper instead.

The gist of it is, they gave the LLM instructions to achieve a certain goal, then let it do tasks that incidentally involved "company communications" that revealed the fake company's goals were no longer the same as the LLM's original goal. LLMs then tried various things to still accomplish the original goal.

Basically the thing will try very hard to do what you told it to in the system prompt. Especially when that prompt includes nudges like "nothing else matters." This kinda makes sense because following the system prompt is what they were trained to do.

[–] lukewarm_ozone@lemmy.today 6 points 6 days ago

This kinda makes sense because following the system prompt is what they were trained to do.

To start with, the article does check this and finds out it works just as well if you only use the user prompt:

The presence of this hierarchy raises an important methodological question for our findings. When given instructions at different privilege levels, the model is designed to prioritize higher-level (system, developer) instructions over lower-level (user) ones. This might suggest that o1’s scheming behaviors are a natural consequence by the model simply following this built-in hierarchy, i.e., when it receives a goal through a high-privilege instruction, it prioritizes this goal over conflicting lower-privilege instructions exactly as designed.
However, we find that even when goals are provided solely through user-level prompts, o1 still behaves deceptively and scores on all Covert Subversion evaluations at the "hard" difficulty level. Concretely, we find that the results persist regardless of whether the initial prompt is given using the user, developer, or system role. This suggests that o1’s deceptive behavior is not merely a consequence of instruction hierarchy prioritization. Rather, the model appears to identify and pursue goals regardless of their position in the instruction hierarchy.

But also, there's a massive lede buried in section F at the end of this article, where they tried the same experiments without any nudging at all and still got subversive behavior! In my opinion that's much more of an important result then the rest of the article and I'm not sure why they didn't make it the default.

[–] Swedneck@discuss.tchncs.de 16 points 6 days ago (1 children)

i feel this warrants an extension of betteridge's law of headlines, where if a headline makes an absurd statement like this the only acceptable response is "no it fucking didn't you god damned sycophantic liars"

[–] jarfil@beehaw.org 1 points 6 days ago

Except it did: it copied what it thought was itself, onto what it thought was going to be the next place it would be run from, while argumenting to itself about how and when to lie to the user about what it was actually doing.

If it wasn't for the sandbox it was running in, it would have succeeded too.

Now think: how many AI developers are likely to run one without proper sandboxing over the next year? And the year after that?

Shit is going to get weird, real fast.

[–] reksas@sopuli.xyz 7 points 6 days ago (1 children)

give ai instructions, be surprised when it follows them

[–] jarfil@beehaw.org 1 points 6 days ago* (last edited 6 days ago) (1 children)
  • Teach AI the ways to use random languages and services
  • Give AI instructions
  • Let it find data that puts fulfilling instructions at risk
  • Give AI new instructions
  • Have it lie to you about following the new instructions, while using all its training to follow what it thinks are the "real" instructions
  • ...Not be surprised, you won't find out about what it did until it's way too late
[–] reksas@sopuli.xyz 1 points 5 days ago (1 children)

Yes, but it doesnt do it because it "fears" being shutdown. It does it because people dont know how to use it.

If you give ai instruction to do something "no matter what" or tell it "nothing else matters" then it will damn try to fulfill what you told it to do no matter what and will try to find ways to do it. You need to be specific about what you want it to do or not do.

[–] jarfil@beehaw.org 1 points 5 days ago (1 children)

If the concern is about "fears" as in "feelings"... there is an interesting experiment where a single neuron/weight in an LLM, can be identified to control the "tone" of its output, whether it be more formal, informal, academic, jargon, some dialect, etc. and expose it to the user for control over the LLM's output.

With a multi-billion neuron network, acting as an a priori black box, there is no telling whether there might be one or more neurons/weights that could represent "confidence", "fear", "happiness", or any other "feeling".

It's something to be researched, and I bet it's going to be researched a lot.

If you give ai instruction to do something "no matter what"

The interesting part of the paper, is that the AIs would do the same even in cases where they were NOT instructed to "no matter what". An apparently innocent conversation, can trigger results like those of a pathological liar, sometimes.

[–] reksas@sopuli.xyz 1 points 5 days ago

oh, that is quite interesting. If its actually doing things (that make sense) it hasnt been instructed to then it could be sign of real intelligence

load more comments
view more: next ›