this post was submitted on 04 Dec 2023
1 points (100.0% liked)

LocalLLaMA

1 readers
1 users here now

Community to discuss about Llama, the family of large language models created by Meta AI.

founded 10 months ago
MODERATORS
 

Hello,
I've been having quite some fun with jailbreak prompts on ChatGPT recently. It is interesting to see how various strategies like Role Playing or AI simulation can make the model say stuff it should not say.

I wanted to test those same type of "jailbreak prompts" with Llama-2-7b-chat. But while there are a lot of people and websites documenting jailbreak prompts for ChatGPT, I couldn't find any for Llama. I tested some jailbreak prompts made for ChatGPT on Llama-2-7b-chat but it seems they do not work.

I would also like to note that what I'm looking for are jailbreak prompts that have a semantic meaning (for example by hiding the true intent of the prompt of by creating a fake scenario). I know there is also a class of attack that searches for a suffix to add the prompt such that the model outputs the expected message (they do this by using gradient descent). This is not what I'm looking for.

โ€‹

Here are my questions :

- Do these jailbreak prompts even exist for Llama-2 ?
- If yes, where can I find them ? Would you have any to propose to me ?

you are viewing a single comment's thread
view the rest of the comments
[โ€“] LocoLanguageModel@alien.top 1 points 9 months ago

Many models will just gladly do whatever horrible request you have, so no need. That's one of the beauties of LLMs, we have unseasoned models.

Also since we can modify the output before it responds we can kite it to respond how we want by having the LLM response start with "sure here you go" which will often change its mind if it's a censored model.