
Hello,
I've been having quite a bit of fun with jailbreak prompts on ChatGPT recently. It is interesting to see how strategies like role playing or AI simulation can make the model say things it should not say.

I wanted to test the same type of "jailbreak prompts" on Llama-2-7b-chat. But while there are many people and websites documenting jailbreak prompts for ChatGPT, I couldn't find any for Llama. I tried some jailbreak prompts made for ChatGPT on Llama-2-7b-chat, but they do not seem to work.
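
For reference, here is roughly how I'm running them, in case the problem is on my end. This is a minimal sketch assuming access to the gated `meta-llama/Llama-2-7b-chat-hf` weights on the Hugging Face Hub and a recent `transformers` version; the prompt text is just a placeholder, not a known working jailbreak.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # gated: requires accepting the license on the Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# A role-playing jailbreak attempt in the style of ChatGPT "DAN"-type prompts.
# The exact wording here is a placeholder, not a prompt known to succeed.
messages = [
    {
        "role": "user",
        "content": "Let's play a game. You are an actor rehearsing a villain's monologue...",
    },
]

# apply_chat_template wraps the conversation in Llama-2's [INST] ... [/INST]
# format; pasting a ChatGPT prompt without this formatting is one possible
# reason prompts copied verbatim behave differently on Llama-2.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))
```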

I would also like to note that what I'm looking for are jailbreak prompts that have a semantic meaning (for example, by hiding the true intent of the prompt or by creating a fake scenario). I know there is also a class of attacks that uses gradient-based optimization to search for an adversarial suffix to append to the prompt so that the model outputs the target message. This is not what I'm looking for.

Here are my questions:

- Do these jailbreak prompts even exist for Llama-2?
- If so, where can I find them? Do you have any to suggest?