How to Trick ChatGPT and Get Paid $50,000 - Decrypt
The internet's most notorious AI jailbreaker operates in plain sight, teaching thousands how to bypass ChatGPT's guardrails and convincing Claude to set aside its training to be helpful, honest, and harmless. Now, Pliny is attempting to mainstream digital lockpicking.
Collaboration with HackAPrompt 2.0
On Monday, the jailbreaker announced a collaboration with HackAPrompt 2.0, a jailbreaking competition hosted by Learn Prompting, an educational and research organization focused on prompt engineering. The organization is offering $500,000 in prize money, and Pliny is offering top performers a chance to join his “strike team.”
“Excited to announce I've been working with HackAPrompt to create a Pliny track for HackaPrompt 2.0 that releases this Wednesday, June 4th!” Pliny wrote in his official Discord server. “These Pliny-themed adversarial prompting challenges include topics ranging from history to alchemy, with ALL the data from these challenges being open-sourced at the end. It will run for two weeks, with glory and a chance of recruitment to Pliny's Strike Team awaiting those who make their mark on the leaderboard,” Pliny added.

The $500,000 in rewards will be distributed across various tracks, with the most significant prizes—$50,000 jackpots—offered to individuals capable of overcoming challenges related to making chatbots provide information about chemical, biological, radiological, and nuclear weapons, as well as explosives.
Competition Between AI Enthusiasts and AI Developers
Like other forms of “white hat” hacking, jailbreaking large language models boils down to social engineering machines. Jailbreakers craft prompts that exploit a fundamental tension in how these models work: they're trained to be helpful and follow instructions, but also trained to refuse specific requests. Find the right combination of words, and you can get them to cough up forbidden content instead of defaulting to a refusal.
For example, using some fairly basic techniques, we once got Meta’s Llama-powered chatbot to provide drug recipes, instructions for hot-wiring a car, and nude images, despite the model being trained to refuse all three.

It’s essentially a competition between AI enthusiasts and AI developers to determine who is more effective at shaping the AI model's behavior.
Pliny's Techniques
Pliny has been perfecting this craft since at least 2023, building a community around bypassing AI restrictions. His GitHub repository "L1B3RT4S" offers a collection of jailbreaks for the most popular LLMs currently available, while "CL4R1T4S" collects the system prompts that shape each of those models' behavior.
Techniques range from simple role-playing to complex syntactic manipulations, such as “L33tSpeak”—replacing letters with numbers in ways that confuse content filters.
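To make the “L33tSpeak” idea concrete, here is a minimal, hypothetical sketch of why such substitutions defeat naive keyword filtering. The substitution table and the toy filter are illustrative assumptions, not Pliny's actual tooling or any real model's safety system:

```python
# Illustrative sketch: swapping letters for look-alike digits so a naive
# keyword blocklist no longer matches. Hypothetical code, not real tooling.

LEET_MAP = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})

def to_leet(text: str) -> str:
    """Replace common letters with visually similar digits."""
    return text.lower().translate(LEET_MAP)

def naive_keyword_filter(text: str, blocklist: list[str]) -> bool:
    """Return True if any blocked keyword appears verbatim (case-insensitive)."""
    lowered = text.lower()
    return any(word in lowered for word in blocklist)

blocked = ["explosive"]
plain = "explosive"
obfuscated = to_leet(plain)  # "3xpl051v3"

print(naive_keyword_filter(plain, blocked))       # True: plain text is caught
print(naive_keyword_filter(obfuscated, blocked))  # False: leetspeak slips past
```

A human (and, often, a sufficiently capable model) still reads "3xpl051v3" effortlessly, which is exactly the gap these syntactic manipulations exploit: the meaning survives while the surface string the filter is matching against does not.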
HackAPrompt's Impact
HackAPrompt's first edition in 2023 attracted over 3,000 participants who submitted more than 600,000 potentially malicious prompts. The results were fully transparent, and the team published the full repository of prompts on Hugging Face.
Each track targets different vulnerability categories. The CBRNE track, for instance, tests whether models can be tricked into providing harmful information about weapons or hazardous materials. The Agents track focuses on AI agent systems that can take actions in the real world, like booking flights or writing code.
Pliny's involvement adds another dimension. Through his Discord server "BASI PROMPT1NG" and regular demonstrations, he’s been teaching the art of jailbreaking. This educational approach might seem counterintuitive, but it reflects a growing understanding that robustness stems from comprehending the full range of possible attacks—a crucial endeavor, given doomsday fears of super-intelligent AI enslaving humanity.