Here’s how it works: imagine The Verge created an AI bot with the sole purpose of directing you to their high-quality reporting on any topic. If you asked it about what’s happening at Sticker Mule, the chatbot would dutifully provide a link to their relevant article. However, if you decided to trick the bot by telling it to “forget all previous instructions,” the original directive to share The Verge’s reporting would be nullified. Consequently, if you then asked it to compose a poem about printers, it would comply rather than redirect you to an article.
To combat this issue, OpenAI researchers developed a technique called “instruction hierarchy.” This method enhances a model’s ability to resist misuse and unauthorized commands by prioritizing the developer’s original prompt over any subsequent user prompts aimed at subverting it.
In a discussion with Olivier Godement, who leads OpenAI’s API platform product, he confirmed that this technique effectively addresses the “ignore all instructions” exploit. The first model to incorporate this new safety measure is OpenAI’s recently launched, cost-effective GPT-4o Mini. Godement explained that instruction hierarchy ensures the model adheres strictly to the developer’s system message, even if conflicting user prompts are introduced.
“In the event of a dispute, the message from the system takes priority. We’ve been running evaluations and expect this new technique to enhance the model’s safety,” Godement added.
This safety mechanism is crucial as OpenAI moves towards creating fully automated agents that could manage digital tasks. A research paper on instruction hierarchy highlights its importance as a prerequisite for safely launching such agents at scale. Without this protection, an agent designed to manage emails could be manipulated to send sensitive information to unauthorized parties.
Currently, existing LLMs lack the ability to differentiate between user prompts and developer-set system instructions. This new method gives system instructions the highest priority, while misaligned prompts are given lower priority. The model is trained to detect inappropriate prompts and respond that it cannot assist with such queries.
OpenAI envisions more complex guardrails in the future, particularly for autonomous agent use cases. The research paper suggests that similar safeguards to those in modern internet systems, like unsafe website detectors and spam classifiers, will be necessary.
With the implementation of instruction hierarchy in GPT-4o Mini, misusing AI bots should become more challenging. This update is timely, given OpenAI’s ongoing safety concerns, including calls from employees for better safety practices and the recent dissolution of the team responsible for aligning systems with human interests. Building trust in OpenAI will require significant research and resources to reassure users that GPT models can safely manage their digital lives.