TL;DR: A leak has revealed the complete system prompt of Anthropic's AI model Claude 3.7 Sonnet, exposing precise technical and behavioral details. This raises questions about the robustness of the security mechanisms that protect a model's internal instructions, and about the balance between performance, controllability, transparency, and security.
Last week, a leak revealed the full system prompt of the hybrid reasoning model Claude 3.7 Sonnet, introduced last February by Anthropic. With an unusual length of 24,000 tokens, the prompt precisely describes the expected behaviors of the model, the tags it uses, the authorized tools, and the stance to adopt towards users.

A Rare Insight into the "Guts" of AI
The content of the prompt, which can be found on GitHub, goes far beyond simple technical configuration. It details precise behavioral guidelines: adopting a nuanced stance, avoiding taking sides on sensitive topics, using Markdown format for code snippets, and explaining its reasoning step by step when relevant. It also contains filtering mechanisms and XML tags designed to organize Claude's responses for specific use cases.
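To make the notion of a system prompt concrete, here is a minimal sketch, in Python, of how such behavioral instructions are supplied to a model through Anthropic's Messages API. The prompt text below is an illustrative placeholder inspired by the guidelines listed above, not an excerpt from the leaked document, and the model identifier is the one in use at the time of writing.

```python
# Minimal sketch: supplying a system prompt via Anthropic's Messages API.
# The prompt text is an illustrative placeholder, NOT the leaked prompt;
# the prompt revealed by the leak runs to roughly 24,000 tokens.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = """\
You are a helpful assistant.
- Adopt a nuanced stance and avoid taking sides on sensitive topics.
- Format code snippets in Markdown.
- When relevant, explain your reasoning step by step.
"""

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # assumed model ID for Claude 3.7 Sonnet
    max_tokens=1024,
    system=SYSTEM_PROMPT,                # behavioral instructions live here
    messages=[{"role": "user", "content": "Summarize the leak in two sentences."}],
)
print(response.content[0].text)
```

The system prompt Anthropic uses in its own web and mobile interfaces plays the same role as the `system` parameter here, only at a far larger scale and with additional tool and formatting machinery.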
While this exposure reveals the behavioral engineering that dictates the responses of one of the most advanced conversational agents on the market, it raises a central question: if a model's internal instructions can be exposed and potentially manipulated, to what extent are the security mechanisms supposed to protect them actually robust?
Anthropic and the Bet on Transparency
Since its founding in 2021 by siblings Dario and Daniela Amodei, Anthropic has promoted an approach focused on the reliability, steerability, and interpretability of AI systems. The company introduced the concept of constitutional AI, a training approach aimed at instilling values and principles into AI models, inspired notably by the Universal Declaration of Human Rights.
This stance has translated into a commitment to transparency: in August 2024, Anthropic published the system prompts used in its user interfaces (web and mobile) for Claude 3 Haiku, Claude 3 Opus, and Claude 3.5 Sonnet. This approach continued for Claude 3.7 Sonnet, accompanied by a detailed document, the "Claude 3.7 Sonnet System Card", which describes not only the model's technical capabilities but also its evaluation methods, security mechanisms, and risk-reduction protocols.
The model is described as an "intelligent and kind" conversational partner, capable of discursive initiative, autonomous reasoning, and even subjective hypotheses in certain philosophical contexts. However, as Dario Amodei points out in a blog post titled "The Urgency of Interpretability", a fine-grained understanding of the internal mechanisms of these models remains a significant challenge. The transparency on display does not dispel the opacity of the processes that govern them.
Openness and Security: A Complex Balance
This leak illustrates a growing tension in the development of AI models: how can performance, controllability, and transparency be combined without compromising the robustness of these systems? Making visible the structures that govern an agent's behavior enables external audits and even debate about the ethical choices made upstream, but how can the integrity of these systems be preserved when their foundations are exposed?
As LLMs become the primary interfaces for accessing information and taking action in many sectors, the question is no longer simply technical but political, ethical, and strategic.
To better understand
What is constitutional AI and how does it influence AI models like Claude 3.7 Sonnet?
Constitutional AI is an approach that aims to integrate values and principles inspired by documents like the Universal Declaration of Human Rights into the training of AI models. This method influences models such as Claude 3.7 Sonnet by guiding them towards behaviors that reflect these values, such as reliability and interpretability, while addressing ethical concerns.
What are the regulatory implications of increased transparency in system prompts like those used by Claude 3.7 Sonnet?
Increased transparency in system prompts, like those of Claude 3.7 Sonnet, raises issues regarding data protection and user privacy. Regulators may require higher standards to ensure sensitive information is not compromised while balancing this with the need for transparency for auditing and improving AI models.