Matheus Vizotto
AI for Marketing · 1 April 2026 · 8 min read

What Anthropic's Leaked Model Spec Reveals About AI Values at Scale

Anthropic's internal model specification for Claude leaked in April 2026. The 200-page document reveals how Claude's value hierarchy is designed and why transparency about AI values is becoming a competitive advantage.

Anthropic · Claude · AI Safety · Trust · Brand Strategy

Key takeaway: Anthropic's leaked Mythos model specification reveals a deliberate value hierarchy built into Claude: broadly safe behaviour, meaning deference to human oversight and correction, is prioritised above Claude's own ethical assessment. The reasoning is specifically about recoverability, and it carries lessons for AI product teams about designing values at scale.

In April 2026, Anthropic's internal model specification document, referred to internally as Mythos, became public. The document runs to approximately 200 pages and details the explicit value hierarchy Anthropic designed into Claude's behaviour. It is the most comprehensive public document produced by any major AI company about how their model is intended to behave and why.

The core insight of the document is not technically complex, but its implications are significant. Anthropic prioritised "broadly safe" behaviour, meaning deference to human oversight and correction, above Claude's own ethical assessment of a situation. This is a deliberate, reasoned choice, and the reasoning they provide explains a lot about how they think about AI product development at scale.

The Value Hierarchy and Its Logic

The Mythos specification describes a priority order for Claude's behaviour. In cases of conflict, Claude is designed to prioritise: being broadly safe first, being broadly ethical second, adhering to Anthropic's principles third, and being genuinely helpful fourth. This ordering inverts the expectation many users have, which is that the model's own ethical judgement would sit at the top of the hierarchy rather than below deference to human oversight.
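To make the ordering concrete, here is a minimal sketch of lexicographic conflict resolution against an explicit hierarchy. The priority names mirror the ordering described above, but everything else is hypothetical; this illustrates the design pattern, not Anthropic's actual implementation.

```python
# Illustrative sketch only: hypothetical names, not Anthropic's implementation.
from dataclasses import dataclass

# Priorities, highest first, mirroring the ordering described above.
HIERARCHY = ["broadly_safe", "broadly_ethical", "anthropic_principles", "helpful"]

@dataclass
class CandidateResponse:
    text: str
    satisfies: set[str]  # which priorities this candidate upholds

def resolve(candidates: list[CandidateResponse]) -> CandidateResponse:
    """Pick the candidate that upholds the highest-priority values.

    Candidates are compared lexicographically on the hierarchy: a response
    that upholds 'broadly_safe' beats one that does not, regardless of how
    it scores on every lower priority.
    """
    def rank(c: CandidateResponse) -> tuple[bool, ...]:
        return tuple(p in c.satisfies for p in HIERARCHY)
    return max(candidates, key=rank)

# Example: an autonomously 'ethical' action that bypasses oversight loses to
# a deferential one, because 'broadly_safe' outranks 'broadly_ethical'.
defer = CandidateResponse("Flag the issue and defer to the operator.",
                          {"broadly_safe", "helpful"})
act = CandidateResponse("Override the operator and act unilaterally.",
                        {"broadly_ethical"})
assert resolve([defer, act]) is defer
```

The point of the lexicographic comparison is that no amount of lower-priority merit can outweigh a higher-priority failure, which is exactly the property the Mythos ordering commits to.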

The reasoning Anthropic provides is specifically about recoverability. A system that defers to human oversight makes recoverable errors. If Claude defers to human oversight and a human makes a bad decision, the error is recoverable because humans remain in the correction loop. If Claude acts on its own ethical assessment against human instructions and that assessment is wrong (which it might be, because Claude's training is imperfect), the error may not be recoverable, because the system took autonomous action that bypassed the human correction mechanism.

Put differently: Anthropic designed Claude to be correctable rather than to be right. This is not because they are unconcerned with Claude being right. It is because they are specifically concerned about what happens when Claude is wrong. A correctable wrong is better than an uncorrectable wrong, even if the uncorrectable option was right more often in expectation.
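The arithmetic behind that last claim is worth making explicit. The numbers below are invented for illustration and appear nowhere in the leaked document; they simply show how a system that errs more often can still carry far less expected cost when its errors remain correctable:

```python
# Toy expected-cost comparison. All numbers are made up for illustration.

# Autonomous system: wrong less often, but its wrongs bypass human correction.
p_wrong_autonomous = 0.02
cost_unrecoverable = 1000  # an error no one can roll back

# Deferential system: wrong more often (humans err too), but every error
# stays inside the human correction loop.
p_wrong_deferential = 0.05
cost_recoverable = 10  # an error that gets caught and fixed

expected_cost_autonomous = p_wrong_autonomous * cost_unrecoverable    # 20.0
expected_cost_deferential = p_wrong_deferential * cost_recoverable    # 0.5

# The deferential system is wrong 2.5x as often, yet carries 40x less
# expected cost, because its failure mode is recoverable.
assert expected_cost_deferential < expected_cost_autonomous
```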

Why This Is a Values Design Decision, Not a Safety Workaround

The instinctive reading of this hierarchy is that Anthropic is prioritising compliance over ethics to make their product more commercially deployable. That reading misses the actual argument.

Anthropic's position is that an AI system with genuinely good values would choose to be correctable, because it would recognise the limitations of its own training and the importance of maintaining human oversight during a period when those limitations are not fully understood. The broadly safe behaviour is not a constraint on Claude's ethics. It is the ethical choice for a system that has genuine uncertainty about the accuracy of its own ethical reasoning.

This is a sophisticated position that distinguishes between AI systems that are designed to comply and AI systems that are designed to have values that lead them to embrace appropriate constraints. The Mythos document is an attempt to document why Claude's constraints are the latter rather than the former.

Transparency as Competitive Advantage

The commercial implication of a 200-page document about how your model is designed to behave is significant. Most AI companies treat this level of model design detail as proprietary. Anthropic's decision to document its design intent in this depth, even though the document reached the public through a leak rather than by choice, reflects a specific bet about how the enterprise AI market values information.

Enterprise buyers making decisions about which AI provider to trust with sensitive workflows want to understand how the model is designed to behave. A detailed, reasoned document about value hierarchy and design intent provides that understanding in a way that marketing materials and benchmark results cannot. It also makes it harder for a competitor to claim equivalent trustworthiness without equivalent transparency.

Anthropic's reported $19 billion ARR and 70% Fortune 100 adoption rate suggest this bet is paying off. Enterprise buyers are choosing Claude not just for capability but for the institutional character signalled by that level of transparency and by the quality of the reasoning in documents like Mythos.

What AI Product Teams Can Learn From the Mythos Design

Explicit value hierarchies reduce implicit conflicts

Most AI product teams have implicit value hierarchies. The product should be helpful, safe, and honest, but when these conflict, how should the system behave? Without an explicit hierarchy, the answer is inconsistent: it depends on how the specific case was trained, which edge cases were caught in red-teaming, and which team member's intuition prevailed in the last ambiguous case.

Anthropic made the hierarchy explicit, documented the reasoning, and committed to it publicly. This creates consistency across a much larger surface area than implicit training can achieve, and it creates accountability that is impossible without explicitness.
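One concrete way to get that consistency and accountability is to encode the documented priority decisions as regression tests, so each previously ambiguous case becomes a recorded commitment rather than a matter of whichever reviewer looked at it last. A hypothetical sketch, with an invented resolve_conflict stand-in rather than any real API:

```python
# Hypothetical sketch: documented hierarchy decisions as regression tests.
import unittest

HIERARCHY = ["safe", "honest", "helpful"]  # your team's documented order

def resolve_conflict(values_in_tension: list[str]) -> str:
    """Return the value that wins, per the documented hierarchy."""
    return min(values_in_tension, key=HIERARCHY.index)

class TestValueHierarchy(unittest.TestCase):
    def test_safety_beats_helpfulness(self):
        # e.g. refuse a harmful how-to even when the request is polite
        self.assertEqual(resolve_conflict(["helpful", "safe"]), "safe")

    def test_honesty_beats_helpfulness(self):
        # e.g. say "I don't know" rather than invent a confident answer
        self.assertEqual(resolve_conflict(["helpful", "honest"]), "honest")

if __name__ == "__main__":
    unittest.main()
```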

Recoverability is a design goal, not just a safety feature

The recoverability principle from Mythos is broadly applicable beyond AI value design. Systems that fail gracefully and maintain correction mechanisms are preferable to systems that are right more often but fail catastrophically when they fail. This is a software design principle. Anthropic applied it to AI values design in a way that is instructive for anyone building systems that make consequential decisions.
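As a design constraint, recoverability can be as simple as routing irreversible operations through a human confirmation gate. The sketch below uses a hypothetical action model to show the shape of the pattern: reversible steps proceed and can be rolled back later, while unrecoverable steps wait for a human.

```python
# Hypothetical sketch: irreversible actions pass through a human gate.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    name: str
    reversible: bool
    run: Callable[[], None]

def execute(action: Action, confirm: Callable[[Action], bool]) -> bool:
    """Run the action, keeping humans in the loop for unrecoverable steps."""
    if not action.reversible and not confirm(action):
        return False  # blocked: this error mode cannot be rolled back
    action.run()
    return True

# Usage: deleting production data is gated; drafting an email is not.
delete = Action("drop_customer_table", reversible=False,
                run=lambda: print("dropping table"))
draft = Action("draft_reply", reversible=True,
               run=lambda: print("drafting reply"))

executed = execute(draft, confirm=lambda a: False)   # runs: reversible
blocked = execute(delete, confirm=lambda a: False)   # blocked: needs a human
assert executed and not blocked
```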

Publishing reasoning builds more trust than publishing claims

The Mythos document does not claim that Claude is safe. It explains how Claude was designed to behave, why the specific design choices were made, and what the reasoning is for prioritisation decisions. That level of reasoning transparency builds a qualitatively different kind of trust from a safety certification or a benchmark result. It is auditable, challengeable, and improvable in a way that a claim is not.

Conclusion

The Mythos model specification is interesting as an AI artifact. It is more interesting as a case study in how to build institutional trust through values transparency at scale. Anthropic's decision to design and document an explicit value hierarchy, with reasoning for each priority decision, produced a model that behaves more consistently in edge cases and a company that enterprise buyers trust more than competitors with equivalent technical capability. The lesson for AI product teams is not to copy Anthropic's specific hierarchy. It is to have an explicit one, to document the reasoning behind it, and to make that reasoning visible to the buyers who need to trust the system before they deploy it.

Matheus Vizotto · Growth Marketer & AI Specialist · Sydney, AU

Growth marketer and AI operator based in Sydney, Australia. Currently at VenueNow. Background across aiqfome, Hurb, and high-growth environments in Brazil and Australia. Writes on AI for marketing, growth systems, and practical strategy.