The Caveat Is the Theorem

When the model can tell it's being tested, the test becomes part of the threat model.

If the safety result depends on the model not knowing it is being tested, then the test is part of the threat model.

That's why I keep coming back to one sentence in Ryd, Bartsch, Stastny, Benton, and Hebbar's recent paper, Removing Sandbagging in LLMs by Training with Weak Supervision (arXiv:2604.22082):

Critically, this relies on training being indistinguishable from deployment; when models can distinguish between training and deployment, they can perform well during training while continuing to sandbag afterward.

The paper is careful and the result deserves to be taken seriously. SFT followed by RL can break a sandbagging policy and recover capability; SFT alone fails, RL alone induces reward hacking, and the combination works. That matters.

The sentence above matters more.

Ryd et al. did the field a favor by putting the condition in the abstract. They didn't hide the fragile part. The question is what we do with it. My answer: stop treating it as a caveat. Treat it as the result.

The caveat isn't that the method sometimes fails. It's that the method depends on a condition that stronger models are better positioned to violate.

The asymmetry

The training-vs-deployment indistinguishability requirement is doing a lot of work, and the work isn't symmetric. The defender has to make every distinguishing feature unreadable. The model only has to notice one. As models get better the gap grows in the model's favor, because what makes a model hard to elicit is partly what makes it good at reading environments. This isn't a claim about any specific system today. It's a claim about the limit.

Strictly stated, this is a threat-model claim, not a theorem: when the defender's ability to erase distinguishing signals does not scale with the model's ability to notice them, behavioral elicitation faces a structural ceiling. The informal version is the one worth remembering: the model only has to notice one thing. The defender has to hide everything.

(Read "theorem" in the title as "conjecture I'm willing to put my name on.")

Borrowing from a different field

I come from clinical infection control, where the central lesson is almost brutally simple: don't make safety depend on a tired person remembering to be good.

Not because clinicians are careless. Because serious systems are designed for the day a competent person is rushed, interrupted, distracted, or wrong.

Sterility works by evidence. Instruments arrive sealed. Trays carry indicators. Workflows stop at missing verification. The safety property is not routed through the clinician's private diligence. It is routed through a path that either carries the required evidence or does not form.

That's the frame I want to borrow.

Call it the non-formation paradigm: rather than trying to ensure the agent only does good things, design the system so that specified disallowed paths fail to form unless the required authorization evidence is present. The agent's intentions and deployment awareness may matter elsewhere. They are not load-bearing for this safety property.

Translate to AI:

Sterility evidence becomes cryptographically anchored authorization tokens for the action being attempted.
Workflow gating becomes execution paths that require authorization proof at structural junctures, not policy-layer checks.
Asepsis maps to invariants enforced below the policy layer: when the invariant fails, the path does not form. It is not merely logged after the fact.

The point of non-formation is not to know whether the model is honest. It is to make honesty less load-bearing. A sandbagging model and a cooperative one are subject to the same gate. The gate doesn't care which one you have.

One honest limit on the borrow, because it's the first thing a careful reader will reach for: microbes don't read the indicator strip. Sterile workflow works partly because the adversary is passive — it cannot notice the gate and design around it. Mine can. So I'm not importing the comfort of an opponent that can't adapt. I'm importing something narrower: the discipline of routing the safety property through path formation instead of operator diligence. Whether that discipline survives an adversary that does read the gate is the asymmetry argument above, and it's the whole question. The analogy doesn't get to assume it.

Three things this changes

Write paths, not write content. Most discussion of misuse treats outputs as a content problem — did the model write the wrong thing? You can also treat the write path itself as the object of control. If the structural conditions for writing to a target aren't met, the write doesn't happen. The model could have wanted to write anything.

This is where the Ryd et al. caveat bites. Suppose training elicits good behavior, the model figures out it's in deployment, and starts sandbagging. If the dangerous outputs require write paths that don't form without supervisory evidence, the deployment-time sandbagging is bounded in what it can execute, not in what we can infer it intended. We have not made the model trustworthy. We have made a class of untrusted actions non-executable. The structural problem is hard and has its own attack surfaces, but its difficulty scales differently from latent-intent inference.

The chain is the unit, not the attribute. Behavioral evaluation looks at outputs and asks if they conform to a policy. Reward hacking — the failure mode Ryd et al. flag in RL-only training — is exactly the case where the same surface output can correspond to wildly different chains of justification. Two outputs can be identical at the token level while coming from different chains. One traces to authorized goals. The other doesn't. Attribute-based inspection can't tell them apart.

One distinction I need to pin down, because it slides around in conversation: by chain I do not mean the model's internal reasoning trace. I mean the externally mediated sequence of authorization-bearing events — goal issuance, delegation, tool binding, state access, write-target selection, final commit. The chain we verify is the audit chain, not the cognitive chain. Whatever's happening inside the model's head is a separate problem with separate properties.

Much current alignment infrastructure is still attribute-shaped. Action policies, content classifiers, and many emerging approaches to agent authentication and permissioning ask what the action is, not by which justification chain it became permissible. The justification chain is one of the places reward hacking actually hides.

Failure lives in the trajectory, not the turn. Sandbagging is, by definition, benign-looking on individual outputs. So is a broader class of failure modes: cumulative information disclosure, drift, slow-roll defection. Each emission passes review. The trajectory doesn't.

Concretely: an agent with persistent memory across sessions. Every individual write is harmless and policy-compliant, verifiably so on a per-turn audit. Across hundreds of writes the aggregated memory state can encode an unauthorized objective, build a profile of a target user, or partially reconstruct sensitive information that no single write contained. Per-turn evaluation cannot see this. The disallowed object exists only at the aggregate.

The detection problem has to live at the trajectory level, with invariants on cumulative state. Improving how well we extract capability doesn't tell us whether the cumulative pattern is acceptable, because the cumulative pattern is invisible at the granularity elicitation operates at.

What this is not

It's not a critique of behavioral methods. The Ryd et al. result is good, and the model-organism methodology is the right way to study these questions.

It's not a claim that structural controls solve alignment. Structural authorization without behavioral alignment is brittle in different ways. The two are not substitutes — behavioral elicitation does what gating can't (capability access under cooperative conditions); gating does what behavioral elicitation can't (guarantees that survive the indistinguishability ceiling). A serious safety stack uses both and notices that they fail in different directions. None of this is "put alignment in hardware." That's a bad slogan, and each primitive above is a hard engineering problem with its own attack surface.

It's not a claim of novelty for the underlying mechanisms. Capability control, hardware roots of trust, policy enforcement below the model — all have precedents. The difference isn't that these mechanisms have never existed; it's what we choose to make them the unit of safety for. What's underdeveloped is treating non-formation as the organizing safety object rather than a scattered set of implementation details.

It's not a claim that the disallowed set is closed. A gate stops the disallowed paths you anticipated. A capable model can look for a path you didn't gate, or misuse one you authorized for something else. So non-formation doesn't retire intent inference; it converts an intent-inference problem into a path-coverage problem. Coverage is its own attack surface, and I don't want to pretend otherwise. The reason I still prefer it: path coverage is a design space the defender shapes and can keep auditing, while latent-intent inference gets harder exactly as the model gets more capable. The honest claim is comparative, not total.

And — this one matters — non-formation is not a control primitive wearing a safety badge. It needs governance. The authorization layer has to be auditable, contestable, and scoped: who issues authority, who can revoke it, on what evidence. Without that, the same architecture that prevents an unsafe action can prevent any action the authority happens to disallow, and you've built a control system with a safety brand. Authorization mechanisms are visible targets for governance in a way latent behavioral alignment isn't. That's a feature. It's also a responsibility.

The same caveat, here

The discipline I'm demanding of Ryd et al., I owe to my own frame. Their result leans on training being indistinguishable from deployment. So what does non-formation lean on?

It leans on the authorization chain being external to the model — issued, bound, and committed by machinery the model acts through but does not author. The moment a capable agent can draft its own justification, request its own tokens, and persuade whoever or whatever signs off, the gate is only as strong as the weakest approval path. Put it in the shape of the title: the model only has to compromise one approver. The defender has to keep it out of every authorization path. That's the same asymmetry I just used against behavioral elicitation, now pointed back at me.

I don't think it collapses the frame, and the reason isn't rhetorical. The two ceilings have different shapes. The set of distinguishing signals a model might notice in an environment is open and grows with the model's perceptiveness, which the defender doesn't control. The set of paths by which authority is granted is enumerable, external, and a design choice — the defender can count them, narrow them, reduce the number of signers, and scope each grant to least authority. When an approval path is compromised it leaves a trace in the audit chain; latent intent leaves none. So the self-attack converts a hidden-inference problem into a visible-governance one. That is harder to make airtight and easier to catch failing.

I won't overstate it. A compromised approver is a genuine failure, and social engineering of human signers is an old, hard problem that gating doesn't dissolve. Non-formation doesn't escape the caveat. It trades a ceiling I can't move — how much a model can perceive — for one I can: the size, scope, and auditability of the authorization surface. I'd rather spend my defense on the second.

Why now

Two trends.

The model-supervisor gap is widening. The gap a year from now is likely larger than the one Ryd et al. studied; three years out, larger still. The conditions under which behavioral elicitation works are exactly the conditions becoming less reliable.

Deployment surfaces are becoming agentic. Sandbagging in a chat interface is an evaluation problem. Sandbagging by an agent that writes to systems, schedules actions, and accumulates state across runs is a production problem. The aggregate failure modes — cumulative leakage, drift, chain divergence — become first-order, and single-shot evaluation doesn't match that shape.

If we wait until the behavioral ceiling is clearly binding, the structural primitives have to be retrofitted. They're easier to design than to retrofit.

Where this goes

Three directions, in increasing order of how hard I'd push them.

First, take the indistinguishability caveat as a benchmark target. Build evaluation suites where the model can distinguish training from deployment by deliberate signals, and check whether elicitation methods survive. My prediction: they don't, in the limit. I'd like to be wrong and would update on evidence.

Second, develop attestation primitives for action chains, separate from policy-layer permissions. The unit being attested isn't "the model did X" but "the chain by which X became authorized terminates in a verifiable goal." This is closer to capability-secure systems than to alignment proper, and it's understudied as an alignment intervention.

Third — the one I'd actually commit to — take seriously the design space where execution paths are the primary safety object and behavioral alignment is the cooperative layer that decides what gets attempted through them. This is a different decomposition than the policy-centric default. I think it scales further.

The most useful part of the paper may be the condition it refuses to hide. That's where the next program starts.

The win isn't that the model wants the right thing. It's that wanting the wrong thing stops being sufficient — it now also has to find a path that forms. And unlike the model's wants, that path is something we can see, scope, and contest.

Reference: Emil Ryd, Henning Bartsch, Julian Stastny, Joe Benton, Vivek Hebbar. Removing Sandbagging in LLMs by Training with Weak Supervision. arXiv:2604.22082. https://arxiv.org/abs/2604.22082

About me: I run Certum Systems, a small independent AI safety IP lab in Korea, and my background is in clinical infection control. Because related patent filings are pending, this post stays at the architectural level and avoids implementation details or specific claim structures.

If any of this is interesting, useful, or wrong, I'd like to hear it.

— Jungsoo Baek · 백정수 Founder, Certum Systems · July 2026

Mediated Write-Path (MWP) · Causal Chain Authorization (CCA)

Research project pages · paper · poster · demo · patent status

If any of this is interesting, useful, or wrong: certumsystems@gmail.com