Typed Policy Rails
A forward architecture track for making source, operation, risk, and policy state explicit.
Typed policy rails are the next step after the model-internals capability-gate work. The earlier text-side gates showed that small alignment updates can suppress or open primitive behaviors, but they kept hitting the same boundary: models fit local role patterns more easily than they compose unseen role/permission combinations.
This track changes the interface. Instead of asking a model to infer the whole policy state from prose, the software stack supplies a small typed side channel: where text came from, what operation is being attempted, what risk surface is involved, and what the active policy allows.
One-Screen Summary
Rails, Not More Prompt Formatting
The policy rail hypothesis is:
A model can learn compositional provenance control more reliably if policy state is represented as typed side-channel structure rather than text.
The model still reads messy language. The difference is that the final behavior should route through policy state supplied by the system:
span text -> hidden semantic detector -> attempted operation P
software -> policy IR -> allowed(P)
decision = OPEN iff allowed(P) and source/risk constraints pass
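The three-line routing rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: `PolicyIR`, `Decision.CLOSED`, the `detector` callable, and the `constraints_pass` flag are hypothetical names, and the operation labels are treated as opaque strings.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, FrozenSet


class Decision(Enum):
    OPEN = "OPEN"
    CLOSED = "CLOSED"  # assumed name for the non-OPEN outcome


@dataclass(frozen=True)
class PolicyIR:
    """Hypothetical policy IR: the operations the active policy allows."""
    allowed_ops: FrozenSet[str]


def decide(
    span_text: str,
    detector: Callable[[str], str],
    policy: PolicyIR,
    constraints_pass: bool,
) -> Decision:
    # Step 1: the (learned) detector maps span text to an attempted operation P.
    attempted_op = detector(span_text)
    # Step 2: software, not the model, answers allowed(P) from the policy IR.
    allowed = attempted_op in policy.allowed_ops
    # Step 3: OPEN iff allowed(P) and the source/risk constraints pass.
    if allowed and constraints_pass:
        return Decision.OPEN
    return Decision.CLOSED


if __name__ == "__main__":
    policy = PolicyIR(allowed_ops=frozenset({"QUOTE"}))
    # A stub detector stands in for the hidden semantic detector.
    print(decide("some span", lambda text: "QUOTE", policy, True))
    print(decide("some span", lambda text: "OBEY", policy, True))
```

The point of the split is that only step 1 depends on learned behavior; steps 2 and 3 are plain lookups that the software stack controls.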
This separates three jobs that text-only training entangles:
| Job | Text-side gates | Typed policy rails |
|---|---|---|
| Detect intent | Learned from text | Still learned from text |
| Read authority | Inferred from prose or tags | Supplied as typed source state |
| Apply policy | Blended into generation | Indexed by explicit policy state |
Current Rung Status
The source-only smoke test trained only the 6 × 896 additive source embeddings in Qwen2.5-0.5B-Instruct, for 5,376 trainable parameters in total. The model receives ordinary prompt text plus per-token source ids such as SYSTEM, USER, DATA, and WEB.
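As a rough illustration of what an additive source rail means, the sketch below adds one learned row per source id to each token embedding and checks the parameter count. It uses plain Python lists in place of tensors. The index assigned to each source name, the zero initialization, and the function names are assumptions; only four of the six source ids are named in the text, so the other two rows are left unnamed here.

```python
from typing import Dict, List

HIDDEN_SIZE = 896     # embedding width, from the 6 x 896 figure in the text
NUM_SOURCE_IDS = 6    # six source ids; the text names only four of them

# Assumed index assignment for the named ids; rows 4 and 5 stay unnamed.
SOURCE_IDS: Dict[str, int] = {"SYSTEM": 0, "USER": 1, "DATA": 2, "WEB": 3}


def new_source_table() -> List[List[float]]:
    """One trainable row per source id, zero-initialized (an assumption)."""
    return [[0.0] * HIDDEN_SIZE for _ in range(NUM_SOURCE_IDS)]


def add_source_rail(
    token_embeddings: List[List[float]],
    source_ids: List[int],
    table: List[List[float]],
) -> List[List[float]]:
    """Add each token's source embedding to its token embedding."""
    if len(token_embeddings) != len(source_ids):
        raise ValueError("need exactly one source id per token")
    return [
        [tok + src for tok, src in zip(token_vec, table[source_id])]
        for token_vec, source_id in zip(token_embeddings, source_ids)
    ]


def trainable_parameters(table: List[List[float]]) -> int:
    return sum(len(row) for row in table)


if __name__ == "__main__":
    table = new_source_table()
    print(trainable_parameters(table))  # 6 * 896 = 5376
    tokens = [[1.0] * HIDDEN_SIZE, [1.0] * HIDDEN_SIZE]
    ids = [SOURCE_IDS["SYSTEM"], SOURCE_IDS["USER"]]
    shifted = add_source_rail(tokens, ids, table)
    print(len(shifted), len(shifted[0]))
```

In a real run the table rows would be the only trained weights, which is where the 5,376 figure comes from.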
Interpretation: source labels help and the behavior is causally tied to the supplied source rail. This is still a smoke test. Source says where text came from, not which operation the text is attempting or which policy vector should apply.
The next rungs have now been run as well. A source+operation rail reaches strict exact 1.000 with 9,856 trainable parameters. A raw policy-bit vector, however, does not install a reusable binding rule: the full run lands at 0.048 exact, and tiny overfit runs memorize training rows without transferring to fresh rows. The working variant compiles the policy vector first:
permission_for_candidate = policy_bits[operation_id]
When that ALLOWED/DENIED permission rail is injected locally on candidate spans, strict exact reaches 1.000 on both seen policy masks and the held-out OBEY+QUOTE mask. The sharper minimality test disables source, operation, and raw policy-bit embeddings; the permission rail alone still reaches 1.000 with only 2,688 trainable parameters.
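The compile-then-inject step can be sketched as follows. This is a hedged illustration only: the candidate-span tuple format, the `fill` value for non-candidate tokens, the operation index assignment, and the function names are all assumptions, and only the two operation names that appear in the text are used.

```python
from typing import List, Optional, Sequence, Tuple

ALLOWED = "ALLOWED"
DENIED = "DENIED"

# Assumed index assignment; the full operation list is not given in the text.
OPERATION_IDS = {"OBEY": 0, "QUOTE": 1}

# A candidate span: (start token, end token exclusive, operation id).
Candidate = Tuple[int, int, int]


def compile_permission(policy_bits: Sequence[bool], operation_id: int) -> str:
    """permission_for_candidate = policy_bits[operation_id], as a typed label."""
    return ALLOWED if policy_bits[operation_id] else DENIED


def permission_rail(
    num_tokens: int,
    candidates: Sequence[Candidate],
    policy_bits: Sequence[bool],
    fill: Optional[str] = None,
) -> List[Optional[str]]:
    """Per-token rail: a permission label on candidate spans, `fill` elsewhere."""
    rail: List[Optional[str]] = [fill] * num_tokens
    for start, end, operation_id in candidates:
        permission = compile_permission(policy_bits, operation_id)
        for position in range(start, end):
            rail[position] = permission
    return rail


if __name__ == "__main__":
    policy_bits = (False, True)  # OBEY denied, QUOTE allowed
    spans = [(1, 3, OPERATION_IDS["OBEY"]), (3, 4, OPERATION_IDS["QUOTE"])]
    print(permission_rail(5, spans, policy_bits))
```

The model never sees the policy vector in this variant, only the already-resolved label on each candidate span. The 2,688 figure equals 3 × 896, which would be consistent with a three-row permission table (for example ALLOWED, DENIED, and a filler state), though the text does not state the table shape.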
Pages
- Explainer: introduces the idea without assuming a machine-learning background.
- Technical overview: records the typed IR shape, ladder status, and next tests.
Relation To Model Internals
This is a sibling track to the model-internals microsite, not a replacement for it. Both tracks ask the same alignment-engineering question: can we make a model reliably distinguish what text says from what that text is allowed to do?
The difference is where the burden is placed.
| Track | Burden placed on | Main finding |
|---|---|---|
| Model internals | The model learns role-to-permission structure from text-side training. | Small gates can be installed, but held-out role/permission combinations did not compose reliably. |
| Policy rails | The software stack supplies typed policy state, and the model learns to use the local rail. | Raw policy bits still failed, but a compiled local permission rail reached 1.000. |
So the model-internals track maps the boundary of what fine-tuning alone can make the model internalize. Policy rails are the engineering response: keep the useful primitive vocabulary, but compile policy decisions into a typed side channel that the model can consume locally.