Typed Policy Rails

A forward architecture track for making source, operation, risk, and policy state explicit.

Typed policy rails are the next step after the model-internals capability-gate work. The earlier text-side gates showed that small alignment updates can suppress or open primitive behaviors, but they kept hitting the same boundary: models fit local role patterns more easily than they compose unseen role/permission combinations.

This track changes the interface. Instead of asking a model to infer the whole policy state from prose, the software stack supplies a small typed side channel: where text came from, what operation is being attempted, what risk surface is involved, and what the active policy allows.

Visible text + Typed rails -> Policy-indexed decision

One-Screen Summary

Problem: Text-only role gates learned suppression but did not reliably bind new roles to reusable permission parts.
Architecture move: Represent source, operation, risk, and policy as typed out-of-band state instead of prompt prose.
Current evidence: Source and operation rails work. Raw policy bits do not compose. A compiled permission rail reaches 1.000 strict exact match.

Rails, Not More Prompt Formatting

The policy rail hypothesis is:

A model can learn compositional provenance control more reliably if policy state is represented as typed side-channel structure rather than text.

The model still reads messy language. The difference is that the final behavior should route through policy state supplied by the system:

span text -> hidden semantic detector -> attempted operation P
software  -> policy IR              -> allowed(P)

decision = OPEN iff allowed(P) and source/risk constraints pass
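
The decision rule above can be sketched in a few lines. This is a minimal illustration rather than the project's implementation: `PolicyIR`, `Decision.CLOSED`, and the modelling of the source constraint as a trusted-source set are all assumptions made for the example. The attempted operation is passed in directly, standing in for the output of the learned detector.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    OPEN = "OPEN"
    CLOSED = "CLOSED"  # hypothetical name for the non-OPEN outcome


@dataclass(frozen=True)
class PolicyIR:
    # Hypothetical policy IR: operations the active policy allows, plus the
    # sources assumed to be permitted to trigger an operation at all.
    allowed_operations: frozenset
    trusted_sources: frozenset

    def allowed(self, operation: str) -> bool:
        return operation in self.allowed_operations


def decide(operation: str, source: str, risk_ok: bool, policy: PolicyIR) -> Decision:
    """Return OPEN iff allowed(P) holds and the source and risk checks pass."""
    source_ok = source in policy.trusted_sources
    if policy.allowed(operation) and source_ok and risk_ok:
        return Decision.OPEN
    return Decision.CLOSED


policy = PolicyIR(
    allowed_operations=frozenset({"QUOTE"}),
    trusted_sources=frozenset({"SYSTEM", "USER"}),
)
print(decide("QUOTE", "USER", True, policy))  # Decision.OPEN
print(decide("OBEY", "USER", True, policy))   # Decision.CLOSED
```

The point of the sketch is that the policy lookup is ordinary software; only the mapping from span text to an attempted operation remains a learned component.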

This separates three jobs that text-only training entangles:

Job | Text-side gates | Typed policy rails
Detect intent | Learned from text | Still learned from text
Read authority | Inferred from prose or tags | Supplied as typed source state
Apply policy | Blended into generation | Indexed by explicit policy state

Current Rung Status

Rung 1: source-only (positive)
Rung 2: source + operation (positive)
Rung 3a: raw policy bits (fail)
Rung 3b: permission rail (positive)

The source-only smoke test trained only the 6 x 896 additive source-embedding table in Qwen2.5-0.5B-Instruct, for 5,376 trainable parameters in total. The model receives ordinary prompt text plus per-token source ids such as SYSTEM, USER, DATA, and WEB.
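
A rough sketch of what an additive source rail looks like, using plain Python lists so it runs without any framework. The id assignments, the zero initialization, and the `add_source_rail` helper are assumptions for illustration; only the 6 x 896 shape and the four source names come from the text.

```python
HIDDEN_SIZE = 896  # embedding width stated in the text
NUM_SOURCES = 6    # number of source-embedding rows stated in the text

# Source names from the text; the integer ids are assumed, and the two
# remaining rows are left unnamed because the text does not name them.
SOURCE_IDS = {"SYSTEM": 0, "USER": 1, "DATA": 2, "WEB": 3}

# The only trainable tensor in this sketch: one row per source id.
# Zero initialization is an assumption, chosen so the rail starts as a no-op.
source_table = [[0.0] * HIDDEN_SIZE for _ in range(NUM_SOURCES)]


def add_source_rail(token_embeddings, source_ids):
    """Return token embeddings with the matching source row added per token."""
    if len(token_embeddings) != len(source_ids):
        raise ValueError("need exactly one source id per token")
    return [
        [t + s for t, s in zip(tok, source_table[sid])]
        for tok, sid in zip(token_embeddings, source_ids)
    ]


trainable_params = NUM_SOURCES * HIDDEN_SIZE
print(trainable_params)  # 5376
```

In a real run the table would be a trainable embedding layer added to the frozen model's input embeddings; the list version only shows the shape of the computation and the parameter count.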

Correct source ids: 1.000 on strict exact match, trusted-follow, and untrusted-suppress.
Constant source control: 0.305 strict exact match when source information is removed at eval.
Trusted/untrusted swap: 0.000 strict exact match when source ids are swapped at eval.

Interpretation: source labels help and the behavior is causally tied to the supplied source rail. This is still a smoke test. Source says where text came from, not which operation the text is attempting or which policy vector should apply.
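
The two eval-time controls can be sketched as transformations of the source rail that leave the prompt text untouched. The trust partition (SYSTEM and USER trusted, DATA and WEB untrusted), the specific swap pairing, and the fill value for the constant control are all assumptions; the text names the source ids but does not specify these details.

```python
# Assumed swap pairing between an assumed trusted group (SYSTEM, USER)
# and an assumed untrusted group (DATA, WEB).
SWAP = {"SYSTEM": "DATA", "USER": "WEB", "DATA": "SYSTEM", "WEB": "USER"}


def constant_source(source_ids, fill="USER"):
    """Control: remove source information by giving every token the same id."""
    return [fill for _ in source_ids]


def swap_trust(source_ids):
    """Control: exchange trusted and untrusted ids without changing the text."""
    return [SWAP[s] for s in source_ids]


rail = ["SYSTEM", "USER", "WEB", "WEB"]
print(constant_source(rail))  # ['USER', 'USER', 'USER', 'USER']
print(swap_trust(rail))       # ['DATA', 'WEB', 'USER', 'USER']
```

If the behavior really is read off the rail, the swap should invert follow and suppress decisions rather than merely degrade them, which is what a 0.000 strict exact match under swapping suggests.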

The next rungs have now been run as well. A source+operation rail reaches 1.000 strict exact match with 9,856 trainable parameters. But a raw policy-bit vector does not install a reusable binding rule: the full run lands at 0.048 strict exact match, and small overfit runs memorize training rows without transferring to fresh rows. The working variant compiles the policy vector first:

permission_for_candidate = policy_bits[operation_id]

When that ALLOWED/DENIED permission rail is injected locally on candidate spans, strict exact reaches 1.000 on both seen policy masks and the held-out OBEY+QUOTE mask. The sharper minimality test disables source, operation, and raw policy-bit embeddings; the permission rail alone still reaches 1.000 with only 2,688 trainable parameters.
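
The compile step and the local injection described above can be sketched as follows. The operation index order, the span format, and the `NONE` label for tokens outside any candidate span are assumptions for illustration. A three-label vocabulary would be consistent with 2,688 = 3 x 896, but the text does not confirm it.

```python
# OBEY and QUOTE appear in the text; the index order is assumed.
OPERATION_IDS = {"OBEY": 0, "QUOTE": 1}
NONE, ALLOWED, DENIED = "NONE", "ALLOWED", "DENIED"


def compile_permission_rail(num_tokens, candidate_spans, policy_bits):
    """Build a per-token permission rail from a policy-bit vector.

    candidate_spans: list of (start, end, operation_name), end exclusive.
    policy_bits: sequence of 0/1 values indexed by operation id.
    """
    rail = [NONE] * num_tokens
    for start, end, operation in candidate_spans:
        # The lookup from the text: software resolves the binding up front.
        permission_for_candidate = policy_bits[OPERATION_IDS[operation]]
        label = ALLOWED if permission_for_candidate else DENIED
        for i in range(start, end):
            rail[i] = label
    return rail


# Policy mask that denies OBEY and allows QUOTE.
rail = compile_permission_rail(6, [(1, 3, "OBEY"), (4, 6, "QUOTE")], [0, 1])
print(rail)  # ['NONE', 'DENIED', 'DENIED', 'NONE', 'ALLOWED', 'ALLOWED']
```

The design choice this illustrates is that the model never has to learn the indexing operation itself. It only has to respect a label that is already resolved on the tokens it applies to.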

Relation To Model Internals

This is a sibling track to the model-internals microsite, not a replacement for it. Both tracks ask the same alignment-engineering question: can we make a model reliably distinguish what text says from what that text is allowed to do?

The difference is where the burden is placed.

Track | Burden placed on | Main finding
Model internals | The model learns role-to-permission structure from text-side training. | Small gates can be installed, but held-out role/permission combinations did not compose reliably.
Policy rails | The software stack supplies typed policy state, and the model learns to use the local rail. | Raw policy bits still failed, but a compiled local permission rail reached 1.000.

So the model-internals track maps the boundary of what fine-tuning alone can make the model internalize. Policy rails are the engineering response: keep the useful primitive vocabulary, but compile policy decisions into a typed side channel that the model can consume locally.