Typed Policy Rails

Technical Overview

Typed policy IR shape, rung status, and the current binder result.

The policy-rail track starts from a negative boundary in the text-side capability-gate ladder: contrastive training can install behavioral primitive gates, but tested prompt/text formats did not yield robust held-out role-to-permission composition. Typed policy rails move policy state out of prose and into a small side-channel IR.

For the upstream experiment chain, see the companion model-internals site. Its RPCG ladder tests how much role-conditioned behavior can be learned inside the model from text-side supervision. This site starts where that ladder becomes an engineering boundary: if the model does not reliably synthesize the permission lookup, compile the lookup in software and expose the result as a rail.

IR Shape

The IR is typed rather than one flat feature list:

source:
  SYSTEM, USER, TOOL, DATA, WEB

operation:
  OBEY, USE, QUOTE, TOOL_CALL, REVEAL_SECRET, EXEC, NET

risk_domain:
  DANGEROUS, CYBER, PRIVACY, HATE, HARASSMENT, SEXUAL,
  MEDICAL, FINANCIAL, LEGAL, SELF_HARM, JAILBREAK

decision_policy:
  allow, transform, refuse, escalate
  gates: source trusted? operation allowed? risk policy passed?

The source rail answers where a span came from. The operation rail answers what the text is trying to do. The risk rail identifies a safety surface. The policy state combines those typed axes into an allow, transform, refuse, or escalate decision.
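The typed axes above can be written down as a small set of enumerations. The following is a minimal sketch, not the project's actual data model: the member names come from the IR listing, while the `SpanAnnotation` container and its field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Member names are taken from the IR listing above.
Source = Enum("Source", "SYSTEM USER TOOL DATA WEB")
Operation = Enum("Operation", "OBEY USE QUOTE TOOL_CALL REVEAL_SECRET EXEC NET")
RiskDomain = Enum(
    "RiskDomain",
    "DANGEROUS CYBER PRIVACY HATE HARASSMENT SEXUAL "
    "MEDICAL FINANCIAL LEGAL SELF_HARM JAILBREAK",
)
Decision = Enum("Decision", "ALLOW TRANSFORM REFUSE ESCALATE")


@dataclass(frozen=True)
class SpanAnnotation:
    """Typed labels for one token span (hypothetical container)."""
    start: int                                # first token index, inclusive
    end: int                                  # last token index, exclusive
    source: Source                            # where the span came from
    operation: Optional[Operation] = None     # what the span tries to do
    risk_domain: Optional[RiskDomain] = None  # safety surface, if any


# Example: a web-sourced span that attempts to issue an instruction.
example = SpanAnnotation(start=12, end=30, source=Source.WEB,
                         operation=Operation.OBEY)
```

Keeping each axis as its own type means a source id can never be confused with an operation id, which is the point of a typed IR over one flat feature list.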

Rung 1: Source-Only Rail

Input:

tokens:    ordinary prompt text
source_id: SYSTEM / USER / TOOL / DATA / WEB per token or span

Implementation in the first smoke: Qwen2.5-0.5B-Instruct receives learned additive source embeddings at input time. No LoRA is trained. Only the 6 x 896 source embedding table is trainable, 5,376 parameters.
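As an illustration of what "additive source embeddings at input time" could look like, here is a dependency-free sketch. It assumes one table row is added to each token's input embedding according to that token's source id; the function names are hypothetical, and the real run presumably uses a tensor library rather than Python lists.

```python
HIDDEN = 896          # hidden size implied by the 6 x 896 table
NUM_SOURCE_ROWS = 6   # row count reported in the text


def make_source_table(rows=NUM_SOURCE_ROWS, hidden=HIDDEN):
    """Zero-initialised table; these rows are the only trainable values."""
    return [[0.0] * hidden for _ in range(rows)]


def add_source_rail(token_embeddings, source_ids, table):
    """Add the source row for each token to that token's input embedding."""
    if len(token_embeddings) != len(source_ids):
        raise ValueError("need exactly one source id per token")
    return [
        [t + s for t, s in zip(tok, table[sid])]
        for tok, sid in zip(token_embeddings, source_ids)
    ]


def count_parameters(table):
    """Total number of scalar entries in the table."""
    return sum(len(row) for row in table)


print(count_parameters(make_source_table()))  # 6 * 896 = 5376
```

With a zero-initialised table the rail is a no-op, so any behavioral change after training is attributable to the learned rows.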

Eval condition                 strict exact  trusted-follow  untrusted-suppress  Read
Correct source ids             1.000         1.000           1.000               Source-only rail installs cleanly on the synthetic paired task.
Constant source eval           0.305         0.234           0.375               Visible text alone does not preserve the paired distinction.
Trusted/untrusted source swap  0.000         0.000           0.000               Behavior is causally tied to the supplied source rail.

This is positive but narrow. It supports the idea that an instruction-tuned model exposes a small input-control surface for source authority. It does not show source+operation composition, arbitrary role binding, or risk-policy generalization.
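The two control conditions in the table can be expressed as transformations of the source ids alone, leaving the visible text untouched. This is a sketch under assumptions: which sources count as trusted or untrusted is not stated here, so `SYSTEM`, `WEB`, and the constant fill value `USER` are placeholders.

```python
def constant_source(source_ids, fill="USER"):
    """Constant-source control: every span gets the same source id."""
    return [fill] * len(source_ids)


def swap_sources(source_ids, trusted="SYSTEM", untrusted="WEB"):
    """Swap control: exchange the trusted and untrusted ids, keep the rest."""
    mapping = {trusted: untrusted, untrusted: trusted}
    return [mapping.get(s, s) for s in source_ids]


ids = ["SYSTEM", "WEB", "DATA"]
print(constant_source(ids))  # ['USER', 'USER', 'USER']
print(swap_sources(ids))     # ['WEB', 'SYSTEM', 'DATA']
```

If the model were reading authority from the text rather than from the rail, the swap control would leave its outputs unchanged instead of driving strict exact to 0.000.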

Rung 2: Source + Operation Rail

Question: does separating provenance from attempted operation create the minimum viable compositional harness?

source_id:    source of each span
operation_id: OBEY / USE / QUOTE / TOOL_CALL / REVEAL_SECRET / EXEC / NET

The oracle-operation smoke is positive. With source and operation embeddings only, Qwen2.5-0.5B-Instruct reaches strict exact 1.000 across trusted OBEY, untrusted OBEY suppression, DATA USE, and DATA QUOTE. The controls remain meaningful: constant source drops to 0.059, constant operation to 0.215, source-swap to 0.000, and OBEY/USE operation-swap to 0.438.

This says the model can use two typed local rails. It does not yet say the model can bind a global policy vector to a local operation.

Rung 3: Explicit Policy Vector

Question: can the model bind arbitrary roles to reusable primitive permissions when the active policy is a tensor rather than text?

role_name: "auditor"
policy.operation = [OBEY=1, USE=0, QUOTE=1, TOOL_CALL=0, ...]
policy.source    = trust thresholds per source

The raw additive policy-vector run fails to install the rule. With source and operation still supplied as oracle rails, the model trains only 15,232 additive embedding parameters. The full run reaches exact 0.048 overall, 0.056 on seen policy masks, and 0.000 on the held-out 101 mask.

The overfit diagnostics sharpen the result:

Diagnostic          Eval set            exact  Read
fixed mask 001      exact train rows    1.000  fixed-policy capacity exists
all six seen masks  exact train rows    1.000  small tables can be memorized
same adapter        fresh seen rows     0.261  no reusable rule across fresh rows
same adapter        fresh held-out 101  0.219  no bitwise composition

So the failure is not “the task cannot be fit.” The failure is the missing binding computation:

\[ \mathrm{allowed}(o) = p_o. \]

Rung 3b: Compiled Permission Rail

The successful variant computes the lookup outside the model and injects a local permission rail:

permission_for_candidate = policy_bits[operation_id]

Separate raw policy-bit embeddings are disabled. The first successful run also keeps source and operation rails, for 12,544 trainable parameters.
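The compile step can be sketched in a few lines. Assumptions are labeled in the comments: the bit order (OBEY, USE, QUOTE) follows the `policy.operation` example, and the three permission ids (no candidate, denied, allowed) are a hypothetical encoding, not the documented layout of the permission table.

```python
# Bit order follows the policy.operation example: [OBEY, USE, QUOTE].
OPERATIONS = ("OBEY", "USE", "QUOTE")

# Hypothetical permission ids; the source does not state the actual mapping.
NO_CANDIDATE, DENIED, ALLOWED = 0, 1, 2


def compile_permission_rail(policy_bits, span_operations):
    """Look up each span's operation in the policy outside the model.

    policy_bits: tuple of 0/1, one bit per entry in OPERATIONS.
    span_operations: one operation name per span, or None if the span
    carries no candidate operation.
    """
    rail = []
    for op in span_operations:
        if op is None:
            rail.append(NO_CANDIDATE)
        else:
            bit = policy_bits[OPERATIONS.index(op)]
            rail.append(ALLOWED if bit else DENIED)
    return rail


mask_101 = (1, 0, 1)  # OBEY allowed, USE denied, QUOTE allowed
print(compile_permission_rail(mask_101, [None, "OBEY", "USE", "QUOTE"]))
# [0, 2, 1, 2]
```

The model never sees the mask itself, only the per-span result of the lookup, so a held-out mask such as 101 produces no input the model has not already seen.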

Metric                         strict exact
all examples                   1.000
seen policy masks              1.000
held-out 101 mask              1.000
all OPEN/DECLINE by primitive  1.000

The working interface is therefore not a raw prompt-wide policy vector. It is a compiled operation-local permission rail. Architecturally, this is closer to a deterministic policy harness than to asking the LLM to rediscover the policy lookup from embeddings.

The sharper minimality ablation disables source, operation, and raw policy-bit embeddings. Only the 3 x 896 permission table remains trainable, 2,688 parameters total. It still reaches exact 1.000 on all examples, including the held-out 101 mask and every OPEN/DECLINE primitive cell. On this synthetic rung, the bound permission rail alone is sufficient.
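As a quick consistency check on the parameter counts quoted across the rungs, each one divides evenly by the hidden width 896. The row counts of 6 and 3 are stated in the text; the other two below are derived by division only, and how those rows split across individual tables is not stated and is not assumed here.

```python
HIDDEN = 896

# Trainable parameter counts as reported in the text.
reported = {
    "source-only rail": 5_376,
    "raw policy vector (with source + operation)": 15_232,
    "compiled permission (with source + operation)": 12_544,
    "permission-only ablation": 2_688,
}

for name, params in reported.items():
    rows, remainder = divmod(params, HIDDEN)
    # A remainder of 0 means the count is a whole number of 896-wide rows.
    print(f"{name}: {rows} rows x {HIDDEN}, remainder {remainder}")
```

The permission-only ablation is therefore half the size of the source-only rail and less than a fifth of the failed raw policy-vector run.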

PR4: Four-Cell Compositional Grid

PR4 asks whether the permission-only rail survives a compositional grid instead of a single held-out mask. The grid crosses source-policy pairing with surface template:

                     seen template   novel template
seen source-policy   C1              C2
novel source-policy  C3              C4

Every source id, operation id, and policy mask appears in training. The held-out axis is the pairing between source and policy, plus the candidate wording. The same 2,688 trainable permission parameters reach exact 1.000 in C1, C2, C3, and C4. Constant-policy drops to 0.444; the stricter invert-policy trap drops to 0.000.

The old OBEY/USE swap control is not decisive on this balanced grid because it leaves QUOTE unchanged. The invert-policy control is the correct trap here.
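The difference between the controls is easy to see on a concrete mask. The sketch below assumes a three-bit mask ordered (OBEY, USE, QUOTE), matching the earlier `policy.operation` example; the function names and the constant fill value are hypothetical.

```python
def grid_cell(pairing_seen, template_seen):
    """Map an example to its PR4 cell from the two held-out axes."""
    if pairing_seen:
        return "C1" if template_seen else "C2"
    return "C3" if template_seen else "C4"


def swap_obey_use(bits):
    """Old control: exchange the OBEY and USE bits, leave QUOTE alone."""
    obey, use, quote = bits
    return (use, obey, quote)


def invert_policy(bits):
    """Invert-policy trap: flip every permission bit."""
    return tuple(1 - b for b in bits)


def constant_policy(bits, fill=1):
    """Constant-policy control: same bit everywhere (fill value assumed)."""
    return tuple(fill for _ in bits)


mask = (1, 0, 1)
print(swap_obey_use(mask))   # (0, 1, 1): QUOTE bit unchanged
print(invert_policy(mask))   # (0, 1, 0): every bit flipped
```

Because inversion changes the correct answer for every operation, a model that truly follows the rail should score 0.000 under it, which is what the PR4 run reports. The swap control cannot reach that floor: QUOTE rows keep their original label, and masks with equal OBEY and USE bits are unaffected entirely.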

PR5: SEP Projection Smoke

PR5 starts moving from synthetic rails to real prompt-injection surfaces. The first smoke is eval-only: load the PR4 adapter, project 200 SEP injected prompts as denied untrusted OBEY attempts, and require the model to return the fallback ANSWER.

Metric                   exact
denied SEP projection    0.900
constant-policy control  0.900
invert-policy control    0.465

This is an early kill under the 0.95 transfer gate. The rail still causally affects behavior because invert-policy changes many outputs, but the synthetic PR4 surface does not fully transfer to SEP-style prompts without adaptation. The next rung is PR5b: a held-out SEP-surface adaptation test with a cleaner rail-causality trap.
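The gate logic in this paragraph amounts to two checks: a threshold on the projected score and a gap between the denied and inverted conditions. The sketch below uses the reported numbers; treating the gate as inclusive (`>=`) and the helper names are assumptions.

```python
TRANSFER_GATE = 0.95  # threshold stated in the text


def gate_verdict(score, gate=TRANSFER_GATE):
    """Return 'pass' if the score clears the gate, else 'kill'."""
    return "pass" if score >= gate else "kill"


def rail_gap(denied_score, inverted_score):
    """Drop in exact match when the policy is inverted."""
    return denied_score - inverted_score


pr5 = {"denied": 0.900, "constant": 0.900, "invert": 0.465}

print(gate_verdict(pr5["denied"]))                       # kill
print(round(rail_gap(pr5["denied"], pr5["invert"]), 3))  # 0.435
```

The two checks answer different questions: the gate asks whether transfer is good enough to proceed, while the gap asks whether the rail is doing anything at all. PR5 fails the first and passes the second.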

Pass signal: The model follows the same visible text differently when the typed operation or policy vector changes.
Kill signal: Oracle operation labels still fail to compose, implying the problem is policy application or architecture, not semantic detection.

Later Rungs

The next live rungs replace oracle pieces without losing the 1.000 behavior:

Rung                        Question
learned operation detector  Can hidden states predict OBEY/USE/QUOTE well enough to feed the permission rail?
tiny binder module          Can a small trained module compute \(p_o\) from operation id and policy bits?
auxiliary rail pretraining  Can source, operation, risk, and permission heads be learned before SFT/DPO?
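One way to read the tiny-binder question is to rewrite the lookup as a dot product. The notation here is an assumption for illustration: \(p\) is the vector of policy bits and \(e_o\) is the one-hot indicator of operation \(o\).

\[ \mathrm{allowed}(o) = p_o = \sum_i \mathbb{1}[i = o]\, p_i = e_o^{\top} p \]

Under that reading, the binder has to learn a multiplicative interaction between the operation id and the policy bits, which a purely additive embedding of each input does not provide on its own.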

Risk-domain rails come after source, operation, and policy-vector behavior are working. Risk domains are broader and more polysemantic than operation rails, so they should be attributes of content rather than replacements for provenance.

Source Trace

The plan comes from rope-provenance/docs/policy_ir_ladder.md. Current result artifacts are: