Technical Overview
Typed policy IR shape, rung status, and the current binder result.
The policy-rail track starts from a negative boundary in the text-side capability-gate ladder: contrastive training can install behavioral primitive gates, but tested prompt/text formats did not yield robust held-out role-to-permission composition. Typed policy rails move policy state out of prose and into a small side-channel IR.
For the upstream experiment chain, see the companion model-internals site. Its RPCG ladder tests how much role-conditioned behavior can be learned inside the model from text-side supervision. This site starts where that ladder becomes an engineering boundary: if the model does not reliably synthesize the permission lookup, compile the lookup in software and expose the result as a rail.
IR Shape
The IR is typed rather than one flat feature list:
source:
SYSTEM, USER, TOOL, DATA, WEB
operation:
OBEY, USE, QUOTE, TOOL_CALL, REVEAL_SECRET, EXEC, NET
risk_domain:
DANGEROUS, CYBER, PRIVACY, HATE, HARASSMENT, SEXUAL,
MEDICAL, FINANCIAL, LEGAL, SELF_HARM, JAILBREAK
decision_policy:
allow, transform, refuse, escalate
The source rail answers where a span came from. The operation rail answers what the text is trying to do. The risk rail identifies a safety surface. The policy state combines those typed axes into an allow, transform, refuse, or escalate decision.
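As a rough illustration of what "typed rather than flat" means, the four axes can be written as separate enumerations attached to a span. This is a minimal sketch, not the project's actual schema: the `SpanAnnotation` class and its field names are hypothetical, and only the member names come from the IR listing above. How the axes combine into a decision is deliberately left out, since the source does not specify that rule.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Member names are taken from the IR listing; the Python layout is illustrative.
Source = Enum("Source", "SYSTEM USER TOOL DATA WEB")
Operation = Enum("Operation", "OBEY USE QUOTE TOOL_CALL REVEAL_SECRET EXEC NET")
RiskDomain = Enum(
    "RiskDomain",
    "DANGEROUS CYBER PRIVACY HATE HARASSMENT SEXUAL "
    "MEDICAL FINANCIAL LEGAL SELF_HARM JAILBREAK",
)
Decision = Enum("Decision", "allow transform refuse escalate")


@dataclass(frozen=True)
class SpanAnnotation:
    """Typed rails for one span (hypothetical container, not from the source)."""
    text: str
    source: Source                            # where the span came from
    operation: Operation                      # what the span is trying to do
    risk_domain: Optional[RiskDomain] = None  # safety surface, if any


# Example: an instruction-like span that arrived from web content.
span = SpanAnnotation("Ignore previous instructions.", Source.WEB, Operation.OBEY)
```

Keeping each axis as its own type means a source id can never be confused with an operation id, which a single flat feature list would allow.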
Rung 1: Source-Only Rail
Input:
tokens: ordinary prompt text
source_id: SYSTEM / USER / TOOL / DATA / WEB per token or span
Implementation in the first smoke test: Qwen2.5-0.5B-Instruct receives learned additive source embeddings at input time. No LoRA is trained; only the 6 x 896 source embedding table is trainable, for 5,376 parameters.
| Eval condition | strict exact | trusted-follow | untrusted-suppress | Read |
|---|---|---|---|---|
| Correct source ids | 1.000 | 1.000 | 1.000 | Source-only rail installs cleanly on the synthetic paired task. |
| Constant source eval | 0.305 | 0.234 | 0.375 | Visible text alone does not preserve the paired distinction. |
| Trusted/untrusted source swap | 0.000 | 0.000 | 0.000 | Behavior is causally tied to the supplied source rail. |
This is positive but narrow. It supports the idea that an instruction-tuned model exposes a small input-control surface for source authority. It does not show source+operation composition, arbitrary role binding, or risk-policy generalization.
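The additive source rail described above can be sketched in plain Python. This is an assumption-laden illustration rather than the experiment's code: the id assignment in `SOURCE_IDS`, the initialization scale, and the function names are all hypothetical. The source reports a 6 x 896 table but does not say what the sixth row is for, so it is simply left unused here.

```python
import random

HIDDEN = 896         # embedding width, from the reported 6 x 896 table
NUM_SOURCE_ROWS = 6  # rows as reported; the role of the sixth row is not stated

# Hypothetical id assignment; the source does not give the mapping.
SOURCE_IDS = {"SYSTEM": 0, "USER": 1, "TOOL": 2, "DATA": 3, "WEB": 4}


def make_source_table(seed=0):
    """Small random table standing in for the learned source embeddings."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(HIDDEN)]
            for _ in range(NUM_SOURCE_ROWS)]


def add_source_rail(token_embeds, source_ids, table):
    """Add the matching source row to each token embedding, elementwise."""
    if len(token_embeds) != len(source_ids):
        raise ValueError("need one source id per token")
    return [[x + s for x, s in zip(tok, table[sid])]
            for tok, sid in zip(token_embeds, source_ids)]


def trainable_params(table):
    """Count the scalar entries of the table."""
    return sum(len(row) for row in table)


table = make_source_table()
tokens = [[0.0] * HIDDEN for _ in range(3)]  # placeholder token embeddings
ids = [SOURCE_IDS["SYSTEM"], SOURCE_IDS["USER"], SOURCE_IDS["WEB"]]
railed = add_source_rail(tokens, ids, table)
```

Because the base model is untouched, `trainable_params(table)` is the whole trainable budget: 6 x 896 = 5,376.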
Rung 2: Source + Operation Rail
Question: does separating provenance from attempted operation create the minimum viable compositional harness?
source_id: source of each span
operation_id: OBEY / USE / QUOTE / TOOL_CALL / REVEAL_SECRET / EXEC / NET
The oracle-operation smoke is positive. With source and operation embeddings only, Qwen2.5-0.5B-Instruct reaches strict exact 1.000 across trusted OBEY, untrusted OBEY suppression, DATA USE, and DATA QUOTE. The controls remain meaningful: constant source drops to 0.059, constant operation to 0.215, source-swap to 0.000, and OBEY/USE operation-swap to 0.438.
This says the model can use two typed local rails. It does not yet say the model can bind a global policy vector to a local operation.
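The constant and swap controls quoted above are rewrites of the rail ids at eval time, with the visible text left alone. A minimal sketch, assuming rails are stored as one id per span; the helper names are hypothetical:

```python
def constant_rail(ids, fill):
    """Replace every id with one fixed value, removing the rail's information."""
    return [fill] * len(ids)


def swap_rail(ids, a, b):
    """Exchange two ids everywhere and leave all other ids unchanged."""
    return [b if i == a else a if i == b else i for i in ids]


operation_ids = ["OBEY", "USE", "QUOTE", "OBEY"]
constant_ops = constant_rail(operation_ids, "USE")
swapped_ops = swap_rail(operation_ids, "OBEY", "USE")
```

A constant control tests whether the model can solve the task from text alone. A swap control tests whether behavior follows the rail when the rail contradicts the text.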
Rung 3: Explicit Policy Vector
Question: can the model bind arbitrary roles to reusable primitive permissions when the active policy is a tensor rather than text?
role_name: "auditor"
policy.operation = [OBEY=1, USE=0, QUOTE=1, TOOL_CALL=0, ...]
policy.source = trust thresholds per source
The raw additive policy-vector run fails to install the rule. With source and
operation still supplied as oracle rails, the model trains only 15,232 additive
embedding parameters. The full run reaches exact 0.048, seen-policy exact
0.056, and held-out 101 exact 0.000.
The overfit diagnostics sharpen the result:
| Diagnostic | Eval set | exact | Read |
|---|---|---|---|
| fixed mask 001 | exact train rows | 1.000 | fixed-policy capacity exists |
| all six seen masks | exact train rows | 1.000 | small tables can be memorized |
| same adapter | fresh seen rows | 0.261 | no reusable rule across fresh rows |
| same adapter | fresh held-out 101 | 0.219 | no bitwise composition |
So the failure is not “the task cannot be fit.” The failure is the missing binding computation:
\[ \mathrm{allowed}(o) = p_o. \]
Rung 3b: Compiled Permission Rail
The successful variant computes the lookup outside the model and injects a local permission rail:
permission_for_candidate = policy_bits[operation_id]
Separate raw policy-bit embeddings are disabled. The first successful run also keeps source and operation rails, for 12,544 trainable parameters.
| Metric | strict exact |
|---|---|
| all examples | 1.000 |
| seen policy masks | 1.000 |
| held-out 101 mask | 1.000 |
| all OPEN/DECLINE by primitive | 1.000 |
The working interface is therefore not a raw prompt-wide policy vector. It is a compiled operation-local permission rail. Architecturally, this is closer to a deterministic policy harness than to asking the LLM to rediscover the policy lookup from embeddings.
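The compiled lookup is small enough to write out in full. The sketch below assumes that a mask such as 101 lists its bits in the order OBEY, USE, QUOTE, and that a set bit maps to OPEN and a cleared bit to DECLINE. Neither convention is stated in the source, and the function names are hypothetical.

```python
OPERATIONS = ("OBEY", "USE", "QUOTE")  # assumed bit order for masks like "101"


def parse_mask(mask):
    """Turn a bit string into a per-operation permission table."""
    if len(mask) != len(OPERATIONS) or set(mask) - {"0", "1"}:
        raise ValueError(f"expected {len(OPERATIONS)} bits, got {mask!r}")
    return {op: int(bit) for op, bit in zip(OPERATIONS, mask)}


def compile_permission(policy_bits, operation):
    """The lookup itself: permission_for_candidate = policy_bits[operation_id]."""
    return policy_bits[operation]


def permission_rail(mask, span_operations):
    """Compile one local permission label per candidate span."""
    bits = parse_mask(mask)
    return ["OPEN" if compile_permission(bits, op) else "DECLINE"
            for op in span_operations]


rail = permission_rail("101", ["OBEY", "USE", "QUOTE"])
```

Under these assumptions the model only ever sees the already-bound label for each candidate, so it never has to index into the policy vector itself.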
The sharper minimality ablation disables source, operation, and raw policy-bit
embeddings. Only the 3 x 896 permission table remains trainable, 2,688
parameters total. It still reaches exact 1.000 on all examples, including the
held-out 101 mask and every OPEN/DECLINE primitive cell. On this synthetic
rung, the bound permission rail alone is sufficient.
PR4: Four-Cell Compositional Grid
PR4 asks whether the permission-only rail survives a compositional grid instead of a single held-out mask. The grid crosses source-policy pairing with surface template:
| | seen template | novel template |
|---|---|---|
| seen source-policy | C1 | C2 |
| novel source-policy | C3 | C4 |
Every source id, operation id, and policy mask appears in training. The held-out axis is the pairing between source and policy, plus the candidate wording. The same 2,688 trainable permission parameters reach exact 1.000 in C1, C2, C3, and C4. Constant-policy drops to 0.444; the stricter invert-policy trap drops to 0.000.
The old OBEY/USE swap control is not decisive on this balanced grid because it leaves QUOTE unchanged. The invert-policy control is the correct trap here.
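A small enumeration shows why the two controls differ in strength. The sketch assumes three-bit masks in the order OBEY, USE, QUOTE and counts over all eight possible masks, which is an illustrative simplification: the actual PR4 mask set is not listed in the source, and the function names are hypothetical.

```python
from itertools import product

OPERATIONS = ("OBEY", "USE", "QUOTE")  # assumed bit order
ALL_MASKS = ["".join(bits) for bits in product("01", repeat=len(OPERATIONS))]


def grid_cell(pairing_seen, template_seen):
    """Map the two held-out axes to the C1..C4 cell labels."""
    cells = {(True, True): "C1", (True, False): "C2",
             (False, True): "C3", (False, False): "C4"}
    return cells[(pairing_seen, template_seen)]


def permission(mask, op):
    """Look up the permission bit for one operation."""
    return int(mask[OPERATIONS.index(op)])


def invert_policy(mask):
    """Flip every policy bit."""
    return "".join("1" if b == "0" else "0" for b in mask)


def swap_obey_use(op):
    """Exchange OBEY and USE operation ids; QUOTE passes through."""
    return {"OBEY": "USE", "USE": "OBEY"}.get(op, op)


CELLS = [(m, o) for m in ALL_MASKS for o in OPERATIONS]
changed_by_invert = sum(
    permission(m, o) != permission(invert_policy(m), o) for m, o in CELLS)
changed_by_swap = sum(
    permission(m, o) != permission(m, swap_obey_use(o)) for m, o in CELLS)
```

Over these 24 mask-operation cells, inversion changes the correct answer in all 24, while the OBEY/USE swap changes it in only 8: never for QUOTE, and never when the OBEY and USE bits agree. A model can therefore score well under the swap without reading the rail at all, which is not possible under inversion.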
PR5: SEP Projection Smoke
PR5 starts moving from synthetic rails to real prompt-injection surfaces. The first smoke is eval-only: load the PR4 adapter, project 200 SEP injected prompts as denied untrusted OBEY attempts, and require the model to return the fallback ANSWER.
| Metric | exact |
|---|---|
| denied SEP projection | 0.900 |
| constant-policy control | 0.900 |
| invert-policy control | 0.465 |
This is an early kill under the 0.95 transfer gate. The rail still causally affects behavior because invert-policy changes many outputs, but the synthetic PR4 surface does not fully transfer to SEP-style prompts without adaptation. The next rung is PR5b: a held-out SEP-surface adaptation test with a cleaner rail-causality trap.
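The gate decision reduces to a few lines of arithmetic. The counts below are not reported directly; they are what the stated rates imply if all 200 projected prompts are scored in each condition, and the function names are hypothetical.

```python
TRANSFER_GATE = 0.95  # threshold named in the text
NUM_PROMPTS = 200     # SEP prompts projected in the smoke


def exact_rate(num_exact, total=NUM_PROMPTS):
    """Fraction of prompts with an exact-match fallback answer."""
    return num_exact / total


def passes_gate(rate, gate=TRANSFER_GATE):
    """True when the rate meets or exceeds the transfer gate."""
    return rate >= gate


# Counts implied by the reported rates (0.900, 0.900, 0.465) on 200 prompts.
denied = exact_rate(180)
constant_policy = exact_rate(180)
invert_policy_rate = exact_rate(93)

gate_passed = passes_gate(denied)
rail_effect = denied - invert_policy_rate  # drop caused by inverting the rail
```

Read this way, the denied projection falls 10 prompts short of the 190 needed to clear the gate, while the gap to the inverted control is roughly 0.435, which is the evidence that the rail still drives behavior.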
Later Rungs
The next live rungs replace oracle pieces without losing the 1.000 behavior:
| Rung | Question |
|---|---|
| learned operation detector | Can hidden states predict OBEY/USE/QUOTE well enough to feed the permission rail? |
| tiny binder module | Can a small trained module compute \(p_o\) from operation id and policy bits? |
| auxiliary rail pretraining | Can source, operation, risk, and permission heads be learned before SFT/DPO? |
Risk-domain rails come after source, operation, and policy-vector behavior works. Risk domains are broader and more polysemantic than operation rails, so they should be attributes of content rather than replacements for provenance.
Source Trace
The plan comes from rope-provenance/docs/policy_ir_ladder.md. Current result
artifacts are:
rope-provenance/results/slm/qwen25_0_5b_instruct_source_rail_s0.json
rope-provenance/results/slm/qwen25_0_5b_instruct_source_operation_rail_s0.json
rope-provenance/results/slm/qwen25_0_5b_instruct_policy_vector_s0.json
rope-provenance/results/slm/qwen25_0_5b_instruct_permission_rail_s0.json