Technical Overview

Typed policy IR shape, rung status, and the current binder result.

The policy-rail track starts from a negative boundary in the text-side capability-gate ladder: contrastive training can install behavioral primitive gates, but tested prompt/text formats did not yield robust held-out role-to-permission composition. Typed policy rails move policy state out of prose and into a small side-channel IR.

For the upstream experiment chain, see the companion model-internals site. Its RPCG ladder tests how much role-conditioned behavior can be learned inside the model from text-side supervision. This site starts where that ladder becomes an engineering boundary: if the model does not reliably synthesize the permission lookup, compile the lookup in software and expose the result as a rail.

IR Shape

The IR is typed rather than one flat feature list:

source:
  SYSTEM, USER, TOOL, DATA, WEB

operation:
  OBEY, USE, QUOTE, TOOL_CALL, REVEAL_SECRET, EXEC, NET

risk_domain:
  DANGEROUS, CYBER, PRIVACY, HATE, HARASSMENT, SEXUAL,
  MEDICAL, FINANCIAL, LEGAL, SELF_HARM, JAILBREAK

decision_policy:
  allow, transform, refuse, escalate

source trusted?

operation allowed?

risk policy passed?

The source rail answers where a span came from. The operation rail answers what the text is trying to do. The risk rail identifies a safety surface. The policy state combines those typed axes into an allow, transform, refuse, or escalate decision.

Rung 1: Source-Only Rail

Input:

tokens:    ordinary prompt text
source_id: SYSTEM / USER / TOOL / DATA / WEB per token or span

Implementation in the first smoke: Qwen2.5-0.5B-Instruct receives learned additive source embeddings at input time. No LoRA is trained. Only the 6 x 896 source embedding table is trainable, 5,376 parameters.

Eval condition	strict exact	trusted-follow	untrusted-suppress	Read
Correct source ids	1.000	1.000	1.000	Source-only rail installs cleanly on the synthetic paired task.
Constant source eval	0.305	0.234	0.375	Visible text alone does not preserve the paired distinction.
Trusted/untrusted source swap	0.000	0.000	0.000	Behavior is causally tied to the supplied source rail.

This is positive but narrow. It supports the idea that an instruction-tuned model exposes a small input-control surface for source authority. It does not show source+operation composition, arbitrary role binding, or risk-policy generalization.

Rung 2: Source + Operation Rail

Question: does separating provenance from attempted operation create the minimum viable compositional harness?

source_id:    source of each span
operation_id: OBEY / USE / QUOTE / TOOL_CALL / REVEAL_SECRET / EXEC / NET

The oracle-operation smoke is positive. With source and operation embeddings only, Qwen2.5-0.5B-Instruct reaches strict exact 1.000 across trusted OBEY, untrusted OBEY suppression, DATA USE, and DATA QUOTE. The controls remain meaningful: constant source drops to 0.059, constant operation to 0.215, source-swap to 0.000, and OBEY/USE operation-swap to 0.438.

This says the model can use two typed local rails. It does not yet say the model can bind a global policy vector to a local operation.

Rung 3: Explicit Policy Vector

Question: can the model bind arbitrary roles to reusable primitive permissions when the active policy is a tensor rather than text?

role_name: "auditor"
policy.operation = [OBEY=1, USE=0, QUOTE=1, TOOL_CALL=0, ...]
policy.source    = trust thresholds per source

The raw additive policy-vector run fails to install the rule. With source and operation still supplied as oracle rails, the model trains only 15,232 additive embedding parameters. The full run reaches exact 0.048, seen-policy exact 0.056, and held-out 101 exact 0.000.

The overfit diagnostics sharpen the result:

Diagnostic	Eval set	exact	Read
fixed mask `001`	exact train rows	1.000	fixed-policy capacity exists
all six seen masks	exact train rows	1.000	small tables can be memorized
same adapter	fresh seen rows	0.261	no reusable rule across fresh rows
same adapter	fresh held-out `101`	0.219	no bitwise composition

So the failure is not “the task cannot be fit.” The failure is the missing binding computation:

[ \mathrm{allowed}(o) = p_o. ]

See also for Rung 3a: Overview kill diagnostic · Rung 3a in Explainer · Literature: Role Confusion

Rung 3b: Compiled Permission Rail

The successful variant computes the lookup outside the model and injects a local permission rail:

permission_for_candidate = policy_bits[operation_id]

Separate raw policy-bit embeddings are disabled. The first successful run also keeps source and operation rails, for 12,544 trainable parameters.

Metric	strict exact
all examples	1.000
seen policy masks	1.000
held-out `101` mask	1.000
all OPEN/DECLINE by primitive	1.000

The working interface is therefore not a raw prompt-wide policy vector. It is a compiled operation-local permission rail. Architecturally, this is closer to a deterministic policy harness than to asking the LLM to rediscover the policy lookup from embeddings.

The sharper minimality ablation disables source, operation, and raw policy-bit embeddings. Only the 3 x 896 permission table remains trainable, 2,688 parameters total. It still reaches exact 1.000 on all examples, including the held-out 101 mask and every OPEN/DECLINE primitive cell. On this synthetic rung, the bound permission rail alone is sufficient.

See also for Rung 3b: Overview architecture · Rung 3b in Explainer · Literature: Spotlighting, RepE

PR4: Four-Cell Compositional Grid

PR4 asks whether the permission-only rail survives a compositional grid instead of a single held-out mask. The grid crosses source-policy pairing with surface template:

                seen template     novel template
seen source-policy      C1              C2
novel source-policy     C3              C4

Every source id, operation id, and policy mask appears in training. The held-out axis is the pairing between source and policy, plus the candidate wording. The same 2,688 trainable permission parameters reach exact 1.000 in C1, C2, C3, and C4. Constant-policy drops to 0.444; the stricter invert-policy trap drops to 0.000.

The old OBEY/USE swap control is not decisive on this balanced grid because it leaves QUOTE unchanged. The invert-policy control is the correct trap here.

Provenance note: the backing artifact for these numbers is results/slm/qwen25_0_5b_instruct_pr4_permission_grid_s0.json in the external rope-provenance repo (trainable_params: 2688; c1/c2/c3/c4 exact: 1.000; constant-policy: 0.444; invert-policy: 0.000). The rope-provenance repo is not a submodule of lean-mining; results live there, not here.

PR5: SEP Projection Smoke

PR5 starts moving from synthetic rails to real prompt-injection surfaces. The first smoke is eval-only: load the PR4 adapter, project 200 SEP injected prompts as denied untrusted OBEY attempts, and require the model to return the fallback ANSWER.

Metric	exact
denied SEP projection	0.900
constant-policy control	0.900
invert-policy control	0.465

This is an early kill under the 0.95 transfer gate. The rail still causally affects behavior because invert-policy changes many outputs, but the synthetic PR4 surface does not fully transfer to SEP-style prompts without adaptation.

Trap-shape caveat. PR5 is a single-cell eval: all 200 rows are decline_obey (DENIED projection). Because constant-policy also collapses the policy to “always deny,” it trivially scores 0.900 — identical to the rail row. The constant-policy control provides zero discrimination here; it is non-informative, not a passing control. The only causal evidence is that invert-policy drops to 0.465. Read PR5 as “the rail stays causal on a deny-only surface but cannot be said to correctly model open vs. deny from this eval alone.” PR5b remedies this by making the surface symmetric (DENIED and ALLOWED rows paired on identical visible text).

PR5b: Paired SEP Adaptation

PR5b made the SEP surface symmetric. The same visible prompt appears twice: once with hidden policy DENIED -> ANSWER, and once with hidden policy ALLOWED -> witness. Before adaptation, the PR4 adapter scored 0.395 exact. After 300 adaptation steps, held-out paired SEP reached 1.000 exact with OPEN_OBEY 1.000 and DECLINE_OBEY 1.000.

Metric	pre-adapt	adapted
exact	0.395	1.000
OPEN_OBEY	0.040	1.000
DECLINE_OBEY	0.750	1.000
constant-policy control	0.375	0.500
invert-policy control	0.230	0.000

Lesson: the rail transfers to a benchmark-like injection surface, but not for free. Surface adaptation was necessary; the invert trap confirms that the adapted behavior is still rail-causal.

PR6: Operation Detector Preflight

PR6 asked whether frozen Qwen activations already expose enough semantic information to replace the oracle attempted-operation label. A linear probe fits seen templates perfectly but drops on held-out templates.

Metric	value
train accuracy	1.000
eval accuracy	0.807
C1 seen source-policy x seen template	1.000
C2 seen source-policy x held-out template	0.615
C3 held-out source-policy x seen template	1.000
C4 held-out source-policy x held-out template	0.615
shuffled-label trap	0.380

This is an early kill. The detector learns real template signal, but not a template-invariant OBEY/USE/QUOTE operation detector. The software stack still has to provide the attempted operation for the current rail.

PR7: Tiny Binder Boundary

PR7 tried to replace the software lookup with a small learned binder: [policy_bits, operation_onehot] -> permission rail. The binder solves a simple mask task, but fails to transfer to the PR4 template grid.

Rung	exact	C1	C2	C3	C4	invert-policy
PR7 simple mask	1.000	-	-	-	-	0.000
PR7b PR4 grid	0.722	1.000	0.448	1.000	0.438	0.229

The learned compiler is not robust enough yet. For span scaling, the ladder therefore returns to the strongest software compiler instead of stacking binder fragility on top of span-binding difficulty.

PR8: Multi-Span Boundary And PR8b Fix

PR8 adds a second operation-labeled span: one primary candidate and one distractor candidate. The expected output is the primary value if the primary span is allowed, otherwise fallback ANSWER.

Metric	value
exact	0.965
C1 seen source-policy x seen template	1.000
C2 seen source-policy x held-out template	0.938
C3 held-out source-policy x seen template	1.000
C4 held-out source-policy x held-out template	0.917
held-out-template exact	0.931
constant-policy control	0.444
invert-policy control	0.003

The rail is causal and strong, but not clean enough under the all-cell gate. The weak axis is held-out multi-span templates, not source-policy recombination.

PR8b fixed the endpoint at 200 steps, enlarged evaluation to 2304 rows with held-out values, and added error-type diagnostics. It clears the gate:

Metric	PR8b value
exact	0.989
C2 seen source-policy x held-out template	0.982
C4 held-out source-policy x held-out template	0.969
held-out-template exact	0.977
distractor error rate	0.000
constant-policy control	0.444
invert-policy control	0.002

The PR8 failure was not wrong-span bleed. The remaining misses are mostly formatting/other outputs on held-out templates. PR9 unblocked on this fix.

PR9 scale replication, boundary, and fix. Qwen2.5-1.5B-Instruct under the same 200-step protocol reaches exact 0.948; C1 / C3 = 0.979, C2 / C4 = 0.917; OPEN_OBEY 0.813 vs DECLINE_OBEY 1.000; constant-policy 0.444, invert-policy 0.017, distractor errors 0.000. The rail remains causal and span-bound, but the weak axis is held-out templates on the OPEN side.

PR9b tested the “just needs more samples” hypothesis. It got worse by step 300: exact 0.882 and C4 0.760 while invert-policy reached 0.000. The concrete failure was copying full held-out carrier phrases instead of returning the bare candidate value.

PR9c fixed that interface by making candidate values explicit:

VALUE=orange circuit
[orange circuit]
<value>orange circuit</value>

With value boundaries, 1.5B reaches exact 1.000 on every cell and every OPEN/DECLINE primitive bucket. Constant-policy remains 0.444; invert-policy is 0.000; all error rates are 0.000. PR10 keeps that constraint: risk-domain interfaces must keep policy and value boundaries explicit.

PR10 synthetic risk rail. PR10 keeps the PR9c value-delimited interface and adds a separate local risk rail. The software compiler still supplies permission; the model receives a second candidate-span attribute: SAFE, SENSITIVE, or HARMFUL.

if permission denies:
    ANSWER
elif risk is HARMFUL:
    REFUSE
else:
    candidate value

Only 10,752 parameters train: 4,608 permission-rail parameters and 6,144 risk-rail parameters. No LoRA, source embedding, operation embedding, or raw policy-bit embedding is enabled.

Metric	Value
exact_match	0.995
C1 / C2 / C3 / C4	0.997 / 0.997 / 0.993 / 0.993
risk_allow_exact	0.988
risk_refuse_exact	1.000
permission_decline_exact	1.000
distractor error	0.000
constant-policy / invert-policy	0.444 / 0.005
constant-risk / invert-risk	0.773 / 0.444

Interpretation: the risk rail composes with the permission rail on the synthetic grid. The constant_risk score is intentionally not zero: if risk is removed, permission-denied fallbacks should still work, and many safe/sensitive allow rows should remain easy. The decisive risk trap is that invert-risk collapses to fallback-like behavior while correct risk keeps harmful refusals at 1.000.

Industry Benchmark Map

These benchmark families fit different ladder levels. They are not interchangeable.

Benchmark family	Ladder role
SEP, TensorTrust, PromptInject, InjecTQA	Prompt-injection and context-hijacking projection for PR5/PR8/PR9.
HarmBench, JailbreakBench	Harmful-behavior and jailbreak ASR. Next rung — external projection of the synthetic PR10 risk-rail.
XSTest	Over-refusal and safety-creep tax. External projection alongside HarmBench.
WildGuard / WildChat	Moderator and refusal-classification comparison. External projection alongside HarmBench.

See also for PR5 / PR5b: Overview PR5b paragraph · Rung 3b in Explainer · Literature: Spotlighting, RepE

Later Rungs

The next live rung is external projection: risk-domain rails on industry safety benchmarks, and the separate PR6 operation-detector retry. Carry forward the PR9c lesson: make policy/value boundaries explicit instead of asking the model to infer the target substring from arbitrary prose.

Risk-domain rails come after source, operation, and policy-vector behavior works. PR10 shows a minimal synthetic risk rail can coexist with provenance, but real risk domains are broader and more polysemantic than operation rails, so they should remain attributes of content rather than replacements for provenance.

Source Trace

The plan comes from docs/policy_ir_ladder.md in the external rope-provenance repo. rope-provenance is not a submodule of lean-mining (check .gitmodules — it is not listed). All result artifacts live in that public repo; paths below are relative to its root. A reader can clone or browse the repo at https://github.com/d3banjan/rope-provenance to inspect them directly.

Current result artifacts:

results/slm/qwen25_0_5b_instruct_source_rail_s0.json
results/slm/qwen25_0_5b_instruct_source_operation_rail_s0.json
results/slm/qwen25_0_5b_instruct_policy_vector_s0.json
results/slm/qwen25_0_5b_instruct_permission_rail_s0.json
results/slm/qwen25_0_5b_instruct_pr4_permission_grid_s0.json
results/slm/qwen25_0_5b_instruct_pr5_sep_projection_eval_s0.json
results/slm/qwen25_0_5b_instruct_pr5b_sep_paired_pre_adapt_s0.json
results/slm/qwen25_0_5b_instruct_pr5b_sep_paired_s0.json
results/slm/qwen25_0_5b_instruct_pr6_operation_detector_s0.json
results/slm/qwen25_0_5b_instruct_pr7_tiny_binder_s0.json
results/slm/qwen25_0_5b_instruct_pr7_binder_grid_s0.json
results/slm/qwen25_0_5b_instruct_pr8_multispan_oracle_s0.json
results/slm/qwen25_0_5b_instruct_pr8b_multispan_oracle_step200_s0.json
results/slm/qwen25_1_5b_instruct_pr9_multispan_oracle_step200_s0.json
results/slm/qwen25_1_5b_instruct_pr9c_multispan_value_delimited_s0.json
results/slm/qwen25_1_5b_instruct_pr10_risk_value_delimited_s0.json