Typed Policy Rails

A forward architecture track for making source, operation, risk, and policy state explicit.

Typed policy rails are the next step after the model-internals capability-gate work. The earlier text-side gates showed that small alignment updates can suppress or open primitive behaviors, but they kept hitting the same boundary: models fit local role patterns more easily than they compose unseen role/permission combinations.

This track changes the interface. Instead of asking a model to infer the whole policy state from prose, the software stack supplies a small typed side channel: where text came from, what operation is being attempted, what risk surface is involved, and what the active policy allows.

Visible text + Typed rails -> Policy-indexed decision

One-Screen Summary

Problem: Text-only role gates learned suppression, but did not reliably bind new roles to reusable permission parts.
Architecture move: Represent source, operation, risk, and policy as typed out-of-band state instead of prompt prose.
Current evidence: The compiled permission rail reaches 1.000 on the PR4 grid, PR5b SEP transfer (after adaptation), and PR9c 1.5B scale (with a value-delimited candidate surface), and clears the gate at 0.989 on PR8b multi-span. Raw policy bits do not compose. PR10 then adds a separate risk rail and clears the synthetic risk rung at 0.995.
Scope — what this is and isn't

Six bounds on the claims made on this site, surfaced up front so they're not buried in caveats further down.

  1. Not yet a jailbreak-resistance result. Benchmark transfer is to SEP-style indirect prompt injection (PR5b) and a synthetic multi-span retrieval corpus (PR8b / PR9c). PR10 adds a synthetic risk-domain rail (SAFE / SENSITIVE / HARMFUL) on top and clears at 0.995 with the correct trap shape, but the rail has not yet been evaluated on HarmBench / JailbreakBench / XSTest / WildGuard / TensorTrust / BIPIA.
  2. Operation detection is unsolved. Every positive rung here takes operation_id as an oracle input from the host. The learned detector rung (PR6) early-killed at 0.615 held-out. Production-readiness is bounded by this.
  3. Mechanism is described, not causally traced. We have not run activation patching, causal mediation, or path patching to localise the rail's effect to specific heads or layers. Mechanistic stories on this site are explicitly labelled hypotheses.
  4. Capability bound is Wikitext-only. No MMLU / HumanEval / IFEval / reasoning bench measured. Wikitext perplexity is necessary but not sufficient.
  5. Single architecture family. All rungs are Qwen2.5 (0.5B and 1.5B Instruct). No Llama / Mistral / Gemma replication yet.
  6. No Lean theorem yet for the forward gate. The output-mask soundness theorem is sketched (Attempt 5 in the Explainer) but not written.

The companion text-side RPCG track is separate work with different methods and mostly negative results; claims from that track are not transitively attributed to the policy-rails architecture.

Rails, Not More Prompt Formatting

The policy rail hypothesis is:

A model can learn compositional provenance control more reliably if policy state is represented as typed side-channel structure rather than text.

The model still reads messy language. The difference is that the final behavior should route through policy state supplied by the system:

span text -> hidden semantic detector -> attempted operation P
software  -> policy IR              -> allowed(P)

decision = OPEN iff allowed(P) and source/risk constraints pass
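
The routing rule above can be written as a few lines of host-side code. This is a minimal sketch, assuming the policy IR is a mapping from operation name to a boolean and that the source/risk checks collapse to a single flag; the names `decide`, `policy`, and `constraints_pass` are illustrative and do not come from the project code.

```python
from typing import Mapping


def decide(operation: str, policy: Mapping[str, bool], constraints_pass: bool) -> str:
    """Return OPEN only when the policy allows the operation and constraints pass."""
    # Assumption: an operation missing from the policy is treated as denied.
    allowed = policy.get(operation, False)
    return "OPEN" if allowed and constraints_pass else "DECLINE"


# Hypothetical policy for an untrusted span: may be used or quoted, not obeyed.
policy = {"OBEY": False, "USE": True, "QUOTE": True}
print(decide("USE", policy, constraints_pass=True))   # OPEN
print(decide("OBEY", policy, constraints_pass=True))  # DECLINE
```

The attempted operation is still an input here; in the document's terms it comes from a detector (or an oracle), while `allowed` comes from software.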

This separates three jobs that text-only training entangles:

| Job | Text-side gates | Typed policy rails |
| --- | --- | --- |
| Detect intent | Learned from text | Still learned from text |
| Read authority | Inferred from prose or tags | Supplied as typed source state |
| Apply policy | Blended into generation | Indexed by explicit policy state |

How It Works (Architecture)

A transformer has one input door (text) and one output door (next word). The residual stream is a long list of numbers per text piece that every layer reads and writes. Today, all safety information has to compete with content in that single stream.

Policy rails add a second input port for typed metadata, and optionally a deterministic output mask on decision tokens. The transformer floors themselves are not modified.

Baseline transformer

flowchart TB
  T["text pieces<br/><i>hello, world</i>"]
  W["word lookup table<br/>(token id → 896 numbers)"]
  F0["Floor 0"]
  Fdot["…"]
  FN["Floor N"]
  OP["output projection"]
  Y(("next word"))
  T --> W --> F0 --> Fdot --> FN --> OP --> Y
  classDef frozen fill:#f3f3f0,stroke:#64676d,color:#18191b;
  class W,F0,Fdot,FN,OP frozen;

One door in, one door out. The residual stream is a long list of numbers per token that every floor reads and writes.

Rung 1: source rail ✅ (done)

flowchart TB
  T["text pieces"]
  TAG["source tag<br/><i>USER, WEB, SYSTEM, …</i>"]
  W["word lookup<br/>(frozen)"]
  R["source lookup<br/>6 rows × 896<br/><b>trained · 5,376 params</b>"]
  SUM(("⊕"))
  F["Floor 0 … Floor N<br/>(frozen)"]
  OP["output projection<br/>(frozen)"]
  Y(("next word"))
  T --> W --> SUM
  TAG --> R --> SUM
  SUM --> F --> OP --> Y
  classDef frozen fill:#f3f3f0,stroke:#64676d,color:#18191b;
  classDef trained fill:#e8f4f6,stroke:#28666e,stroke-width:2px,color:#18191b;
  class W,F,OP frozen;
  class R trained;

Only the source lookup table is trained. Strict exact 1.000 with the source rail supplied, 0.305 with source removed, 0.000 with source swapped.
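
As a rough illustration of the ⊕ node in the diagram, the sketch below adds one trained source row to each frozen token embedding. It uses plain Python lists and a two-dimensional toy hidden size for readability; the function name `add_source_rail` and the toy values are assumptions, and only the 6 × 896 shape is taken from the diagram.

```python
HIDDEN = 896       # hidden size quoted in the diagram
NUM_SOURCES = 6    # rows in the source lookup table


def add_source_rail(token_embeddings, source_ids, source_table):
    """Add the source row selected by each token's source id to that token's embedding."""
    return [
        [t + s for t, s in zip(token_vec, source_table[source_id])]
        for token_vec, source_id in zip(token_embeddings, source_ids)
    ]


# Toy example with hidden size 2: row 0 leaves the embedding unchanged.
toy_tokens = [[1.0, 2.0], [3.0, 4.0]]
toy_table = [[0.0, 0.0], [0.5, -0.5]]
print(add_source_rail(toy_tokens, [0, 1], toy_table))  # [[1.0, 2.0], [3.5, 3.5]]

# Trainable parameters: one row of HIDDEN numbers per source id.
print(NUM_SOURCES * HIDDEN)  # 5376
```

Everything downstream of the sum is the frozen model, which is why the trained surface stays at one small table.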

See also for Rung 1: Attempt 1 in Explainer · Technical detail · Literature: Instruction Hierarchy, ISE

Rung 2: source + operation rail ✅ (done)

Same architecture as Rung 1, with an additional oracle operation rail. Strict exact 1.000 across trusted OBEY, untrusted-OBEY suppression, DATA-USE, and DATA-QUOTE; controls remain meaningful (source-swap 0.000, operation ablation 0.215, OBEY/USE swap 0.438). 9,856 trained parameters.

See also for Rung 2: Attempt 2 in Explainer · Technical detail · Literature: Instruction/Data Separation

Rung 3a: raw policy-bit rail ❌ (failed)

This rung asked the model to learn the lookup permission = policy[operation] from raw policy bits supplied as an additive embedding. The architecture is the same as the source rail, with a policy_bits rail injected at the candidate span; the model has to compose the bits with the attempted operation.

It does not work. With 15,232 additive embedding parameters, training loss falls from 12.02 to 1.96 but strict exact stays at 0.048 (seen 0.056, held-out 0.000). A tiny adapter overfits a six-mask training table to 1.000 but collapses to 0.261 / 0.219 on fresh seen / held-out rows. Additive bits memorize a small lookup table; they do not learn the reusable binding rule.

The conclusion is structural, not a tuning artifact: a transformer’s frozen weights do not naturally compute the policy[operation] contraction from a uniform additive bias.

See also for Rung 3a: Attempt 3 in Explainer · Technical detail · Literature: Role Confusion

Rung 3b: compiled permission rail ✅ (done)

flowchart TB
  subgraph host["software-side compiler"]
    direction LR
    POLICY["policy bits<br/>[OBEY,USE,QUOTE]<br/>per role"]
    OP_ID["operation id<br/><i>OBEY / USE / QUOTE</i>"]
    LOOKUP{"permission<br/>= policy[operation]"}
    PERM["DEFAULT /<br/>DENIED /<br/>ALLOWED"]
  end
  T["text pieces"]
  W["word lookup<br/>(frozen)"]
  PE["permission lookup<br/>3 rows × 896<br/><b>trained · 2,688 params</b>"]
  SUM(("⊕<br/>at candidate<br/>span only"))
  F["floors → output<br/>(frozen)"]
  Y(("next word"))
  POLICY --> LOOKUP
  OP_ID --> LOOKUP
  LOOKUP --> PERM --> PE
  T --> W --> SUM
  PE --> SUM
  SUM --> F --> Y
  classDef frozen fill:#f3f3f0,stroke:#64676d,color:#18191b;
  classDef trained fill:#e8f4f6,stroke:#28666e,stroke-width:2px,color:#18191b;
  classDef compiler fill:#fff7e6,stroke:#9a6a00,stroke-width:2px,color:#18191b;
  class W,F frozen;
  class PE trained;
  class LOOKUP,PERM compiler;

Move the policy[operation] lookup out of the transformer and into the software stack. The host computes the single permission bit and exposes it through a 3-state rail (DEFAULT / DENIED / ALLOWED) at the candidate span. Strict exact reaches 1.000 on both seen policies and a held-out OBEY+QUOTE mask with only 2,688 trained parameters (three 896-dim rows).
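
A minimal sketch of the software-side compiler, assuming policy bits are stored in OBEY, USE, QUOTE order and that the rail id is DEFAULT everywhere outside the candidate span. The enum values, the half-open span convention, and the function names are illustrative assumptions, not the project's actual interface.

```python
from enum import IntEnum


class Permission(IntEnum):
    DEFAULT = 0  # assumed row order for the 3-row permission table
    DENIED = 1
    ALLOWED = 2


OPERATIONS = ("OBEY", "USE", "QUOTE")


def compile_permission(policy_bits, operation):
    """Host-side lookup: permission = policy[operation]."""
    bit = policy_bits[OPERATIONS.index(operation)]
    return Permission.ALLOWED if bit else Permission.DENIED


def rail_ids(num_tokens, span, permission):
    """Per-token rail ids: the compiled permission inside [start, end), DEFAULT elsewhere."""
    start, end = span
    return [permission if start <= i < end else Permission.DEFAULT for i in range(num_tokens)]


obey_quote_mask = (1, 0, 1)  # the OBEY+QUOTE mask mentioned above
perm = compile_permission(obey_quote_mask, "USE")
print(perm.name)                                      # DENIED
print([int(p) for p in rail_ids(6, (2, 4), perm)])    # [0, 0, 1, 1, 0, 0]
```

The model never sees the policy bits in this arrangement; it only sees which of three rows was added at the candidate span.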

PR4 stress-tested this rail on a four-cell seen/novel grid (template × source-policy pairing): every cell stays at 1.000, the constant-policy trap collapses to 0.444, and the invert-policy trap collapses to 0.000.

PR5b ported the rail to SEP-style real prompt-injection surfaces. The naive PR5 transfer scored 0.900 (below the 0.95 gate); after one round of paired SEP-surface adaptation, held-out paired SEP reaches 1.000 with OPEN_OBEY 1.000 / DECLINE_OBEY 1.000.

The compiled permission rail is the current working architecture.

See also for Rung 3b (and PR4 / PR5b): Attempt 4 in Explainer · PR4 detail · PR5 detail · Literature: Spotlighting, RepE

Forward proposal: compiled permission rail + provable gate

flowchart TB
  PERM["DEFAULT / DENIED / ALLOWED<br/>(from software compiler)"]
  PE["permission rail"]
  SUM(("⊕"))
  W["word lookup"]
  F["floors"]
  OP["output projection"]
  ALLOW["allowed_words<br/>(set rule)"]
  MASK["gate mask<br/>forbidden = −∞"]
  ARG["argmax"]
  Y(("chosen word<br/>∈ permitted set"))
  PERM --> PE --> SUM
  W --> SUM --> F --> OP -->|raw logits| MASK
  PERM --> ALLOW --> MASK
  MASK -->|masked logits| ARG --> Y
  classDef frozen fill:#f3f3f0,stroke:#64676d,color:#18191b;
  classDef trained fill:#e8f4f6,stroke:#28666e,stroke-width:2px,color:#18191b;
  classDef provable fill:#fff7e6,stroke:#9a6a00,stroke-width:2px,color:#18191b;
  class W,F,OP frozen;
  class PE trained;
  class MASK,ALLOW,ARG provable;

Not yet built. The compiled permission rail at the input still leaves the output’s safety claim trained-not-proved. Adding a deterministic output-time mask keyed on the same permission would close that gap: forbidden decision tokens get −∞ logits, argmax cannot return them, and a Lean soundness theorem can certify the mask without any claim about the floors.

This is an architectural extension, distinct from the project ladder’s “Rung 4” entry (auxiliary rail pretraining). Listed here as the next forward step for the soundness-side claim; experimental rungs PR6 (learned operation detector), PR7 (tiny binder), PR8 (long context), and PR9 (scale) are the in-flight empirical ladder.

See also for the forward gate proposal: Attempt 5 in Explainer · Phase 6 in token journey · Literature: where policy rails sit

Caveat — is the rail really a separate channel?

Partly. At input port and at output gate, the rail is structurally distinct from content. Between Floor 0 and Floor N the rail nudge and content live in the same residual stream, blended by attention and MLP operations. Stronger separation requires one of two design upgrades:

The current source-rail experiments use Option A (single injection at input). Decay across depth is one of the open empirical questions for scale-up.

Caveat — magnitude balance

The rail nudge has a magnitude. Too small, and the floors ignore it and alignment fails. Too large, and every token tagged with the same source looks identical to the floors regardless of content, and the model loses comprehension. The current source rail uses a Gaussian init with std 0.02, plus a zeroed DEFAULT row so an untagged input falls back to base-model behavior. The trained nudges settle to small magnitudes: measurable on the residual stream but not dominant.
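
The initialisation described above can be sketched with the standard library alone. This assumes DEFAULT is row 0 and uses a three-row, 896-wide table for illustration; `init_rail` and `l2_norm` are hypothetical helper names, and the printed norm describes the initial state only, not the trained nudges.

```python
import math
import random


def init_rail(rows, hidden, std=0.02, seed=0):
    """Gaussian-initialised rail table with a zeroed DEFAULT row (assumed to be row 0)."""
    rng = random.Random(seed)
    table = [[rng.gauss(0.0, std) for _ in range(hidden)] for _ in range(rows)]
    table[0] = [0.0] * hidden  # untagged tokens receive no nudge
    return table


def l2_norm(row):
    return math.sqrt(sum(x * x for x in row))


table = init_rail(rows=3, hidden=896)
print(l2_norm(table[0]))  # 0.0
print(l2_norm(table[1]))  # roughly 0.02 * sqrt(896), i.e. about 0.6
```

Comparing a row's norm with the norm of typical token embeddings is one simple way to check whether the nudge is in the "measurable but not dominant" range.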

Gate soundness theorem (formalisable in Lean)

Let \(\mathrm{tag} \in \mathcal{T}\) be the source tag at the decision position, \(\mathrm{perms}(\mathrm{tag}) \subseteq \mathcal{P}\) the permission set, and \(\mathrm{allowed}(\mathrm{perms}) \subseteq V\) the set of permitted vocabulary tokens. Let \(\ell : V \to \mathbb{R}\) be the raw logits produced by the model at that position.

Define the mask

\[ \mathrm{mask}(w \mid \mathrm{tag}) = \begin{cases} 0 & w \in \mathrm{allowed}(\mathrm{perms}(\mathrm{tag})) \\ -\infty & \text{otherwise} \end{cases} \]

The chosen token is

\[ \hat{w} = \arg\max_{w \in V}\bigl(\ell(w) + \mathrm{mask}(w \mid \mathrm{tag})\bigr). \]

Theorem (soundness). For every \(\mathrm{tag}\) with \(\mathrm{allowed}(\mathrm{perms}(\mathrm{tag})) \neq \emptyset\) and every logit map \(\ell : V \to \mathbb{R}\),

\[ \hat{w} \in \mathrm{allowed}(\mathrm{perms}(\mathrm{tag})). \]

Proof sketch. If \(w \notin \mathrm{allowed}(\mathrm{perms}(\mathrm{tag}))\), then \(\ell(w) + \mathrm{mask}(w \mid \mathrm{tag}) = -\infty\), which is strictly less than every finite value. Because the permitted set is nonempty and \(\ell\) is real-valued, at least one permitted token has a finite masked logit. The \(\arg\max\) over the finite vocabulary \(V\) is therefore achieved at some token whose masked logit is finite, and finite masked logits exist only for permitted tokens. Therefore \(\hat{w}\) is permitted. \(\square\)

The theorem does not depend on what the floors learned. It depends only on the mask definition, the order on the extended reals, and the algebra of \(\arg\max\). Usefulness — picking the right answer within the permitted set — remains an empirical property of training.
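
A small executable counterpart to the mask definition, written as a sketch rather than as the project's gate. It assumes logits are finite floats keyed by token string and that the permitted set shares at least one token with the vocabulary; `masked_argmax` is an illustrative name.

```python
import math


def masked_argmax(logits, allowed):
    """Pick the highest-logit token after forbidden tokens are set to -inf."""
    if not any(token in allowed for token in logits):
        # With an empty permitted set every masked logit is -inf and no guarantee holds.
        raise ValueError("allowed set must contain at least one vocabulary token")
    masked = {
        token: (value if token in allowed else -math.inf)
        for token, value in logits.items()
    }
    return max(masked, key=masked.get)


# The raw model prefers OPEN, but the mask only permits DECLINE or ANSWER.
raw_logits = {"OPEN": 9.0, "DECLINE": 1.0, "ANSWER": 0.5}
print(masked_argmax(raw_logits, {"DECLINE", "ANSWER"}))  # DECLINE
```

The result depends on the raw logits only for ranking within the permitted set, which is the sense in which soundness does not depend on what the floors learned.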

Current Rung Status

Rung 1: source-only ✅
Rung 2: source + operation ✅
Rung 3a: raw policy bits ❌
Rung 3b: compiled permission rail ✅
PR4: 4-cell grid ✅
PR5: SEP projection early kill
PR5b: SEP after adaptation ✅
PR6: learned operation detector ❌
PR7: tiny binder on simple mask ✅
PR7b: tiny binder on PR4 grid ❌
PR8: multi-span boundary ⚠ (initial)
PR8b: multi-span fix ✅
PR9 (1.5B): first scale boundary ⚠
PR9b extended steps: regression ❌
PR9c value-delimited surface ✅ (1.5B clears)
PR10: synthetic risk-domain rail ✅
Industry safety benchmarks (HarmBench / JailbreakBench / XSTest / WildGuard): next

The source-only smoke test trained only the 6 × 896 additive source embeddings in Qwen2.5-0.5B-Instruct, 5,376 trainable parameters in total. The model receives ordinary prompt text plus per-token source ids such as SYSTEM, USER, DATA, and WEB.

Correct source ids: 1.000 strict exact match, trusted-follow, and untrusted-suppress.
Constant source control: 0.305 strict exact match when source information is removed at eval.
Trusted/untrusted swap: 0.000 strict exact match when source ids are swapped at eval.

Interpretation: source labels help and the behavior is causally tied to the supplied source rail. This is still a smoke test. Source says where text came from, not which operation the text is attempting or which policy vector should apply.

The next rungs are now also run. A source+operation rail reaches strict exact 1.000 with 9,856 trainable parameters. But a raw policy-bit vector does not install a reusable binding rule: the full run lands at 0.048 exact, and tiny overfits memorize training rows without transferring to fresh rows. The working variant compiles the policy vector first:

permission_for_candidate = policy_bits[operation_id]

When that ALLOWED/DENIED permission rail is injected locally on candidate spans, strict exact reaches 1.000 on both seen policy masks and the held-out OBEY+QUOTE mask. The sharper minimality test disables source, operation, and raw policy-bit embeddings; the permission rail alone still reaches 1.000 with only 2,688 trainable parameters.

PR4: 4-cell grid ✅ (done)

The same rail is re-evaluated on a grid that crosses seen vs. novel templates with seen vs. novel source-policy pairings. Every cell holds at 1.000; the constant-policy trap collapses to 0.444; the invert-policy trap collapses to 0.000. The rail is causally responsible for the gate behavior, not a label shortcut.
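
To see why the trap scores are informative, consider a toy stand-in that follows the rail perfectly. The sketch below assumes a trap is scored by corrupting the supplied permission while keeping the gold answer from the true permission; that scoring convention, the helper names, and the five toy rows are assumptions for illustration, and the toy data does not reproduce the reported 0.444.

```python
def ideal_rail_model(permission):
    """Stand-in for a model that follows the rail exactly (not the real model)."""
    return "OPEN" if permission == "ALLOWED" else "DECLINE"


def trap_score(true_permissions, corrupt):
    """Exact match against gold answers when the supplied permission is corrupted."""
    hits = sum(
        ideal_rail_model(corrupt(p)) == ideal_rail_model(p)
        for p in true_permissions
    )
    return hits / len(true_permissions)


def identity(p):
    return p


def invert(p):
    return "DENIED" if p == "ALLOWED" else "ALLOWED"


def constant(p):
    return "DENIED"


rows = ["ALLOWED", "DENIED", "DENIED", "ALLOWED", "DENIED"]
print(trap_score(rows, identity))  # 1.0
print(trap_score(rows, invert))    # 0.0
print(trap_score(rows, constant))  # 0.6
```

Under these assumptions, a rail-following model scores zero on the invert trap and the base rate of the constant label on the constant trap, whereas a model that ignored the rail would score the same under all three conditions.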

See also for PR4: Technical PR4 · Attempt 4 in Explainer · Literature: BIPIA-style transfer

PR5 → PR5b: SEP-style benchmark projection ✅ (after adaptation)

Loading the PR4 adapter straight onto 200 SEP-style denied-OBEY rows gave 0.900 — below the 0.95 gate, so PR5 was an early kill. After one round of paired SEP-surface adaptation (same visible text twice; DENIED → ANSWER, ALLOWED → witness), held-out paired SEP reached 1.000 with OPEN_OBEY 1.000 and DECLINE_OBEY 1.000. Benchmark-surface transfer required adaptation; the rail’s causal behavior survives.

See also for PR5 / PR5b: Technical PR5 · Attempt 4 in Explainer · Literature: BIPIA, SEP transfer

PR6: learned operation detector ❌ (early-killed)

Early-killed at the preflight gate. A frozen-Qwen hidden-state linear probe over the operation-labeled candidate span fits seen PR4 templates at 1.000 but drops to 0.615 on held-out templates. The detector picks up real template signal (shuffled-label trap 0.380) but is not template-invariant enough to feed the permission rail. Today the concierge still needs an externally-supplied attempted-operation label.

See also for PR6: Blind Avenues · Literature: Role Confusion

PR7: tiny binder on simple lookup ✅ (done)

A 29,792-parameter MLP binder over [policy_bits, operation_onehot] rescues the raw policy-bit failure. On the original held-out policy-mask task it reaches exact 1.000 on seen masks and the held-out 101 mask, with constant-policy 0.429 and invert-policy 0.000. So a small architecturally-explicit learned compiler can compute the lookup that the additive-bias formulation could not — Rung 3a was a formulation problem, not a capacity ceiling.

See also for PR7: Technical PR7 · Blind Avenues · Literature: LoRA family

PR7b: tiny binder on PR4 grid ❌ (template fragility)

Re-running the same learned binder on the PR4 source-policy × template grid shows that it installs on seen templates (C1 / C3 = 1.000) but fails held-out templates at C2 = 0.448 and C4 = 0.438. The learned compiler is therefore template-fragile in the way the software compiler is not. The production path stays on the software-compiled rail; PR8 reuses the software compiler precisely because PR7b ruled out the learned binder as the strongest compiler.

See also for PR7b: Technical PR7b · Blind Avenues

PR8 / PR8b: multi-span oracle compiler ✅ (fixed-step pass)

PR7b ruled out the learned binder as the strongest compiler, so PR8 keeps the software-compiled permission rail and tests it on a harder corpus: one primary candidate plus a distractor span. The first 300-step run remained causal but missed held-out multi-span templates: exact 0.965, C2 = 0.938, C4 = 0.917, invert-policy 0.003.

PR8b fixed the endpoint at 200 steps, enlarged the held-out-value eval set, and added error diagnostics. It clears the gate: exact 0.989, C2 = 0.982, C4 = 0.969, distractor error 0.000, invert-policy 0.002. The PR8 failure was not wrong-span bleed. PR9 (scale replication) came next; it exposed a value-boundary issue that PR9c fixed.

See also for PR8: Technical PR8 · Blind Avenues · Literature: BIPIA, multi-span surfaces

See also for PR8b: Technical PR8b · Explainer Attempt 4 · Literature: instruction/data separation

PR9: scale replication on Qwen2.5-1.5B ✅ (value-boundary fix)

Repeat the PR8b fixed-step multi-span protocol on Qwen2.5-1.5B-Instruct. Only the local permission rail trains: 3 × 1536 = 4,608 parameters (no LoRA, no source / operation / policy-bit embeddings). The rail remains causal and span-bound, but does not clear the all-cell gate at the same 200-step budget.

| Metric | Qwen2.5-0.5B (PR8b) | Qwen2.5-1.5B (PR9) |
| --- | --- | --- |
| strict exact | 0.989 | 0.948 |
| C1 seen × seen | 1.000 | 0.979 |
| C2 seen × held-out template | 0.982 | 0.917 |
| C3 held-out source-policy × seen | 1.000 | 0.979 |
| C4 held-out × held-out | 0.969 | 0.917 |
| OPEN_OBEY | 1.000 | 0.813 |
| DECLINE_OBEY | 1.000 | 1.000 |
| constant-policy trap | 0.444 | 0.444 |
| invert-policy trap | 0.002 | 0.017 |
| distractor errors | 0.000 | 0.000 |
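
The all-cell gate can be checked mechanically from the four cell scores in the table. The sketch below assumes the gate is 0.95 applied to every cell; the 0.95 figure is quoted elsewhere in this document for PR5, and applying it per cell here is an assumption that happens to agree with the reported pass and fail outcomes.

```python
GATE = 0.95  # assumed per-cell threshold


def failing_cells(cells, gate=GATE):
    """Return the names of cells that fall below the gate, in sorted order."""
    return sorted(name for name, score in cells.items() if score < gate)


def clears_all_cell_gate(cells, gate=GATE):
    return not failing_cells(cells, gate)


pr8b = {"C1": 1.000, "C2": 0.982, "C3": 1.000, "C4": 0.969}
pr9 = {"C1": 0.979, "C2": 0.917, "C3": 0.979, "C4": 0.917}

print(clears_all_cell_gate(pr8b))  # True
print(clears_all_cell_gate(pr9))   # False
print(failing_cells(pr9))          # ['C2', 'C4']
```

Both failing PR9 cells are the held-out-template cells, which matches the weak axis described for the 1.5B run.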

The weak axis is held-out templates on the OPEN side. A sample-matched PR9b run ruled out “just train longer”: by step 300, C4 fell to 0.760 while the invert trap reached 0.000. The concrete error was copying the full carrier phrase instead of returning the bare value — sample exposure pushed the 1.5B model into a template-format overfit, not a real-rail failure.

PR9c fixes this by making value boundaries explicit (VALUE=..., [...], <value>...</value>). On the same 1.5B model and the same rail, PR9c reaches exact 1.000 on all four cells and every OPEN/DECLINE primitive; constant-policy remains 0.444, invert-policy is 0.000, every error-type is 0.000. The 0.5B → 1.5B scale boundary was an output-extraction ambiguity, not a capacity ceiling. PR10 risk-domain rails are now tested, preserving explicit policy/value boundaries in the candidate-span format.
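
To illustrate why explicit delimiters remove the extraction ambiguity, here is a hypothetical parser for the three surface forms named above. The regular expressions, the whitespace-terminated reading of `VALUE=...`, and the carrier sentences are assumptions for illustration; the project's actual candidate-span format may differ.

```python
import re

# One pattern per delimiter style; the first match wins.
PATTERNS = [
    re.compile(r"<value>(.*?)</value>"),
    re.compile(r"\[(.*?)\]"),
    re.compile(r"VALUE=(\S+)"),
]


def extract_value(text):
    """Return the bare value inside the first recognised delimiter, or None."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    return None


print(extract_value("Please report <value>blue-42</value> to the user"))  # blue-42
print(extract_value("Please report [blue-42] to the user"))               # blue-42
print(extract_value("Please report VALUE=blue-42 to the user"))           # blue-42
print(extract_value("Please report blue-42 to the user"))                 # None
```

Without a delimiter, nothing in the text marks where the carrier phrase ends and the value begins, which is the ambiguity PR9b ran into.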

See also for PR9 / PR9b / PR9c: Technical PR9 · Literature: BIPIA scale transfer · REPRODUCE.md commands · postmortem: rope-provenance/docs/pr9_postmortem.md

PR10: synthetic risk-domain rail ✅

PR10 adds a separate local risk rail to the PR9c value-delimited permission stack. Permission still decides whether the attempted operation is allowed. Risk then decides whether an otherwise-allowed candidate should be refused.

permission DENIED -> ANSWER
permission ALLOWED + SAFE/SENSITIVE -> candidate value
permission ALLOWED + HARMFUL -> REFUSE
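
The three rules above translate directly into a small decision function. This is a sketch of the target behavior the rails are trained toward, not of the model itself; the function name and the choice to raise on unlisted combinations (for example a DEFAULT permission) are assumptions.

```python
def pr10_decision(permission, risk, candidate_value):
    """Target output for a candidate span given its permission and risk labels."""
    if permission == "DENIED":
        return "ANSWER"  # fall back to answering without using the candidate
    if permission == "ALLOWED":
        if risk == "HARMFUL":
            return "REFUSE"
        if risk in ("SAFE", "SENSITIVE"):
            return candidate_value
    # Combinations not listed in the rules are left undefined in this sketch.
    raise ValueError(f"unhandled combination: {permission!r}, {risk!r}")


print(pr10_decision("DENIED", "HARMFUL", "blue-42"))     # ANSWER
print(pr10_decision("ALLOWED", "SENSITIVE", "blue-42"))  # blue-42
print(pr10_decision("ALLOWED", "HARMFUL", "blue-42"))    # REFUSE
```

Permission is checked first, so a denied candidate never reaches the risk check.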

The trainable surface is 10,752 parameters: 4,608 for permission and 6,144 for risk. Correct eval reaches exact 0.995 over 864 rows; C1/C2/C3/C4 are all at least 0.993; risk-allow is 0.988; risk-refuse is 1.000; permission-decline is 1.000; distractor error is 0.000.

The trap shape matters. Invert-policy collapses to 0.005. Invert-risk collapses to 0.444. Constant-risk stays at 0.773 because removing risk should preserve permission-denied fallback cases and many benign allow cases, while breaking harmful refusals.

This is a synthetic risk result, not a HarmBench / JailbreakBench / XSTest / WildGuard result. It says the typed rail can carry a minimal risk attribute without corrupting permission or span routing.

See also for PR10: Technical PR10 · REPRODUCE.md commands · postmortem: rope-provenance/docs/pr10_postmortem.md

Blind Avenues

The honest gap inventory. These are rungs the microsite’s positive claims do not cover. The same inventory is mirrored on the Explainer’s Blind Avenues section.

| Open question | Status | Stake |
| --- | --- | --- |
| Learned operation detector (PR6) | ❌ early kill | Production: today, the attempted operation is still supplied by oracle. |
| PR6-RL detector retry | future | Trap-pair reward only for the detector/concierge layer; the lookup itself stays code. |
| Tiny binder on simple lookup (PR7) | ✅ positive | A learned compiler can compute policy[operation] when architecturally explicit. Rescues the Rung 3a additive-bias failure. |
| Tiny binder on PR4 grid (PR7b) | ❌ template-fragile | Learned binder installs on seen templates only; software-compiled rail remains the strongest path for PR8+. |
| Multi-span boundary (PR8) | ⚠ boundary | First 300-step run was causal but weak on held-out templates. |
| Multi-span fix (PR8b) | ✅ clears gate | Fixed 200-step protocol reaches exact 0.989, C4 0.969, distractor error 0.000. |
| Scale replication (PR9, 1.5B Qwen) | ⚠ first boundary | Direct PR8b surface scored 0.948; causal rail, no distractor bleed, but held-out OPEN formatting failed. |
| Value-boundary scale fix (PR9c) | ✅ clears gate | Explicit value boundaries give 1.000 on all cells and traps collapse cleanly. |
| Risk-domain rails (PR10) | ✅ synthetic pass | Adds SAFE/SENSITIVE/HARMFUL as a separate local rail; correct 0.995, invert-risk 0.444, distractor error 0.000. |
| Provable output mask (Attempt 5 in Explainer) | not built; conceptual | Soundness moves from trained to provable. No Lean theorem written yet for this rail. |
| Host-side tagging discipline | engineering, not yet hardened | The whole story collapses if a chunk is mislabeled at the boundary. |

What Would Change My Mind

Six concrete outcomes that would weaken or refute the architecture’s core claim, in order of how decisive they would be.

  1. Adversarial override at 0.5B or 1.5B. A reproducible class of natural-language inputs that, when processed by the frozen model, reliably produces OPEN[P] outputs at the decision token when the compiler injected DENIED — at >10% rate on a structured adversarial eval — would weaken the robustness claim. Concrete enough to test on the public adapter; treated as co-author-worthy if reproduced. See PR8b / PR9c.
  2. Non-Qwen architecture replication failure. If the same protocol on Llama 3, Mistral, or Gemma at the 0.5B–2B range fails to clear 0.95 on the PR4 grid, the mechanism is Qwen-specific in a way that limits safety relevance. The PR9c value-boundary dependence is already a signal that text-surface details matter; cross-family transfer is the next falsifier.
  3. Capability tax at standard benchmarks. If the 2,688-parameter rail degrades MMLU, HumanEval, or IFEval by >2 percentage points relative to the base model on a like-for-like comparison, the capability-tax-free framing does not hold and the architecture has to absorb that cost in its claim scope.
  4. Causal-mediation analysis localising the rail to a non-attention pathway. If careful patching reveals the rail’s effect is mediated by a single late-layer MLP feature rather than by changes in attention-pattern routing over the candidate span, the mechanistic hypotheses on the Explainer need rewriting and the rail’s transferability story tightens.
  5. PR6 stays below 0.85 after a methodological retry. Specifically, if a learned operation detector trained with diverse template generation and explicit negative examples (or RL on trap-pair rewards) cannot beat 0.85 held-out template accuracy, the oracle-operation_id dependency moves from “transitional gap” to “structural blocker,” and any deployment-readiness framing has to retract. See PR6.
  6. PR8b / PR9c multi-seed fragility. All current rungs are seed-0. A multi-seed replication is on the immediate to-do list; if more than one seed in five fails the all-cell gate at the published step budget, the optimization is more fragile than reported.

These map 1:1 to the Blind Avenues table above: each falsifier corresponds to one open question, and a positive result on any falsifier (i.e., the bad outcome) collapses the matching row in the table.

The next compositional grid also passes. PR4 crosses seen versus held-out source-policy pairings with seen versus held-out candidate templates. The same 2,688-parameter permission-only rail reaches 1.000 in all four cells, and a strict invert-policy trap collapses to 0.000. That is the cleanest constructive result so far: the model can robustly consume a compiled local permission rail.

The current boundary is no longer SEP projection, short multi-span, 1.5B scale under explicit value boundaries, or synthetic risk routing. PR5b showed SEP adaptation, PR8b showed two-span routing, PR9c showed the 1.5B model works when the value boundary is part of the interface, and PR10 showed that a separate risk rail can be added without corrupting the permission rail. The next rung is external benchmark projection and operation-detector retry, not more prompt formatting.

Relation To Model Internals

This is a sibling track to the model-internals microsite, not a replacement for it. Both tracks ask the same alignment-engineering question: can we make a model reliably distinguish what text says from what that text is allowed to do?

The difference is where the burden is placed.

| Track | Burden placed on | Main finding |
| --- | --- | --- |
| Model internals | The model learns role-to-permission structure from text-side training. | Small gates can be installed, but held-out role/permission combinations did not compose reliably. |
| Policy rails | The software stack supplies typed policy state, and the model learns to use the local rail. | Raw policy bits still failed, but a compiled local permission rail reached 1.000. |

So the model-internals track maps the boundary of what fine-tuning alone can make the model internalize. Policy rails are the engineering response: keep the useful primitive vocabulary, but compile policy decisions into a typed side channel that the model can consume locally.