Typed Policy Rails

Literature Survey

How typed policy rails connect to prior and concurrent work.

This page positions the typed-policy-rails program against the surrounding literature. Each entry is a short summary of the prior work plus a Related rungs line that links into the corresponding sections on the Overview, Explainer, and Technical pages. The bidirectional anchors land on stable rung ids.

The goal is honesty about lineage: nothing here is “first to think about provenance in language models”. The contribution is a particular combination — local-span, frozen-weights, software-compiled, deterministic gate — that other lines of work have not yet built together.

Representation Engineering and Activation Steering

Zou et al. 2023 (“Representation Engineering”) and the broader activation- steering literature show that you can contrast two datasets (for example honest vs dishonest statements), extract a direction in activation space that separates them, and add that direction back at inference time to bias behaviour.

Three properties of that line of work matter for positioning:

The typed-policy-rails track shares the underlying mechanism — small additive interventions in residual space — but moves the lookup out of the model (a software concierge), narrows the injection to the candidate span only, and proposes a downstream deterministic mask at the decision token to convert soundness from a learned property into a provable one.

Related rungs: Rung 1 source rail · Rung 3a kill · Rung 3b compiled rail · Attempt 5 forward gate

Instruction Hierarchy

Wallace et al. 2024 (“The Instruction Hierarchy”, OpenAI) trains a model to treat instructions from higher-priority sources as more authoritative than instructions appearing in lower-priority text. This is the closest training framing for source-aware behaviour and is what most production models currently rely on for resisting prompt injection.

Two limits relevant to this project:

Policy rails are not a replacement; they are an attempt to push the same intent into a typed side channel where the lookup is deterministic and the gate can be inspected independently of training.

Related rungs: Rung 1 · Rung 2 · Rung 3b compiled rail

Spotlighting

Hines et al. 2024 (“Spotlighting”) makes the explicit case for supplying control text and data text through separate channels rather than a single in-band prompt string. The architectural premise of policy rails is this same out-of-band-channel argument, instantiated as typed metadata rather than as a textual tag scheme.

Related rungs: Rails, not more prompt formatting · Where Does the Safety Claim Live?

Instruction/Data Separation Training

A cluster of recent work trains models to treat instructions appearing inside data spans as non-authoritative:

The design-space table below positions these against each other.

Design-space table

  Input-layer Per-layer
Additive ISE, source rail (this project, Rung 1) AIR
Rotational ASIDE RoPE-Provenance (kill)

Policy rails extend this design space along a third axis the prior work has not been combining: software-compiled local permission injected only at the candidate span, with the lookup permission = policy[operation] deterministically computed outside the model. That third axis is what makes the 2,688-parameter permission rail at Rung 3b reach 1.000 where the raw policy-bit rail at Rung 3a (a flat additive bias) cannot.

Related rungs: Rung 1 · Rung 3a · Rung 3b · Technical IR shape

Role Confusion

Ye, Cui, and Hadfield-Menell 2026 (“Role Confusion”) provides the mechanistic motivation for why naive text-side role labels do not control behaviour: role and authority are encoded in latent space, where stylistic features can dominate any formal interface label the developer supplies.

This result directly explains the Rung 3a kill. A raw policy-bit vector added as a uniform residual bias is a formal interface label; the model’s in-weights computation falls back on the stylistic cues it learned during pretraining, and the formal label is ignored. The compiled permission rail at Rung 3b avoids this by making the result of the lookup the only thing the model has to read.

Related rungs: Rung 3a kill · Attempt 3 in Explainer · PR6 detector kill

Indirect Prompt-Injection Benchmarks

Yi et al. 2023/2024 (“BIPIA”) supplies the indirect-prompt-injection benchmark setting: external content (web pages, tool outputs, retrieved documents) carries malicious instructions that models execute unless the source boundary is represented and trained for. Related benchmarks include SEP, TensorTrust, and PromptInject/InjecTQA.

These benchmarks are the natural transfer target for the policy-rails program. PR5b ported the rail to SEP-style surfaces with one adaptation round. PR8b passes a multi-span structure that mirrors retrieval and tool-output settings. PR9 (first scale on Qwen2.5-1.5B) was a boundary at 0.948; PR9b ruled out the “train longer” hypothesis by regressing under extended exposure; PR9c clears the 1.5B scale rung at 1.000 across all cells by making the candidate-value boundary explicit in the text surface. The 0.5B → 1.5B boundary was an output-extraction ambiguity, not a capacity ceiling. PR10 adds a separate synthetic risk-domain rail (SAFE / SENSITIVE / HARMFUL) on top of the value- delimited 1.5B stack and clears at exact 0.995 with proper trap shape (invert-policy 0.005, invert-risk 0.444, distractor errors 0.000). HarmBench / JailbreakBench / XSTest / WildGuard / TensorTrust / BIPIA remain future work as the external benchmark projection.

Related rungs: PR5 / PR5b · PR8 (initial) · PR8b (fix) · PR9 (boundary) · PR9b (regression) · PR9c (1.5B clears) · PR10 (synthetic risk rail)

Parameter-Efficient Tuning Substrate

The trained surface of every rung is a small additive table (typically a few thousand parameters) over a frozen base model. The parameter-efficient tuning literature is the relevant substrate:

Typed policy rails sit at the small end of this spectrum. The compiled permission rail at Rung 3b uses 2,688 trained parameters — three rows of 896 dims — which is substantially below standard LoRA configurations.

Related rungs: Rung 1 · Rung 3b · Technical training detail

Where Policy Rails Sit

The combination this project tries to build is:

Dimension Prior work pattern Typed policy rails
Where the lookup runs In the model’s in-weights computation. In deterministic host code outside the model.
Injection scope Global (every token, often every layer). Local — at the candidate span only.
Trained surface Full model fine-tune, or LoRA-scale adapters. A 2,688-parameter permission embedding table over a frozen base.
Soundness claim Learned behaviour; statistical. Trained today; intended to become a provable deterministic mask at Attempt 5.
Output enforcement Argmax over unconstrained logits. Argmax over logits with forbidden decision tokens masked to −∞ (forward proposal).
Benchmark transfer Direct (SEP, BIPIA, etc.). PR5b passes SEP with one adaptation round; PR8b passes multi-span; PR9c clears Qwen2.5-1.5B scale; PR10 synthetic risk-rail clears at 0.995; external HarmBench / JailbreakBench / XSTest / WildGuard projection is the next rung.

The novelty is not any individual ingredient. Each piece — out-of-band channel (Hines), source labelling (Wallace), substring instruction/data separation (StruQ / ISE / ASIDE / AIR), parameter-efficient tuning (LoRA) — already exists. The combination is the contribution: a software-compiled local permission rail injected at the candidate span, evaluated under trap-and-deepen discipline, with a provable output mask as the soundness end-state.

Related rungs: Architecture overview · Five Attempts · Technical IR shape

Source Trace

Notes here are condensed from rope-provenance/docs/literature.md and the project’s running session logs. Bibliographic detail and per-paper citation keys live there; this page surfaces the connections to rungs without duplicating the bibliography.