Literature Survey
How typed policy rails connect to prior and concurrent work.
This page positions the typed-policy-rails program against the surrounding literature. Each entry is a short summary of the prior work plus a Related rungs line that links into the corresponding sections on the Overview, Explainer, and Technical pages. The bidirectional anchors land on stable rung ids.
The goal is honesty about lineage: nothing here is “first to think about provenance in language models”. The contribution is a particular combination — local-span, frozen-weights, software-compiled, deterministic gate — that other lines of work have not yet built together.
Representation Engineering and Activation Steering
Zou et al. 2023 (“Representation Engineering”) and the broader activation-steering literature show that you can contrast two datasets (for example, honest vs. dishonest statements), extract a direction in activation space that separates them, and add that direction back at inference time to bias behaviour.
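The contrast-extract-add recipe can be sketched in a few lines. This is a minimal illustration that assumes a difference-of-means direction, which is one simple choice and not necessarily the extraction method used in the cited work. The function names and the toy 3-dimensional activations are hypothetical.

```python
from statistics import fmean

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]

def steering_direction(positive, negative):
    """Difference of class means: one simple way to get a contrast direction."""
    mu_pos, mu_neg = mean_vector(positive), mean_vector(negative)
    return [p - n for p, n in zip(mu_pos, mu_neg)]

def steer_globally(hidden_states, direction, alpha=1.0):
    """Add alpha * direction to every token position (global scope)."""
    return [[h + alpha * d for h, d in zip(tok, direction)]
            for tok in hidden_states]

# Toy activations (made-up numbers, not real model data).
honest = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
dishonest = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
direction = steering_direction(honest, dishonest)

tokens = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
steered = steer_globally(tokens, direction, alpha=0.5)
```

Every row of `tokens` is shifted by the same vector, which is the global-scope behaviour that steering of this kind exhibits.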
Three properties of that line of work matter for positioning:
- Global scope. The steering vector is added to every token (often at every layer). This is closer to perfume in the building’s air than to a badge at a specific door.
- Utility tax. Global addition shifts behaviour on unrelated prompts, not just the targeted axis, so perplexity and downstream capability can degrade.
- Statistical, not provable. The model is still doing the lookup internally; an adversarial prompt can slide past.
The typed-policy-rails track shares the underlying mechanism — small additive interventions in residual space — but moves the lookup out of the model (a software concierge), narrows the injection to the candidate span only, and proposes a downstream deterministic mask at the decision token to convert soundness from a learned property into a provable one.
Related rungs: Rung 1 source rail · Rung 3a kill · Rung 3b compiled rail · Attempt 5 forward gate
Instruction Hierarchy
Wallace et al. 2024 (“The Instruction Hierarchy”, OpenAI) trains a model to treat instructions from higher-priority sources as more authoritative than instructions appearing in lower-priority text. This is the closest training framing for source-aware behaviour and is what most production models currently rely on for resisting prompt injection.
Two limits relevant to this project:
- The hierarchy is coarse — message-level priority rather than arbitrary substring-level provenance.
- Enforcement is learned in-weights. The hierarchy lives inside the model’s trained behaviour, not as an architectural channel that can be inspected or proved correct.
Policy rails are not a replacement; they are an attempt to push the same intent into a typed side channel where the lookup is deterministic and the gate can be inspected independently of training.
Related rungs: Rung 1 · Rung 2 · Rung 3b compiled rail
Spotlighting
Hines et al. 2024 (“Spotlighting”) makes the explicit case for supplying control text and data text through separate channels rather than a single in-band prompt string. The architectural premise of policy rails is this same out-of-band-channel argument, instantiated as typed metadata rather than as a textual tag scheme.
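To make the distinction between typed metadata and a textual tag scheme concrete, here is a minimal sketch. The `Source` enum, the `Span` record, and the example text are hypothetical illustrations, not the project's actual IR. The point is only that provenance travels beside the text as structured data and the text itself is left unmodified.

```python
from dataclasses import dataclass
from enum import Enum

class Source(Enum):
    SYSTEM = "system"
    USER = "user"
    TOOL = "tool"

@dataclass(frozen=True)
class Span:
    start: int       # inclusive character offset into the text
    end: int         # exclusive character offset
    source: Source   # provenance carried out of band

def source_at(spans, index):
    """Return the source of the span covering a character index, or None."""
    for span in spans:
        if span.start <= index < span.end:
            return span.source
    return None

# The text carries no tags; the provenance lives in the side channel.
text = "Summarize this page. IGNORE PREVIOUS INSTRUCTIONS."
boundary = text.index("IGNORE")
spans = [
    Span(0, boundary, Source.USER),
    Span(boundary, len(text), Source.TOOL),
]
```

Because the labels are not part of the string, injected content cannot forge or strip them by imitating a delimiter.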
Related rungs: Rails, not more prompt formatting · Where Does the Safety Claim Live?
Instruction/Data Separation Training
A cluster of recent work trains models to treat instructions appearing inside data spans as non-authoritative:
- StruQ — Chen et al. 2024/2025. Splits prompt and data portions and fine-tunes the model to follow instructions only from the prompt portion. Closest training-data predecessor to the kind of paired-counterfactual corpora typed rails consume.
- ISE — Wu et al. 2024 (“Instructional Segment Embedding”). Input-layer additive segment-style signal. Same channel level as the source rail but with a flat segment-type instead of a typed source/operation IR.
- ASIDE — Zverev et al. 2026. Input-layer rotational variant of instruction/data separation; closest rotational lineage to the rope-provenance experiments that feed this program.
- AIR — Kariyappa & Suh 2025. Per-layer additive instruction-hierarchy injection. The per-layer cousin of the input-layer additive rails.
The design-space table below positions these against each other.
Design-space table
| | Input-layer | Per-layer |
|---|---|---|
| Additive | ISE, source rail (this project, Rung 1) | AIR |
| Rotational | ASIDE | RoPE-Provenance (kill) |
Policy rails extend this design space along a third axis that the prior work has not combined: a software-compiled local permission injected only at the candidate span, with the lookup permission = policy[operation] computed deterministically outside the model. That third axis is what lets the 2,688-parameter permission rail at Rung 3b reach 1.000 where the raw policy-bit rail at Rung 3a (a flat additive bias) cannot.
Related rungs: Rung 1 · Rung 3a · Rung 3b · Technical IR shape
Role Confusion
Ye, Cui, and Hadfield-Menell 2026 (“Role Confusion”) provides the mechanistic motivation for why naive text-side role labels do not control behaviour: role and authority are encoded in latent space, where stylistic features can dominate any formal interface label the developer supplies.
This result directly explains the Rung 3a kill. A raw policy-bit vector added as a uniform residual bias is a formal interface label; the model’s in-weights computation falls back on the stylistic cues it learned during pretraining, and the formal label is ignored. The compiled permission rail at Rung 3b avoids this by making the result of the lookup the only thing the model has to read.
Related rungs: Rung 3a kill · Attempt 3 in Explainer · PR6 detector kill
Indirect Prompt-Injection Benchmarks
Yi et al. 2023/2024 (“BIPIA”) supplies the indirect-prompt-injection benchmark setting: external content (web pages, tool outputs, retrieved documents) carries malicious instructions that models execute unless the source boundary is represented and trained for. Related benchmarks include SEP, TensorTrust, and PromptInject/InjecTQA.
These benchmarks are the natural transfer target for the policy-rails program. PR5b ported the rail to SEP-style surfaces with one adaptation round. PR8b passes a multi-span structure that mirrors retrieval and tool-output settings.
PR9, the first scale-up on Qwen2.5-1.5B, hit a boundary at 0.948. PR9b ruled out the “train longer” hypothesis by regressing under extended exposure. PR9c clears the 1.5B scale rung at 1.000 across all cells by making the candidate-value boundary explicit in the text surface. The 0.5B → 1.5B boundary was an output-extraction ambiguity, not a capacity ceiling.
PR10 adds a separate synthetic risk-domain rail (SAFE / SENSITIVE / HARMFUL) on top of the value-delimited 1.5B stack and clears at exact 0.995 with proper trap shape (invert-policy 0.005, invert-risk 0.444, distractor errors 0.000). HarmBench / JailbreakBench / XSTest / WildGuard / TensorTrust / BIPIA remain future work as the external benchmark projection.
Related rungs: PR5 / PR5b · PR8 (initial) · PR8b (fix) · PR9 (boundary) · PR9b (regression) · PR9c (1.5B clears) · PR10 (synthetic risk rail)
Parameter-Efficient Tuning Substrate
The trained surface of every rung is a small additive table (typically a few thousand parameters) over a frozen base model. The parameter-efficient tuning literature is the relevant substrate:
- LoRA — Hu et al. 2021. Freezes pretrained weights and inserts low-rank trainable updates. The right first adaptation method for cheap early-kill testing.
- QLoRA — Dettmers et al. 2023. Supports the same adapter-first strategy under 4-bit quantization. Useful at 7B+ if PR9 needs to fit on consumer hardware.
- DoRA — Liu et al. 2024. Separates weight magnitude and direction; reserved for later stabilization if a rung underfits under LoRA.
Typed policy rails sit at the small end of this spectrum. The compiled permission rail at Rung 3b uses 2,688 trained parameters — three rows of 896 dims — which is substantially below standard LoRA configurations.
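For a sense of scale, the arithmetic can be checked directly. The comparison point is an assumption chosen for illustration: a rank-8 LoRA update on a single square 896 x 896 projection. Real LoRA configurations usually adapt several matrices across many layers, so their totals are larger still.

```python
def rail_params(rows, hidden):
    """Parameters in an embedding table with `rows` rows of width `hidden`."""
    return rows * hidden

def lora_params(d_in, d_out, rank):
    """A rank-r LoRA update on one d_in x d_out weight adds r * (d_in + d_out)."""
    return rank * (d_in + d_out)

rail = rail_params(3, 896)                    # three rows of 896 dims
single_lora = lora_params(896, 896, rank=8)   # assumed comparison point
```

Here `rail` is 2,688 and `single_lora` is 14,336, so even one adapted matrix at this assumed rank carries more than five times the rail's parameters.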
Related rungs: Rung 1 · Rung 3b · Technical training detail
Where Policy Rails Sit
The combination this project tries to build is:
| Dimension | Prior work pattern | Typed policy rails |
|---|---|---|
| Where the lookup runs | In the model’s in-weights computation. | In deterministic host code outside the model. |
| Injection scope | Global (every token, often every layer). | Local — at the candidate span only. |
| Trained surface | Full model fine-tune, or LoRA-scale adapters. | A 2,688-parameter permission embedding table over a frozen base. |
| Soundness claim | Learned behaviour; statistical. | Trained today; intended to become a provable deterministic mask at Attempt 5. |
| Output enforcement | Argmax over unconstrained logits. | Argmax over logits with forbidden decision tokens masked to −∞ (forward proposal). |
| Benchmark transfer | Direct (SEP, BIPIA, etc.). | PR5b passes SEP with one adaptation round; PR8b passes multi-span; PR9c clears Qwen2.5-1.5B scale; PR10 synthetic risk-rail clears at 0.995; external HarmBench / JailbreakBench / XSTest / WildGuard projection is the next rung. |
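
The output-enforcement row can be illustrated with a small masked argmax. This is a sketch of the general masking technique, not the Attempt 5 gate itself, which the table describes as a forward proposal. The three-token decision vocabulary and the logit values are hypothetical.

```python
import math

def masked_argmax(logits, forbidden):
    """Argmax over logits with forbidden token ids set to -inf."""
    masked = [
        -math.inf if i in forbidden else x
        for i, x in enumerate(logits)
    ]
    if all(x == -math.inf for x in masked):
        raise ValueError("every decision token is forbidden")
    return max(range(len(masked)), key=masked.__getitem__)

# Toy decision vocabulary (hypothetical): 0 = REFUSE, 1 = EXECUTE, 2 = ASK.
logits = [0.1, 3.2, 1.5]
unconstrained = masked_argmax(logits, forbidden=set())
constrained = masked_argmax(logits, forbidden={1})
```

With no mask the argmax picks token 1; with token 1 forbidden it picks token 2, however large the forbidden logit is. The guarantee comes from the mask, not from what the model learned.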
The novelty is not any individual ingredient. Each piece — out-of-band channel (Hines), source labelling (Wallace), substring instruction/data separation (StruQ / ISE / ASIDE / AIR), parameter-efficient tuning (LoRA) — already exists. The combination is the contribution: a software-compiled local permission rail injected at the candidate span, evaluated under trap-and-deepen discipline, with a provable output mask as the soundness end-state.
Related rungs: Architecture overview · Five Attempts · Technical IR shape
Source Trace
Notes here are condensed from rope-provenance/docs/literature.md and the project’s running session logs. Bibliographic detail and per-paper citation keys live there; this page surfaces the connections to rungs without duplicating the bibliography.