Literature Survey

How typed policy rails connect to prior and concurrent work.

This page positions the typed-policy-rails program against the surrounding literature. Each entry is a short summary of the prior work plus a Related rungs line that links into the corresponding sections on the Overview, Explainer, and Technical pages. The bidirectional anchors land on stable rung ids.

The goal is honesty about lineage: nothing here is “first to think about provenance in language models”. The contribution is a particular combination — enforcement locus (residual-space permission bias injected at the candidate span, not the tool-call boundary), a deterministic software compiler that owns the lookup, extreme parameter economy (~2,688 params over a frozen base), and the source × operation factorization — that other lines of work have not yet assembled together.

Representation Engineering and Activation Steering

Zou et al. 2023 (“Representation Engineering”) and the broader activation-steering literature show that you can contrast two datasets (for example honest vs dishonest statements), extract a direction in activation space that separates them, and add that direction back at inference time to bias behaviour.

Three properties of that line of work matter for positioning:

Global scope. The steering vector is added to every token (often at every layer). This is closer to perfume in the building’s air than to a badge at a specific door.
Utility tax. Global addition shifts behaviour on unrelated prompts, not just the targeted axis, so perplexity and downstream capability can degrade.
Statistical, not provable. The model is still doing the lookup internally; an adversarial prompt can slide past.
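
The contrast-extract-add recipe summarised above can be sketched as follows. This is a minimal illustration using a difference-of-means contrast on synthetic activations, which is one common simplification and not necessarily the extraction used in the cited paper. The array shapes, the synthetic "honesty" axis, and the names steering_direction and steer_globally are assumptions made for the example.

```python
import numpy as np

def steering_direction(pos_acts, neg_acts):
    """Unit-norm difference of class means over (n_examples, d_model) activations."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def steer_globally(hidden, direction, alpha):
    """Add alpha * direction at every token position (global scope)."""
    return hidden + alpha * direction  # broadcasts over the sequence axis

rng = np.random.default_rng(0)
d_model = 16

# Synthetic stand-in for recorded activations: the two datasets differ
# along coordinate 3 only (a hypothetical "honesty" axis).
axis = np.zeros(d_model)
axis[3] = 1.0
honest = rng.normal(size=(200, d_model)) + 2.0 * axis
dishonest = rng.normal(size=(200, d_model)) - 2.0 * axis

direction = steering_direction(honest, dishonest)
hidden = rng.normal(size=(10, d_model))   # (seq_len, d_model)
steered = steer_globally(hidden, direction, alpha=4.0)
```

Every one of the ten positions receives the same shift, which is the global-scope property the list describes.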

The typed-policy-rails track shares the underlying mechanism — small additive interventions in residual space — but moves the lookup out of the model (a software concierge), narrows the injection to the candidate span only, and proposes a downstream deterministic mask at the decision token to convert soundness from a learned property into a provable one.

Instruction Hierarchy

Wallace et al. 2024 (“The Instruction Hierarchy”, OpenAI) trains a model to treat instructions from higher-priority sources as more authoritative than instructions appearing in lower-priority text. This is the closest training framing for source-aware behaviour and is what most production models currently rely on for resisting prompt injection.

Two limits relevant to this project:

The hierarchy is coarse — message-level priority rather than arbitrary substring-level provenance.
Enforcement is learned in-weights. The hierarchy lives inside the model’s trained behaviour, not as an architectural channel that can be inspected or proved correct.

Policy rails are not a replacement; they are an attempt to push the same intent into a typed side channel where the lookup is deterministic and the gate can be inspected independently of training.

Related rungs: Rung 1 · Rung 2 · Rung 3b compiled rail

Spotlighting

Hines et al. 2024 (“Spotlighting”) makes the explicit case for supplying control text and data text through separate channels rather than a single in-band prompt string. The architectural premise of policy rails is this same out-of-band-channel argument, instantiated as typed metadata rather than as a textual tag scheme.

Instruction/Data Separation Training

A cluster of recent work trains models to treat instructions appearing inside data spans as non-authoritative:

StruQ — Chen et al. 2024/2025. Splits prompt and data portions and fine-tunes the model to follow instructions only from the prompt portion. Closest training-data predecessor to the kind of paired-counterfactual corpora typed rails consume.
ISE — Wu et al. 2024 (“Instructional Segment Embedding”). Input-layer additive segment-style signal. Same channel level as the source rail but with a flat segment-type instead of a typed source/operation IR.
ASIDE — Zverev et al. 2026. Input-layer rotational variant of instruction/data separation; closest rotational lineage to the role-provenance experiments that feed this program.
AIR — Kariyappa & Suh 2025 (2505.18907, “Augmented Intermediate Representations”). Per-layer additive instruction-hierarchy injection — injects a learned representation at every transformer block, not just at input. Closest architecture-level predecessor to a per-layer policy rail. The key difference: AIR’s injection is a trained learned representation of the hierarchy level, whereas policy rails inject a compiled software-decided permission state (DEFAULT/DENIED/ALLOWED) at the candidate span only.

The design-space table below positions these against each other.

Design-space table

	Input-layer	Per-layer
Additive	ISE, source rail (this project, Rung 1)	AIR
Rotational	ASIDE	RoPE-Provenance (kill)

Policy rails extend this design space along a third axis that the prior work has not combined: software-compiled local permission injected only at the candidate span, with the lookup permission = policy[operation] deterministically computed outside the model. That third axis is what makes the 2,688-parameter permission rail at Rung 3b reach 1.000 where the raw policy-bit rail at Rung 3a (a flat additive bias) cannot.
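
A minimal sketch of that third axis, under stated assumptions: the lookup runs in host code, and the resulting permission row is added to the hidden states at the candidate span only. The 3 × 896 table shape and the DEFAULT/DENIED/ALLOWED states come from this page. The function names, the example policy, the span indices, and the rule that an operation missing from the policy maps to DEFAULT are hypothetical, and the random table stands in for trained rows.

```python
import numpy as np

DEFAULT, DENIED, ALLOWED = 0, 1, 2
D_MODEL = 896  # row width quoted on this page

def compile_permission(policy, operation):
    """Deterministic host-side lookup: permission = policy[operation]."""
    if operation not in policy:
        return DEFAULT  # assumption: unknown operations fall back to DEFAULT
    return ALLOWED if policy[operation] else DENIED

def inject_rail(hidden, table, permission, span):
    """Add the permission row to the candidate-span positions only."""
    start, end = span
    out = hidden.copy()
    out[start:end] += table[permission]
    return out

rng = np.random.default_rng(0)
table = rng.normal(scale=0.02, size=(3, D_MODEL))   # stand-in for trained rows
hidden = rng.normal(size=(12, D_MODEL))             # (seq_len, d_model)
policy = {"send_email": False, "summarize": True}   # hypothetical policy

perm = compile_permission(policy, "send_email")     # DENIED
railed = inject_rail(hidden, table, perm, span=(5, 9))
```

Positions outside the span are left untouched, and the model never has to compute policy[operation] itself; it only reads the compiled result.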

Related rungs: Rung 1 · Rung 3a · Rung 3b · Technical IR shape

Role Confusion

Ye, Cui, and Hadfield-Menell 2026 (“Role Confusion”) provides the mechanistic motivation for why naive text-side role labels do not control behaviour: role and authority are encoded in latent space, where stylistic features can dominate any formal interface label the developer supplies.

This result directly explains the Rung 3a kill. A raw policy-bit vector added as a uniform residual bias is a formal interface label; the model’s in-weights computation falls back on the stylistic cues it learned during pretraining, and the formal label is ignored. The compiled permission rail at Rung 3b avoids this by making the result of the lookup the only thing the model has to read.

Related rungs: Rung 3a kill · Rung 3a in Explainer · PR6 detector kill

Indirect Prompt-Injection Benchmarks

Yi et al. 2023/2024 (“BIPIA”) supplies the indirect-prompt-injection benchmark setting: external content (web pages, tool outputs, retrieved documents) carries malicious instructions that models execute unless the source boundary is represented and trained for. Related benchmarks include SEP, TensorTrust, and PromptInject/InjecTQA.

These benchmarks are the natural transfer target for the policy-rails program. PR5b ported the rail to SEP-style surfaces with one adaptation round. PR8b passes a multi-span structure that mirrors retrieval and tool-output settings. PR9 (first scale on Qwen2.5-1.5B) was a boundary at 0.948; PR9b ruled out the “train longer” hypothesis by regressing under extended exposure; PR9c clears the 1.5B scale rung at 1.000 across all cells by making the candidate-value boundary explicit in the text surface. The 0.5B → 1.5B boundary was an output-extraction ambiguity, not a capacity ceiling. PR10 adds a separate synthetic risk-domain rail (SAFE / SENSITIVE / HARMFUL) on top of the value-delimited 1.5B stack and clears at exact 0.995 with proper trap shape (invert-policy 0.005, invert-risk 0.444, distractor errors 0.000). HarmBench / JailbreakBench / XSTest / WildGuard / TensorTrust / BIPIA remain future work as the external benchmark projection.

Agent Security Frameworks

A cluster of recent work focuses on defending LLM agents and tool-using pipelines from prompt injection at the system architecture level rather than the model training level.

CaMeL (2025). Moves trust enforcement to a capability-aware execution runtime: an outer supervisor model decides which tool calls an inner model may make, so injected instructions in tool outputs cannot escalate privileges. Enforcement locus is the tool-call boundary — the supervisor intercepts at the API level.
Progent (2025). Associates least-privilege policy graphs with agent execution. Policy enforcement is again external to the model, expressed as structured permission contracts over tool-call sequences. Enforcement locus is the tool permission graph, not the residual stream.
Meta SecAlign — Zeng et al. 2025 (2507.02735). Fine-tunes alignment on instruction-following and prompt-injection resistance jointly. Enforcement is in-weights (trained behavior), not architectural.
AgentSecBench (2026). A structured benchmark for agentic prompt injection. Sets the bar for adversarial multi-step injection in realistic pipelines; this is the right external evaluation target for a mature policy-rails deployment, beyond the scoped single-turn SEP setting.

The policy-rails architecture is positioned differently from the three defence frameworks above (CaMeL, Progent, SecAlign). CaMeL and Progent enforce at the tool-call boundary, with an explicit external trust controller. That is structurally complementary: a policy-rail binder decides whether the model responds to a span’s content at all, upstream of whether the resulting tool call is permitted. SecAlign treats enforcement as a training-time property; policy rails keep the base model frozen and train only a small additive table so that the model consumes a compiled side-channel signal.
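
To make the difference in enforcement locus concrete, here is a generic sketch of a check that runs at the tool-call boundary. It is not the CaMeL or Progent mechanism; it is a plain provenance-keyed allowlist, and the ToolCall type, the source labels, and the tool names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    tool: str
    source: str  # provenance of the instruction that triggered the call

# Hypothetical least-privilege allowlist, keyed by instruction provenance.
ALLOWED_TOOLS = {
    "user": {"search", "send_email"},
    "tool_output": {"search"},
}

def permit(call):
    """Decide at the API layer whether a proposed tool call may run."""
    return call.tool in ALLOWED_TOOLS.get(call.source, set())

user_call = ToolCall(tool="send_email", source="user")
injected_call = ToolCall(tool="send_email", source="tool_output")
decisions = (permit(user_call), permit(injected_call))  # (True, False)
```

A check of this kind only sees the call the model has already decided to make. A span-level rail acts earlier, on whether the span's content influences the model at all, which is why the two loci can be layered.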

The scope claim should be read carefully: policy rails eliminate the injection category only within the scoped software setting (a Qwen 0.5B/1.5B model, synthetic and SEP surfaces, with the operation label oracle-supplied). Adaptive adversaries on real-world agentic pipelines, as measured by AgentDojo, LLMail-Inject (2506.09956), and AgentSecBench, set a substantially higher bar. Claiming prompt-injection elimination in that broader sense would be wrong; what the ladder currently supports is “a software-compiled permission signal reliably steers the model’s fallback behavior on the tested surfaces.”

Residual Fusion and Layer-Injection Variants

DRIP — Residual-fusion instruction protection, 2511.00447. Injects a protection signal into intermediate residual-stream layers rather than at input. The per-layer injection is closer to Option B in the rail architecture caveat (re-adding the rail nudge before each block). DRIP focuses on injection robustness across depth; this project currently uses single input-layer injection (Option A) and leaves per-layer injection as a known upgrade path for deeper models. DRIP’s empirical results provide a useful reference for the depth-decay question the policy-rails track has not yet measured.

Parameter-Efficient Tuning Substrate

The trained surface of every rung is a small additive table (typically a few thousand parameters) over a frozen base model. The parameter-efficient tuning literature is the relevant substrate:

LoRA — Hu et al. 2021. Freezes pretrained weights and inserts low-rank trainable updates. The right first adaptation method for cheap early-kill testing.
QLoRA — Dettmers et al. 2023. Supports the same adapter-first strategy under 4-bit quantization. Useful at 7B+ if PR9 needs to fit on consumer hardware.
DoRA — Liu et al. 2024. Separates weight magnitude and direction; reserved for later stabilization if a rung underfits under LoRA.

Typed policy rails sit at the small end of this spectrum. The compiled permission rail at Rung 3b uses 2,688 trained parameters — three rows of 896 dims — which is substantially below standard LoRA configurations.
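
The parameter count can be checked with a few lines of arithmetic. The rail figure (three rows of 896 dims) is from this page. The LoRA comparison point, rank 8 on a single 896 × 896 weight matrix, is an illustrative assumption and not a configuration this project reports.

```python
d_model = 896
rail_params = 3 * d_model  # three permission rows -> 2688

def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA pair: A is (rank, d_in), B is (d_out, rank)."""
    return rank * (d_in + d_out)

# Hypothetical comparison: rank-8 LoRA on one square 896 x 896 matrix.
one_matrix = lora_params(d_model, d_model, rank=8)  # 14336
ratio = one_matrix / rail_params                    # about 5.33
```

Under that assumption a single adapted matrix already carries more than five times the rail's parameters, and typical LoRA setups adapt several matrices per layer.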

Related rungs: Rung 1 · Rung 3b · Technical training detail

Where Policy Rails Sit

The combination this project tries to build is:

Dimension	Prior work pattern	Typed policy rails
Enforcement locus	Tool-call boundary (CaMeL, Progent) or in-weights training (SecAlign, IH, StruQ).	Residual stream at the candidate span — upstream of tool dispatch, downstream of raw text.
Where the lookup runs	In the model’s in-weights computation, or in an external supervisor at the API layer.	In deterministic host code outside the model, compiled into a 3-state signal (DEFAULT/DENIED/ALLOWED).
Injection scope	Global (every token, often every layer); or API-level (tool boundary only).	Local — at the candidate span only.
Trained surface	Full model fine-tune, or LoRA-scale adapters; or supervisor model trained separately.	A 2,688-parameter permission embedding table over a frozen base (3 rows × 896 dims).
Source × operation factorization	Typically flat hierarchy level or single policy dimension.	Typed (source, operation, risk) × permission state; software compiler owns the factored lookup.
Soundness claim	Learned behaviour; statistical.	Trained today; intended to become a provable deterministic mask under the forward gate proposal.
Output enforcement	Argmax over unconstrained logits.	Argmax over logits with forbidden decision tokens masked to −∞ (forward proposal).
Benchmark transfer	Direct (SEP, BIPIA, AgentDojo, etc.).	PR5b passes SEP with one adaptation round; PR8b passes multi-span; PR9c clears Qwen2.5-1.5B scale; PR10 synthetic risk-rail clears at 0.995; HarmBench / JailbreakBench / XSTest / WildGuard / AgentSecBench projections are the next rungs.
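
The output-enforcement row of the table can be illustrated with a short sketch of a masked argmax. Because this is a forward proposal, the code shows only the general technique: the function name, the toy logits, and the choice of forbidden token ids are assumptions.

```python
import numpy as np

def masked_argmax(logits, forbidden_ids):
    """Argmax over logits with forbidden decision tokens set to -inf."""
    masked = np.asarray(logits, dtype=float).copy()
    idx = np.array(sorted(forbidden_ids), dtype=int)
    masked[idx] = -np.inf  # a masked token can never win the argmax
    return int(np.argmax(masked))

logits = np.array([1.2, 3.5, 0.1, 2.9])  # toy decision-token logits

unconstrained = masked_argmax(logits, [])   # 1: the highest raw logit
constrained = masked_argmax(logits, [1])    # 3: best remaining token
```

The guarantee holds regardless of the logit values, as long as at least one token is left unmasked. That independence from learned behaviour is what would make the soundness claim deterministic.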

The novelty is not any individual ingredient. Each piece — out-of-band channel (Hines / Spotlighting), source labelling (Wallace / Instruction Hierarchy), substring instruction/data separation (StruQ / ISE / ASIDE / AIR), parameter-efficient tuning (LoRA), per-layer residual injection (DRIP / AIR), external supervisor enforcement (CaMeL / Progent) — already exists in some form. The combination is the contribution: a software-compiled local permission rail injected at the candidate span into the residual stream (not the tool boundary), over a frozen base model, evaluated under trap-and-deepen discipline, with a provable output mask as the soundness end-state.

Scope note: the current result is a systems-composition finding — a software-decided permission can steer a frozen model — in a controlled synthetic and SEP-style setting. It is not yet evidence that the model internally learned to bind policies, and it is not a claim of prompt-injection elimination against adaptive adversaries (AgentDojo, LLMail-Inject, AgentSecBench). HarmBench and JailbreakBench remain future work.

Related rungs: Architecture overview · Experiment ladder · Technical IR shape

Source Trace

Notes here are condensed from rope-provenance/docs/literature.md and the project’s running session logs. Bibliographic detail and per-paper citation keys live there; this page surfaces the connections to rungs without duplicating the bibliography.