Levers — the ladder, rung by rung

A lever is one candidate steering mechanism: a way to nudge the model, paired with a pre-registered experiment and a keep/kill decision rule. The Steering Ladder is the ordered set of levers below, climbed one rung at a time. New to the terms here? The explainer introduces each one from intuition first.

Calibration. Of the six rungs, two are measured (S1.0 REFINE, S2.0 KILL — both causal probes, not steering wins) and four are planned (pre-registered, not yet run). The deployable primitive (S2.1) has no results yet. Read the status column literally.

Strategy: task-agnostic substrate + lightweight task-conditioned overlay — a reusable, compute-once map of safe-to-nudge components, plus a thin per-prompt layer that picks the steering direction.

Hard masks (the causal probes) test whether selectivity exists at all. Soft router-logit biases are the intended deployable primitive. Routing readouts are the intended cheap task inference. The program is built on the overlay logic, not on per-domain calibration overhead — but only the causal-probe rungs have run so far.
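To make the difference between a hard mask and a soft router-logit bias concrete, here is a minimal sketch on a single token's router logits. It is an illustration under stated assumptions, not the program's implementation: the helper names (`top_k`, `hard_mask`, `soft_bias`), the four-expert toy logits, and the top-2 routing are all hypothetical. Only the `router_logits[task_experts] += ε` form and the ε sweep values come from this page.

```python
import math

def top_k(logits, k):
    """Indices of the k largest router logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def hard_mask(router_logits, masked_experts):
    """Hard mask: a masked expert gets logit -inf, so it can never be routed to."""
    masked = set(masked_experts)
    return [-math.inf if i in masked else x for i, x in enumerate(router_logits)]

def soft_bias(router_logits, task_experts, eps):
    """Soft bias: add eps to the chosen experts' logits; nothing is forbidden."""
    boosted = set(task_experts)
    return [x + eps if i in boosted else x for i, x in enumerate(router_logits)]

# Toy router logits for one token over four experts (made-up numbers).
router_logits = [2.0, 1.2, 0.5, 0.0]
TOP_K = 2  # assumed top-2 routing

baseline = top_k(router_logits, TOP_K)
masked = top_k(hard_mask(router_logits, [0]), TOP_K)

# Dose-response: how the selected experts change as eps grows for expert 2.
sweep = {
    eps: top_k(soft_bias(router_logits, [2], eps), TOP_K)
    for eps in (0.5, 1.0, 2.0, 4.0)
}

print("baseline:", baseline)
print("hard mask on expert 0:", masked)
for eps, chosen in sweep.items():
    print(f"eps={eps}: {chosen}")
```

In this toy case the hard mask removes expert 0 outright, while the soft bias leaves routing unchanged at ε = 0.5 and only pulls expert 2 into the top-2 from ε = 1.0 upward. That graded response is what a dose-response sweep is meant to expose.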

Lever map

ID	Lever	Mechanism	Status
S0	Universal safety map	head/expert redundancy from S1.0 byproduct + R5 weak-mask	partially shipped (S1.0 REFINE redundancy signal)
S1	Head-class soft steer	soft scaling on attention head outputs (not zero)	S1.0 REFINE ceiling; soft variant queued
S2.0	Expert hard-mask ceiling	skip masked experts entirely	KILL — closed 2026-05-04 (causal screen passed but Δlogit metric does not establish steering; commit d89f4d9)
S2.1	Soft router-logit bias (deployable primitive)	`router_logits[task_experts] += ε` with dose-response sweep	pre-registered, queued post-S2.0
S3.0	Routing-readout task detector	top-k expert distribution → cosine match vs task prior	pre-registered, cheap
S4.0	Adaptive runtime bias = S3.0 ⨂ S2.1	composes detector with bias selector	gated on S2.1 + S3.0 SHIP

Why this ordering

S0 universal safety is the substrate: it amortizes compute across all domains by identifying generally redundant heads/experts and the perturbation magnitudes that preserve coherence. S1.0 already produced the head-level redundancy map as a byproduct (a principled head-mask is less damaging than a random one, so the taxonomy identifies expendable heads).
S1 head soft-scale: S1.0 hard-zero was graded REFINE because principled zeroing behaved like a soft steer (alternative correct answers), but the locked metric could not separate it from random hard-zeroing (degeneracy collapse). S1.1 soft-scale tests whether the steer regime survives at non-zero scaling, with a coherence metric.
S2.0 hard-mask: KILL, closed 2026-05-04. The causal screen passed, but the Δlogit metric does not establish steering; commit d89f4d9 records the closure. S2.1 therefore needs its own coherence/task-quality metrics rather than inheriting a steering claim from the hard-mask ceiling.
S2.1 soft router-logit bias: the deployable steering primitive. It uses the same calibration as S2.0 but replaces the hard skip with router_logits[task_experts] += ε. Dose-response sweep over ε ∈ {0.5, 1.0, 2.0, 4.0} nats. The claim is the smallest ε that shows both selectivity and coherence.
S3.0 prompt-time routing readout: a calibration-free task detector. Per-prompt forward pass → top-k expert distribution → cosine match vs cached priors. Sub-second overhead. Enables a runtime adaptive policy without a per-domain calibration burden.
S4.0 composition: at runtime, a prompt arrives → S3.0 fingerprint → infer task → S2.1 selects the bias direction at small ε → soft-steered model output. Zero training, reusable substrate, lightweight task overlay.
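The S3.0 readout and the S4.0 composition described above can be sketched as follows. This is a hedged illustration, not the pre-registered code: the function names, the two task labels, the cached prior vectors, the task-to-expert mapping, and the toy routing trace are all hypothetical. Only the pipeline shape (top-k expert distribution, cosine match against cached priors, then a small-ε bias on the inferred task's experts) comes from this page.

```python
import math
from collections import Counter

def expert_distribution(topk_per_token, num_experts):
    """Normalized histogram of how often each expert appears in the top-k."""
    counts = Counter(e for token in topk_per_token for e in token)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_experts
    return [counts[i] / total for i in range(num_experts)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def detect_task(fingerprint, priors):
    """S3.0-style readout: pick the cached prior closest to the fingerprint."""
    scores = {task: cosine(fingerprint, prior) for task, prior in priors.items()}
    return max(scores, key=scores.get), scores

def choose_bias(task, task_experts, eps):
    """S4.0-style composition: the detected task selects which experts to bias."""
    return {"experts": task_experts[task], "eps": eps}

# Hypothetical cached priors and task-to-expert mapping (four experts).
PRIORS = {"code": [0.5, 0.4, 0.05, 0.05], "prose": [0.05, 0.05, 0.5, 0.4]}
TASK_EXPERTS = {"code": [0, 1], "prose": [2, 3]}

# Made-up top-2 routing decisions for a three-token prompt.
routed = [(0, 1), (0, 1), (0, 3)]

fingerprint = expert_distribution(routed, num_experts=4)
task, scores = detect_task(fingerprint, PRIORS)
bias = choose_bias(task, TASK_EXPERTS, eps=0.5)
print(task, bias)
```

The detector here needs only one forward pass worth of routing decisions and a handful of dot products, which is why the readout is expected to be cheap relative to the forward pass itself.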

Falsifiable predictions

S0: redundancy map generalizes across prompt types (S1.0 result confirms for heads)
S2.0 ceiling: under the humaneval strong mask, Δlogit ≥ 0.5 nat on humaneval AND Δlogit ≤ 0.3 nat on wikitext (selectivity)
S2.1 deployability: smallest ε with selectivity AND coherence preserved
S3.0 detector: classification accuracy ≥ 85% on held-out prompts at sub-second overhead
S4.0 composition: at the same ε, the predicted-task direction matches the oracle-task direction to within a 1.2× selectivity ratio
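The decision rules for S2.0, S2.1, and S3.0 above can be written as simple predicates. The sketch below is one possible reading, with assumptions flagged: the function names are hypothetical, Δlogit values are treated as magnitudes in nats, "sub-second" is read as latency under 1.0 s, and every number passed in is a made-up placeholder rather than a measured result.

```python
def s20_selective(delta_target, delta_control, min_target=0.5, max_control=0.3):
    """S2.0 rule: large effect on the target set, small effect on the control set.

    Assumes both deltas are magnitudes in nats (target: humaneval,
    control: wikitext).
    """
    return delta_target >= min_target and delta_control <= max_control

def s21_smallest_eps(results):
    """S2.1 rule: smallest eps whose run is both selective and coherent.

    `results` maps eps -> (selective, coherent); returns None if no eps passes.
    """
    passing = [eps for eps, (selective, coherent) in results.items()
               if selective and coherent]
    return min(passing) if passing else None

def s30_detector_ok(predicted, actual, latency_s, min_accuracy=0.85):
    """S3.0 rule: held-out accuracy at or above threshold, under one second."""
    if not actual or len(predicted) != len(actual):
        return False
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual) >= min_accuracy and latency_s < 1.0

# Placeholder inputs for illustration only; these are not experimental results.
example_sweep = {
    0.5: (False, True),
    1.0: (True, True),
    2.0: (True, True),
    4.0: (True, False),
}
print(s20_selective(0.62, 0.21))
print(s21_smallest_eps(example_sweep))
print(s30_detector_ok(["code"] * 9 + ["prose"], ["code"] * 10, latency_s=0.2))
```

Writing the thresholds as default arguments keeps the pre-registered numbers in one place, so a run either passes or fails against values fixed before the experiment.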