Levers — the ladder, rung by rung

A lever is one candidate steering mechanism: a way to nudge the model, paired with a pre-registered experiment and a keep/kill decision rule. The Steering Ladder is the ordered set of levers below, climbed one rung at a time. New to the terms here? The explainer introduces each one from intuition first.

Calibration. Of the six rungs, two are measured (S1.0 REFINE, S2.0 KILL — both causal probes, not steering wins) and four are planned (pre-registered, not yet run). The deployable primitive (S2.1) has no results yet. Read the status column literally.

Strategy: task-agnostic substrate + lightweight task-conditioned overlay — a reusable, compute-once map of safe-to-nudge components, plus a thin per-prompt layer that picks the steering direction.

Hard masks (the causal probes) test whether selectivity exists at all. Soft router-logit biases are the intended deployable primitive. Routing readouts are the intended cheap task inference. The program is built on the overlay logic, not on per-domain calibration overhead — but only the causal-probe rungs have run so far.
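To make the contrast between the two interventions concrete, here is a minimal sketch on a toy router. It assumes a softmax router over four hypothetical experts; the logit values and the helper names (`hard_mask`, `soft_bias`) are illustrative and are not taken from the project's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax; a -inf logit receives probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_mask(router_logits, masked_experts):
    """Causal-probe style: masked experts can never be selected."""
    return [-math.inf if i in masked_experts else x
            for i, x in enumerate(router_logits)]

def soft_bias(router_logits, task_experts, eps):
    """Soft steer: add eps (in nats) to the logits of the chosen experts."""
    return [x + eps if i in task_experts else x
            for i, x in enumerate(router_logits)]

# Toy values for 4 hypothetical experts (assumption, not measured data).
router_logits = [1.0, 0.5, 0.0, -0.5]

p_base = softmax(router_logits)
p_mask = softmax(hard_mask(router_logits, masked_experts={0}))
p_bias = softmax(soft_bias(router_logits, task_experts={2}, eps=1.0))

# Hard mask removes expert 0 outright; soft bias keeps every expert reachable
# and multiplies expert 2's odds against any unbiased expert by exp(eps).
odds_ratio = (p_bias[2] / p_bias[1]) / (p_base[2] / p_base[1])
```

In this toy setting the hard mask is all-or-nothing, while the soft bias has a continuous knob (ε), which is what a dose-response sweep would vary.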

Lever map

ID | Lever | Mechanism | Status
S0 | Universal safety map | head/expert redundancy from S1.0 byproduct + R5 weak-mask | partially shipped (S1.0 REFINE redundancy signal)
S1 | Head-class soft steer | soft scaling on attention head outputs (not zero) | S1.0 REFINE ceiling; soft variant queued
S2.0 | Expert hard-mask ceiling | skip masked experts entirely | KILL, closed 2026-05-04 (causal screen passed but Δlogit metric does not establish steering; commit d89f4d9)
S2.1 | Soft router-logit bias (deployable primitive) | router_logits[task_experts] += ε with dose-response sweep | pre-registered, queued post-S2.0
S3.0 | Routing-readout task detector | top-k expert distribution → cosine match vs task prior | pre-registered, cheap
S4.0 | Adaptive runtime bias = S3.0 ⨂ S2.1 | composes detector with bias selector | gated on S2.1 + S3.0 SHIP
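The S3.0 mechanism (top-k expert distribution, then cosine match against a task prior) can be sketched as follows. This is an illustration under stated assumptions: the fingerprint is taken to be a normalized histogram of top-k expert picks across tokens, and the task names, priors, and function names are hypothetical placeholders rather than the project's actual detector.

```python
import math
from collections import Counter

def routing_fingerprint(topk_per_token, num_experts):
    """Normalized histogram of how often each expert appears in the top-k."""
    counts = Counter(e for token in topk_per_token for e in token)
    total = sum(counts.values())
    return [counts.get(i, 0) / total for i in range(num_experts)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def detect_task(fingerprint, task_priors):
    """Return the best-matching task and the per-task cosine scores."""
    scores = {task: cosine(fingerprint, prior)
              for task, prior in task_priors.items()}
    return max(scores, key=scores.get), scores

# Hypothetical cached priors over 4 experts (made-up numbers).
task_priors = {
    "code": [0.6, 0.3, 0.1, 0.0],
    "math": [0.0, 0.1, 0.3, 0.6],
}
# Hypothetical top-2 expert picks for a 3-token prompt.
topk_per_token = [[0, 1], [0, 1], [0, 2]]

fp = routing_fingerprint(topk_per_token, num_experts=4)
task, scores = detect_task(fp, task_priors)
```

With these toy inputs the fingerprint leans on experts 0 and 1, so it matches the "code" prior more closely than the "math" prior.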

Why this ordering

  1. S0 universal safety is the substrate: it amortizes compute across all domains by identifying generally redundant heads/experts and the perturbation magnitudes that preserve coherence. S1.0 already produced the head-level redundancy map as a byproduct (a principled head mask is less damaging than a random one, so the taxonomy identifies expendable heads).

  2. S1 head soft-scale: S1.0 hard-zero was graded REFINE because principled-zero behaved like a soft steer (alternative correct answers), but the locked metric could not separate it from random hard-zero (degeneracy collapse). S1.1 soft-scale tests whether the steer regime survives at non-zero scaling, measured with a coherence metric.

  3. S2.0 hard-mask — KILL, closed 2026-05-04. The causal screen passed, but the Δlogit metric does not establish steering; commit d89f4d9 records the closure. S2.1 needs coherence/task-quality metrics rather than inheriting a steering claim from the hard-mask ceiling.

  4. S2.1 soft router-logit bias: the deployable steering primitive. It uses the same calibration as S2.0 but replaces the hard skip with router_logits[task_experts] += ε. The dose-response sweep covers ε ∈ {0.5, 1.0, 2.0, 4.0} nats. The claim is the smallest ε that shows both selectivity and coherence.

  5. S3.0 prompt-time routing readout: a calibration-free task detector. Per-prompt forward pass → top-k expert distribution → cosine match vs cached priors. Sub-second overhead. Enables a runtime adaptive policy without the per-domain calibration burden.

  6. S4.0 composition — runtime: prompt arrives → S3.0 fingerprint → infer task → S2.1 selects bias direction at small ε → soft-steer model output. Zero training, reusable substrate, lightweight task overlay.
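The runtime flow in step 6 can be sketched end to end on toy data. This is a hedged illustration only: S4.0 is gated and has not been run, and the priors, the task-to-expert mapping, the fingerprint, and the name `adaptive_bias` are all hypothetical stand-ins for whatever S2.1 and S3.0 eventually produce.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def adaptive_bias(router_logits, fingerprint, task_priors, task_experts, eps=0.5):
    """Infer the task from the routing fingerprint, then bias its experts."""
    # S3.0 stand-in: pick the prior with the highest cosine match.
    task = max(task_priors, key=lambda t: cosine(fingerprint, task_priors[t]))
    # S2.1 stand-in: add a small eps (nats) to that task's expert logits.
    chosen = task_experts[task]
    steered = [x + eps if i in chosen else x
               for i, x in enumerate(router_logits)]
    return task, steered

# Hypothetical cached substrate: priors and task-to-expert map (made up).
TASK_PRIORS = {"code": [0.6, 0.3, 0.1, 0.0], "math": [0.0, 0.1, 0.3, 0.6]}
TASK_EXPERTS = {"code": {0}, "math": {3}}

fingerprint = [0.1, 0.1, 0.3, 0.5]    # toy fingerprint for one prompt
router_logits = [0.2, 0.1, 0.0, 0.3]  # toy router logits for one token

task, steered = adaptive_bias(router_logits, fingerprint,
                              TASK_PRIORS, TASK_EXPERTS, eps=0.5)
before, after = softmax(router_logits), softmax(steered)
```

Nothing here is trained: the priors and the expert map are lookups, and the only per-prompt work is one cosine comparison per task plus a logit addition.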

Falsifiable predictions