Levers — the ladder, rung by rung
A lever is one candidate steering mechanism: a way to nudge the model, paired with a pre-registered experiment and a keep/kill decision rule. The Steering Ladder is the ordered set of levers below, climbed one rung at a time. New to the terms here? The explainer introduces each one from intuition first.
Calibration. Of the six rungs, two are measured (S1.0 REFINE, S2.0 KILL — both causal probes, not steering wins) and four are planned (pre-registered, not yet run). The deployable primitive (S2.1) has no results yet. Read the status column literally.
Strategy: task-agnostic substrate + lightweight task-conditioned overlay — a reusable, compute-once map of safe-to-nudge components, plus a thin per-prompt layer that picks the steering direction.
Hard masks (the causal probes) test whether selectivity exists at all. Soft router-logit biases are the intended deployable primitive. Routing readouts are the intended cheap task inference. The program is built on the overlay logic, not on per-domain calibration overhead — but only the causal-probe rungs have run so far.
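To make the hard-mask versus soft-bias contrast concrete, here is a minimal Python sketch on a toy router. The logit values, the expert indices, and the helper names (`top_k`, `hard_mask`, `soft_bias`) are illustrative assumptions, not the program's code. Setting a masked logit to negative infinity is only one plausible way to model skipping an expert.

```python
import math

def top_k(logits, k):
    """Indices of the k largest router logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def hard_mask(logits, masked):
    """Causal probe: a masked expert gets -inf, so it loses to every finite logit."""
    return [-math.inf if i in masked else x for i, x in enumerate(logits)]

def soft_bias(logits, task_experts, eps):
    """Soft nudge: add eps (in nats) to the chosen experts' logits."""
    return [x + eps if i in task_experts else x for i, x in enumerate(logits)]

# Toy router logits for one token over four experts (assumed values).
router_logits = [2.0, 1.5, 0.8, 0.5]

print("baseline :", top_k(router_logits, 2))
print("hard mask:", top_k(hard_mask(router_logits, {0}), 2))

# Sweep the bias on expert 2 to see where the top-2 selection changes.
for eps in (0.5, 1.0, 2.0, 4.0):
    print("eps", eps, ":", top_k(soft_bias(router_logits, {2}, eps), 2))
```

In this toy setup the hard mask removes expert 0 outright, while the soft bias leaves the top-2 set unchanged at ε = 0.5 and first pulls expert 2 in at ε = 1.0. Whether a real router behaves this smoothly is exactly what the planned rungs would have to measure.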
Lever map
| ID | Lever | Mechanism | Status |
|---|---|---|---|
| S0 | Universal safety map | head/expert redundancy from S1.0 byproduct + R5 weak-mask | partially shipped (S1.0 REFINE redundancy signal) |
| S1 | Head-class soft steer | soft scaling on attention head outputs (not zero) | S1.0 REFINE ceiling; soft variant queued |
| S2.0 | Expert hard-mask ceiling | skip masked experts entirely | KILL — closed 2026-05-04 (causal screen passed but Δlogit metric does not establish steering; commit d89f4d9) |
| S2.1 | Soft router-logit bias (deployable primitive) | router_logits[task_experts] += ε with dose-response sweep | pre-registered, queued post-S2.0 |
| S3.0 | Routing-readout task detector | top-k expert distribution → cosine match vs task prior | pre-registered, cheap |
| S4.0 | Adaptive runtime bias = S3.0 ⨂ S2.1 | composes detector with bias selector | gated on S2.1 + S3.0 SHIP |
Why this ordering
- S0 universal safety is the substrate: it amortizes compute across all domains. Identifies generally redundant heads/experts and perturbation magnitudes that preserve coherence. S1.0 already produced the head-level redundancy map as a byproduct (principled head-mask is less damaging than random; the taxonomy identifies expendable heads).
- S1 head soft-scale: S1.0 hard-zero was REFINE because principled-zero ≡ soft steer (alternative correct answers) but the locked metric couldn’t separate from random hard-zero (degeneracy collapse). S1.1 soft-scale tests whether the steer regime survives at non-zero scaling, with coherence metric.
- S2.0 hard-mask: KILL, closed 2026-05-04. The causal screen passed, but the Δlogit metric does not establish steering; commit d89f4d9 records the closure. S2.1 needs coherence/task-quality metrics rather than inheriting a steering claim from the hard-mask ceiling.
- S2.1 soft router-logit bias: the deployable steering primitive. Same calibration as S2.0; replaces hard skip with router_logits[task_experts] += ε. Dose-response over ε ∈ {0.5, 1.0, 2.0, 4.0} nat. Smallest ε with selectivity AND coherence is the claim.
- S3.0 prompt-time routing readout: calibration-free task detector. Per-prompt forward → top-k expert distribution → cosine match vs cached priors. Sub-second overhead. Enables runtime adaptive policy without per-domain calibration burden.
- S4.0 composition: at runtime, prompt arrives → S3.0 fingerprint → infer task → S2.1 selects bias direction at small ε → soft-steer model output. Zero training, reusable substrate, lightweight task overlay.
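The S3.0 readout and the S4.0 composition described above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions: the task names, the prior vectors, the task-to-expert table, and all function names are hypothetical, and the fingerprint here is a simple normalized count of top-k selections, which may differ from the pre-registered definition.

```python
import math
from collections import Counter

def routing_fingerprint(topk_per_token, num_experts):
    """Normalized histogram of how often each expert appears in the top-k."""
    counts = Counter(e for token in topk_per_token for e in token)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no routing decisions to summarize")
    return [counts.get(i, 0) / total for i in range(num_experts)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def detect_task(fingerprint, priors):
    """S3.0-style readout: pick the task whose cached prior is most similar."""
    return max(priors, key=lambda task: cosine(fingerprint, priors[task]))

def select_bias(task, task_experts, eps):
    """S4.0-style composition: the detected task decides which experts get +eps."""
    return {expert: eps for expert in task_experts[task]}

# Hypothetical cached priors and task-to-expert table (assumed values).
priors = {"code": [0.5, 0.4, 0.05, 0.05], "prose": [0.05, 0.05, 0.5, 0.4]}
task_experts = {"code": [0, 1], "prose": [2, 3]}

# Top-2 expert choices for three tokens of one prompt (assumed values).
topk_per_token = [[0, 1], [0, 2], [1, 0]]

fingerprint = routing_fingerprint(topk_per_token, num_experts=4)
task = detect_task(fingerprint, priors)
bias = select_bias(task, task_experts, eps=0.5)
print(task, bias)
```

The returned dictionary maps expert index to the additive logit bias, so it could feed a step like router_logits[task_experts] += ε. Nothing here says the detector reaches its accuracy target; it only shows the data flow.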
Falsifiable predictions
- S0: redundancy map generalizes across prompt types (S1.0 result confirms for heads)
- S2.0 ceiling: under the humaneval-strong mask, Δlogit_humaneval ≥ 0.5 nat AND Δlogit_wikitext ≤ 0.3 nat (selectivity)
- S2.1 deployability: smallest ε with selectivity AND coherence preserved
- S3.0 detector: classification accuracy ≥ 85% on held-out prompts at sub-second overhead
- S4.0 composition: same ε with predicted-task direction matches oracle-task direction within 1.2× selectivity ratio
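The thresholds above can be written as small decision predicates, which helps keep the keep/kill rules unambiguous. The sketch below is one possible reading, not the registered analysis code: the function names are hypothetical, "sub-second" is taken as under 1.0 s, and the S4.0 check assumes the 1.2× bound applies to the ratio of the larger to the smaller selectivity value.

```python
def s20_selective(delta_humaneval, delta_wikitext, min_target=0.5, max_control=0.3):
    """S2.0 criterion: large shift on the target set, small shift on the control set (nats)."""
    return delta_humaneval >= min_target and delta_wikitext <= max_control

def smallest_passing_eps(results):
    """S2.1 rule: smallest eps whose run was both selective and coherent.

    `results` maps eps -> (selective, coherent); returns None if nothing passes.
    """
    passing = [eps for eps, (selective, coherent) in results.items()
               if selective and coherent]
    return min(passing) if passing else None

def s30_detector_ok(accuracy, latency_seconds):
    """S3.0 criterion: at least 85% held-out accuracy at sub-second overhead."""
    return accuracy >= 0.85 and latency_seconds < 1.0

def s40_within_ratio(sel_predicted, sel_oracle, max_ratio=1.2):
    """S4.0 criterion (assumed reading): selectivities agree within a 1.2x ratio."""
    if sel_predicted <= 0 or sel_oracle <= 0:
        return False
    high, low = max(sel_predicted, sel_oracle), min(sel_predicted, sel_oracle)
    return high / low <= max_ratio

# Made-up numbers purely to exercise the rules; these are not results.
example_sweep = {0.5: (True, False), 1.0: (True, True), 2.0: (True, True), 4.0: (True, False)}
print(smallest_passing_eps(example_sweep))
print(s20_selective(0.6, 0.2), s30_detector_ok(0.9, 0.4), s40_within_ratio(1.0, 1.15))
```

The inputs in the example are invented to exercise the functions; the planned rungs have no measured values yet.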