Steering Ladder — technical overview
Dense summary for readers who already know the area. New here? Start with the explainer. S1.0 has not shipped — most rungs below are pre-registered design intent, not measured results; statuses are tagged inline.
Task-agnostic substrate + lightweight task-conditioned overlay. Compression-ladder
negative results expose runtime intervention handles. The intended deployable
primitive is soft router-logit biasing on calibration-free routing readouts, not
per-domain hard-masks. All empirical work is on allenai/OLMoE-1B-7B-0924.
Status (2026-05-04)
| ID | Lever | Status | Calibration |
|---|---|---|---|
| S0 | Universal safety map (head/expert redundancy) | partial — S1.0 byproduct landed | head-redundancy map measured; full map planned |
| S1.0 | Head-class hard-zero | REFINE (zero-mask blunt; soft variant queued) | measured (causal screen) |
| S2.0 | Expert hard-mask ceiling | KILL on steering, causal screen PASS (G3+G5+G6+G7) | measured (causal screen) |
| S2.1 | Soft router-logit bias (deployable primitive) | pre-registered, queued | planned |
| S3.0 | Prompt-time routing-readout task detector | pre-registered, cheap | planned |
| S4.0 | Runtime composition (S3.0 ⨂ S2.1) | gated on S2.1+S3.0 SHIP | planned |
Pages
- Causal probe — S1.0, S2.0 hard ablations (causal claim only)
- Soft control — S2.1 dose-response (planned steering primitive)
- Practical steering — S3.0 + S4.0 runtime (planned)
- Levers — mechanism + ladder ordering
- Headline samples — placeholder until S1.0 ships
- Methodology — pre-reg + trap discipline
- Known limits — single-arch caveat, metric demotion, ablation-vs-steering boundary
Thesis
Per-head and per-expert calibration signatures are stable, task-discriminative, and direction-reversing across corpora (measured, OLMoE-only). None of these properties needed probe training or SAE infrastructure — they fall out of compression-ladder forensics. The steering paper is the positive reframe: redundancy maps + soft-bias primitives + cheap task detectors are conjectured to compose into runtime control without per-domain calibration cost. That composition is the program’s central hypothesis, not yet a demonstrated outcome.