Steering Ladder

A gentle, top-to-bottom walkthrough. No background assumed. Read it in order; each term is introduced in plain language first, then named, then detailed.

Read this first: what is and isn’t measured. The Steering Ladder is a research program, not a finished result. As of this writing the first deployable rung (S2.1, soft router-logit biasing) has not shipped. Most of what follows is planned or conjectured: design intent with pre-registered decision rules, not demonstrated outcomes. Two early experiments have closed (S1.0, “REFINE”, and S2.0, “KILL on steering”), and those are measured. Everything else is a plan. Each claim below is tagged so you always know which is which.


1. The starting point: a model made of many small specialists

Some modern language models are not one big network. Instead, each layer holds a pool of small sub-networks — “specialists” — and a tiny traffic controller picks a handful of them to handle each token. The controller is called a router, the specialists are called experts, and the whole design is a Mixture-of-Experts (MoE) model. All the empirical work here runs on one such model, allenai/OLMoE-1B-7B-0924.
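To make the routing idea concrete, here is a minimal sketch in plain Python. It is a generic top-k router, not OLMoE's actual implementation: the names softmax, route_token, and moe_layer, the scalar toy experts, and the value of k are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs, best first. Weights are the
    softmax probabilities of the chosen experts (not renormalized here;
    real models differ on this detail).
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


def moe_layer(x, experts, router_logits, k):
    """Run only the chosen experts and mix their outputs by router weight.

    In a real model the router computes the logits from the token itself;
    here they are passed in directly to keep the toy small.
    """
    out = 0.0
    for index, weight in route_token(router_logits, k):
        out += weight * experts[index](x)
    return out
```

The point of the sketch is the shape of the computation: every expert gets a score, but only k of them actually run for a given token.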

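Two kinds of internal parts matter for this paper: attention heads (the parallel sub-units inside each attention layer) and the experts just described.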

If you can nudge which heads or experts get used, you may be able to nudge what the model does — without retraining it. That nudging is what we mean by steering.


2. Where the handles come from: leftovers from a failed compression project

A sibling project tried to compress these models — shrink them by throwing away parts that seemed redundant. Much of that effort failed: the parts could not be safely removed. But the failures were informative. To decide what was safe to cut, the project had to measure every head and every expert — how redundant it was, how it routed, how much error it introduced. Those measurements survived even though the compression did not.

We call these measurements compression byproducts: diagnostic signals that fell out of a compression study as a side effect. The bet of this paper is that the byproducts make good control handles even though they made poor compression targets. Negative compression results, read sideways, become positive steering signal.

Calibration note. That the byproducts are stable and informative is measured (it is what let the compression study draw conclusions). That they make good steering handles is the conjecture this whole program exists to test.


3. One particular byproduct: the hydro decomposition

The richest byproduct is a way of sorting every attention head into a small set of behavioral classes — which heads move information around (“transport”), which sit on the busy routing pathways, which are hybrids. Because the math behind it borrows the language of fluid flow (transport, closure, smoothness), it is named the hydro decomposition. You do not need the fluid analogy to use the result: it is simply a per-head label, computed once, with no probe training and no extra machinery.

The hydro decomposition plays the role that a trained interpretability tool (a “sparse autoencoder”, or SAE) would normally play — but for free, as a byproduct. We call handles obtained this way zero-train steering primitives: control knobs that need no gradient training, no fitted probe, no per-task setup.

Calibration note. The head taxonomy itself is measured and stable across corpora. Its use as a steering primitive is conjectured / in progress.


4. The thing we are building toward: soft, calibration-free control

The naive way to steer is blunt: find the experts a task uses and switch them off (“hard masking”). That answers a narrow question — is this component causally involved? — but it tends to break the output rather than gently redirect it. An early experiment (S2.0) confirmed this: hard expert masking is causally real but does not steer cleanly. That is a measured result (a “KILL on the steering claim”).

So the deployable target is gentler. Instead of switching experts on or off, we plan to tilt the router’s preferences by a small, tunable amount — add a little to the scores of the experts a task favors, subtract a little from the rest. Because we adjust the router’s internal scores (“logits”) smoothly rather than masking, this is called soft router-logit biasing. The size of the nudge is a dial you can turn.
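The difference between hard masking and soft biasing can be sketched on a single token's router scores. This is an illustration under stated assumptions, not the S2.1 implementation (which has not been run): the function names, the symmetric +alpha / -alpha rule, and the toy scores are all hypothetical.

```python
def hard_mask(logits, blocked):
    """Hard masking: blocked experts can never be selected."""
    return [float("-inf") if i in blocked else z for i, z in enumerate(logits)]


def soft_bias(logits, favored, alpha):
    """Soft biasing: add alpha to favored experts, subtract alpha from the rest.

    alpha is the dial. alpha = 0 leaves the router untouched.
    """
    return [z + alpha if i in favored else z - alpha for i, z in enumerate(logits)]


def top_k(logits, k):
    """Indices of the k highest-scoring experts, best first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


logits = [2.0, 1.8, 1.0, 0.2]   # toy router scores for four experts
favored = {2}                   # hypothetical: the task favors expert 2

print(top_k(soft_bias(logits, favored, 0.0), 2))  # [0, 1]  no nudge
print(top_k(soft_bias(logits, favored, 0.6), 2))  # [2, 0]  expert 2 is pulled in
print(top_k(hard_mask(logits, {0}), 2))           # [1, 2]  expert 0 is removed outright
```

With the soft bias, expert 0 keeps its place when its score is strong enough, and the selection shifts gradually as alpha grows. With the hard mask, the blocked expert is gone regardless of how strongly the router preferred it.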

To know which way to tilt, you need to know what task the prompt is. Rather than asking the user to label it, we plan to read the answer off the model’s own routing behavior on the prompt — no calibration data, no fitted classifier. We call these calibration-free routing readouts.
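As a rough picture of what "reading the routing behavior" could mean, the sketch below summarizes which experts a prompt's tokens were sent to. It only computes the raw usage profile. How such a profile would be turned into a task decision is not specified here, and the function name and input layout are assumptions for illustration.

```python
from collections import Counter


def expert_usage_profile(selected_per_token, num_experts):
    """Fraction of all routing slots that went to each expert.

    selected_per_token: one list of chosen expert indices per token,
    for example [[0, 1], [1, 2]] for two tokens with top-2 routing.
    """
    counts = Counter(i for chosen in selected_per_token for i in chosen)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_experts
    return [counts[i] / total for i in range(num_experts)]


# Three tokens, top-2 routing, four experts: expert 1 dominates.
profile = expert_usage_profile([[0, 1], [1, 2], [1, 3]], num_experts=4)
print(profile)
```

Nothing in this computation is fitted or trained: it is a count over routing decisions the model already made while processing the prompt.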

Calibration note. Soft biasing (S2.1) and routing readouts (S3.0) are both pre-registered but not yet run — planned, with decision rules written in advance. No results exist for them yet.


5. The architecture: a shared base plus a thin per-task layer

Put together, the design has two parts: a shared base that serves every task (the task-agnostic substrate) and a thin per-task layer on top of it (the task-conditioned overlay).

This split is the whole point. It avoids paying a fresh calibration cost for every domain.


6. The vocabulary, collected

Now that each idea has been introduced from intuition, here are the names in one place:


7. Where the program actually stands

| Rung | What it is | Status (measured vs planned) |
|------|------------|------------------------------|
| S0 | Universal safety map (the substrate) | Planned. A byproduct from S1.0 gave a partial head-level redundancy map (measured); the full map is pre-registered, not run. |
| S1.0 | Head-class hard-zero (a causal probe) | Measured: REFINE. Principled masks were less damaging than random, but the locked metric could not tell coherent steering from collapse. |
| S2.0 | Expert hard-mask ceiling (a causal probe) | Measured: KILL on steering. Causally real, but cross-task isolation absent under hard masking. |
| S2.1 | Soft router-logit bias (the deployable primitive) | Planned. Pre-registered, queued. No results. |
| S3.0 | Routing-readout task detector | Planned. Pre-registered, queued. No results. |
| S4.0 | Runtime composition (S3.0 ⨂ S2.1) | Planned. Gated on S2.1 and S3.0 shipping first. |

Two rungs (S1.0 and S2.0) are measured; the rest are design intent. The headline demonstration samples will only exist after S2.1, the deployable primitive, ships.


8. Where to go next