Causal probes

A causal probe is the bluntest possible test: switch a component fully off (a “hard ablation”) and see whether the model’s behavior changes in a task-specific way. It answers one narrow question and deliberately leaves a different, harder one open.

Calibration. Both experiments on this page (S1.0, S2.0) have run, so their results are measured. But both are causal screens, not steering wins: S1.0 closed REFINE, S2.0 closed KILL-on-steering. Neither establishes that you can steer the model usefully. The steering claims live on the (not-yet-run) soft control and practical steering pages.

Question answered: is component X causally task-selective?

Question NOT answered: can we steer usefully via X?

The distinction comes from the 2026-05-04 design critique. A causal probe is a hard ablation. Steering is a smooth, controllable perturbation that preserves competence. The same component can pass a causal probe and still fail as a steering primitive, because ablation can collapse coherence rather than produce a coherent alternative.

Levels

S1.0 — Head-class hard zero

Mask attention head outputs by hydro taxonomy class.
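As a rough illustration of what a head-class hard zero involves, the sketch below zeroes the output vector of every head assigned to one class. It is a minimal stdlib-only sketch, not the S1.0 implementation: the `taxonomy` mapping, the class names, and both function names are hypothetical, and it assumes head outputs can be intercepted before the output projection combines them.

```python
from typing import Dict, List, Sequence, Set, Tuple

HeadId = Tuple[int, int]  # (layer index, head index)


def heads_in_class(taxonomy: Dict[HeadId, str], target_class: str) -> Set[HeadId]:
    """Return every (layer, head) pair the taxonomy assigns to target_class."""
    return {head for head, cls in taxonomy.items() if cls == target_class}


def hard_zero_heads(
    head_outputs: Sequence[Sequence[float]],
    layer: int,
    masked: Set[HeadId],
) -> List[List[float]]:
    """Zero the output vector of each masked head in one layer.

    head_outputs[h] is the output vector of head h (assumed to be captured
    before the heads are mixed). Unmasked heads pass through unchanged.
    """
    return [
        [0.0] * len(vec) if (layer, h) in masked else list(vec)
        for h, vec in enumerate(head_outputs)
    ]


# Hypothetical taxonomy: the class names are placeholders, not real classes.
taxonomy = {(0, 0): "class_a", (0, 1): "class_b", (0, 2): "class_a"}
masked = heads_in_class(taxonomy, "class_a")

outputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
zeroed = hard_zero_heads(outputs, layer=0, masked=masked)
```

The mask is all-or-nothing by construction: a head is either untouched or fully zeroed, which is what makes this a causal probe rather than a steering perturbation.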

S2.0 — Expert hard mask

Skip masked experts entirely (R5 mechanism with task-cal mask source).
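One way to picture "skip entirely" in a mixture-of-experts layer is to drop masked experts from the routing candidates before top-k selection, so they never receive a token. The sketch below does that for a single token. It is an assumption-laden illustration, not the R5 mechanism itself: the function name, the top-k routing scheme, and the choice to renormalize weights with a softmax over the surviving experts are all assumed here.

```python
import math
from typing import Dict, List, Set


def route_with_hard_mask(
    router_logits: List[float],
    masked: Set[int],
    top_k: int,
) -> Dict[int, float]:
    """Pick top_k experts for one token, never selecting a masked expert.

    Returns {expert index: routing weight}. Weights are a softmax over the
    logits of the chosen experts only (an assumed renormalization).
    """
    allowed = [e for e in range(len(router_logits)) if e not in masked]
    if len(allowed) < top_k:
        raise ValueError("mask leaves fewer than top_k experts")

    # Highest-logit experts among those still allowed.
    chosen = sorted(allowed, key=lambda e: router_logits[e], reverse=True)[:top_k]

    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[e] for e in chosen)
    exps = {e: math.exp(router_logits[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}


logits = [2.0, 1.0, 3.0, 0.5]
# Expert 2 has the highest logit but is masked, so it is skipped.
weights = route_with_hard_mask(logits, masked={2}, top_k=2)
```

Under this sketch a masked expert contributes nothing and its routing mass is redistributed to the remaining experts; whether the real mechanism renormalizes this way is not stated on this page.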

What causal probes give

What causal probes do NOT give

Discipline lessons applied

Per 2026-05-04 critique:

  1. Match intervention budget exactly (S2.0 masks 13 of 64 experts per layer × 16 layers = 208 experts per condition; matched across principled / random / cross-corpus).
  2. Match layers exactly (S2.0 patches all 16 layers per condition).
  3. Change one variable per condition (S2.0 conditions vary mask source only).
  4. Random control at same layer + count (S2.0 C4 random_13_per_layer).
  5. Separate causal from steering (S2.0 explicitly causal; S2.1 explicitly steering).
  6. Quality metrics (added to S2.0 post-completion analysis; baked into S2.1).
  7. Task-agnostic safety map first (S0.0 in flight).
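Points 1, 2, and 4 can be checked mechanically before a run. The sketch below builds a random control mask with 13 experts in each of 16 layers and verifies that it meets the 208-expert budget. The counts come from the list above, read as 13 of 64 experts per layer; the function names, the seed handling, and the mask representation (layer index to sorted expert indices) are hypothetical.

```python
import random
from typing import Dict, List

N_LAYERS = 16    # layers patched per condition
N_EXPERTS = 64   # experts per layer (assumed reading of "13/64")
PER_LAYER = 13   # experts masked in each layer


def random_mask(
    seed: int,
    n_layers: int = N_LAYERS,
    n_experts: int = N_EXPERTS,
    per_layer: int = PER_LAYER,
) -> Dict[int, List[int]]:
    """Sample per_layer distinct experts in every layer, reproducibly."""
    rng = random.Random(seed)
    return {
        layer: sorted(rng.sample(range(n_experts), per_layer))
        for layer in range(n_layers)
    }


def total_budget(
    mask: Dict[int, List[int]],
    n_layers: int = N_LAYERS,
    per_layer: int = PER_LAYER,
) -> int:
    """Check layer coverage and per-layer count, then return the total."""
    if sorted(mask) != list(range(n_layers)):
        raise ValueError("mask must cover every layer exactly once")
    for layer, experts in mask.items():
        if len(set(experts)) != per_layer:
            raise ValueError(f"layer {layer} does not mask {per_layer} experts")
    return sum(len(set(experts)) for experts in mask.values())


control = random_mask(seed=0)
budget = total_budget(control)  # 13 * 16 = 208
```

Running the same check on the principled and cross-corpus masks would confirm that conditions differ only in mask source, not in layer coverage or expert count.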

S1.0 violated points 1, 2, 4, and 6. The postmortem is honest about it.