Causal probes

A causal probe is the bluntest possible test: switch a component fully off (a “hard ablation”) and see whether the model’s behavior changes in a task-specific way. It answers one narrow question and deliberately leaves a different, harder one open.

Calibration. Both experiments on this page (S1.0, S2.0) have run — their results are measured. But both are causal screens, not steering wins: S1.0 closed REFINE, S2.0 closed KILL-on-steering. Neither establishes that you can steer the model usefully. The steering claims live on the (not-yet-run) soft control and practical steering pages.

Question answered: is component X causally task-selective?

Question NOT answered: can we steer usefully via X?

The distinction comes from the 2026-05-04 design critique. A causal probe is a hard ablation; steering is a smooth, controllable perturbation that preserves competence. The same component can pass a causal probe and still fail as a steering primitive, because ablation can collapse coherence rather than produce a coherent alternative.

Levels

S1.0 — Head-class hard zero

Mask attention head outputs by hydro taxonomy class.

Result: REFINE. The principled mask (transport@L0, routing_heavy@L12) was less damaging than the budget-matched random mask (per the redundancy interpretation).
Causal claim: the head-class taxonomy is a basis (the principled-vs-random differential is significant).
NOT a steering claim: a zero-mask is blunt, and the locked metric cannot tell a coherent alternative from collapse.
Postmortem: cross-check/preregistry/s1_0_head_class_steering_2026-05-03/postmortem.md.
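
The head-class hard zero described above can be sketched in a few lines. This is a minimal illustration, not the S1.0 harness: the function name, the toy vectors, and the "other" label are hypothetical, and only the class names transport and routing_heavy come from this page.

```python
def zero_heads_by_class(head_outputs, head_classes, masked_classes):
    """Hard-zero every head whose class label is in masked_classes.

    head_outputs: one output vector (list of floats) per head, for one layer.
    head_classes: one class label per head (labels here are illustrative).
    """
    masked = set(masked_classes)
    return [
        [0.0] * len(vec) if cls in masked else list(vec)
        for vec, cls in zip(head_outputs, head_classes)
    ]


# Toy layer with four heads; the numbers are made up for the example.
head_outputs = [[0.2, -0.1], [1.0, 0.5], [-0.3, 0.4], [0.7, 0.7]]
head_classes = ["transport", "routing_heavy", "transport", "other"]

ablated = zero_heads_by_class(head_outputs, head_classes, {"transport"})
```

A budget-matched random control would zero the same number of heads in the same layer, chosen without regard to class.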

S2.0 — Expert hard mask

Skip masked experts entirely (R5 mechanism with task-cal mask source).
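
As a rough sketch of what "skip masked experts entirely" could look like for a single token, assuming a standard top-k router whose weights are renormalised over the selected experts (an assumption; the R5 mechanism itself is not specified on this page). The function name, logits, and top_k value are hypothetical.

```python
import math


def route_with_hard_mask(router_logits, masked_experts, top_k):
    """Pick top_k experts for one token, never selecting a masked expert.

    Returns {expert_index: weight}, with weights renormalised (softmax)
    over the selected experts only.
    """
    allowed = [i for i in range(len(router_logits)) if i not in masked_experts]
    if len(allowed) < top_k:
        raise ValueError("mask leaves fewer than top_k experts")
    chosen = sorted(allowed, key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Toy router with 8 experts; values are made up for the example.
logits = [2.0, 1.5, 0.1, -0.4, 1.9, 0.0, -1.0, 0.3]
baseline = route_with_hard_mask(logits, set(), top_k=2)
masked = route_with_hard_mask(logits, {0, 4}, top_k=2)
```

In this toy case the baseline routes to experts 0 and 4, and masking both forces the token onto experts 1 and 7. The switch is all-or-nothing, which is why a hard mask is a causal screen rather than a graded control.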

Status: KILL on steering claim, causal screen passes (2026-05-04).
Result file: cross-check/steering/s2_0_expert_mask_steering.v1.json. tau sha256: d15ea23c25408a76157e990b03d5ef9fa2363a3ecde695096c9b8fb677c6c1e5.
Causal screen PASS: G3 weak-preservation (humaneval Δlogit=0.18 < 0.20 cap, R5 mechanism survives NLL→Δlogit transition), G5 random-trap (0.066 < 0.5× strong), G6 all-64 (Δlogit=9.3, 2.51× C2), G7 dense repro (drift=0.0).
Steering gates FAIL: G1 strong-mask he-selectivity (he=3.72 PASS but wt=1.67 over 0.3 cap), G2 wikitext mirror (wt=1.56 PASS but he=1.26 over 0.3 cap), G4 direction-reversal (he_ratio=2.95 PASS, wt_ratio=0.94 FAIL — wikitext-strong leaks broadly into humaneval).
Reframe: expert-class IS causal at 16-layer × top-13 scope, but cross-task isolation under hard masking is absent. The R5 weak-preservation arm lifts to generation level; the strong-mask cross-task selectivity does not.
Mechanism note: R5 SHIP at 580× selectivity was on NLL inflation — an integrated metric. S2.0 measures Δlogit at greedy positions — instantaneous. Both can be true: NLL is dominated by high-confidence positions where bottom-13 mask is identical to baseline; instantaneous Δlogit counts every position and the strong-mask leaks broadly.
Postmortem: cross-check/preregistry/s2_0_expert_mask_steering_2026-05-04/postmortem.md.
Steering torch passes to S2.1 soft router-logit bias under quality+coherence primary, as already pre-registered.
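
To make the split between the causal screen and the steering gates concrete, the cap comparisons quoted above can be replayed as plain arithmetic. This is only a sketch: the helper name is hypothetical, it covers just the gates whose caps are stated on this page, and it does not model the on-target thresholds (for example what makes he=3.72 a PASS), which are not given here.

```python
def under_cap(value, cap):
    """A capped quantity passes only if it stays strictly below its cap."""
    return value < cap


# (value, cap) pairs copied from the gate lines above.
capped = {
    "G3 humaneval Δlogit": (0.18, 0.20),
    "G1 wt under he-strong mask": (1.67, 0.3),
    "G2 he under wt-strong mask": (1.26, 0.3),
}

verdicts = {
    name: "PASS" if under_cap(value, cap) else "FAIL"
    for name, (value, cap) in capped.items()
}

for name, verdict in verdicts.items():
    print(f"{name}: {verdict}")
```

The weak-preservation cap holds while both cross-task caps are exceeded several times over, which matches the stated verdict: the causal screen passes and the steering claim does not.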

What causal probes give

The substrate for the safety map (S0.0)
Upper bound on what soft-control can achieve (S2.1 cannot exceed S2.0 ceiling)
Identification of NEVER-TOUCH and REDUNDANT cells in the (component, layer) atlas

What causal probes do NOT give

Steering primitive (need S2.1 soft bias)
Practical-task-quality preservation (need quality metrics, not Δlogit)
Cross-task safety (need S0.0 atlas)
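
For contrast with a hard mask, a soft router-logit bias might look like the sketch below. It is a guess at the general shape of such an intervention, not the pre-registered S2.1 design: the function names, the bias values, and the toy logits are all hypothetical.

```python
def bias_router_logits(router_logits, target_experts, bias):
    """Add a finite bias to the target experts' logits.

    A negative bias discourages those experts without forbidding them.
    """
    return [x + bias if i in target_experts else x
            for i, x in enumerate(router_logits)]


def top_k_experts(router_logits, top_k):
    """Indices of the top_k largest logits, returned in ascending index order."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    return sorted(order[:top_k])


# Toy router with 8 experts; values are made up for the example.
logits = [2.0, 1.5, 0.1, -0.4, 1.9, 0.0, -1.0, 0.3]
targets = {0, 4}

selections = {
    bias: top_k_experts(bias_router_logits(logits, targets, bias), top_k=2)
    for bias in (0.0, -0.3, -1.0, -10.0)
}
```

In this toy case a small bias leaves routing unchanged, a moderate one displaces only expert 4, and a very large one reproduces the hard mask. That graded dial is the property a hard ablation lacks.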

Discipline lessons applied

Per 2026-05-04 critique:

1. Match intervention budget exactly (S2.0 uses 13 of 64 experts per layer × 16 layers = 208 experts per condition; matched across principled / random / cross-corpus).
2. Match layers exactly (S2.0 patches all 16 layers per condition).
3. Change one variable per condition (S2.0 conditions vary mask source only).
4. Random control at same layer + count (S2.0 C4 random_13_per_layer).
5. Separate causal from steering (S2.0 explicitly causal; S2.1 explicitly steering).
6. Quality metrics (added to S2.0 post-completion analysis; baked into S2.1).
7. Task-agnostic safety map first (S0.0 in flight).

S1.0 violated points 1, 2, 4, and 6; its postmortem is honest about that.
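
The budget and layer matching rules can be checked mechanically. The sketch below builds a random control with 13 experts in each of 16 layers (13 × 16 = 208) and verifies that two masks share layers and per-layer counts. The helper names and the stand-in "principled" mask are hypothetical; only the counts come from this page.

```python
import random

N_LAYERS, N_EXPERTS, PER_LAYER = 16, 64, 13


def random_mask(seed):
    """Random control: PER_LAYER distinct experts in every layer."""
    rng = random.Random(seed)
    return {
        layer: frozenset(rng.sample(range(N_EXPERTS), PER_LAYER))
        for layer in range(N_LAYERS)
    }


def budget(mask):
    """Total number of masked experts across all layers."""
    return sum(len(experts) for experts in mask.values())


def same_budget(mask_a, mask_b):
    """True when both masks cover the same layers with equal per-layer counts."""
    return mask_a.keys() == mask_b.keys() and all(
        len(mask_a[layer]) == len(mask_b[layer]) for layer in mask_a
    )


# Stand-in for a principled mask: the first 13 experts in each layer.
principled = {layer: frozenset(range(PER_LAYER)) for layer in range(N_LAYERS)}
control = random_mask(seed=0)

assert budget(principled) == budget(control) == 208
assert same_budget(principled, control)
```

Running a check like this before an experiment is one way to catch the kind of mismatch listed against S1.0.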