Known limits
What follows are the things that could undercut the program’s claims, stated plainly and up front. Read them alongside any result on the other pages.
Single-architecture evidence (OLMoE-only)
All current empirical levers (S1.0 head-class, S2.0 expert hard-mask, S2.1 soft
bias, S3.0 routing readout, S0.0 safety map) run only on allenai/OLMoE-1B-7B-0924.
This is acceptable for internal ladder discipline, but it is a risk for paper claims.
Exception: the layer-flow steerability ladder (LFS0-LFS_K5, 24 experiments
closed 2026-05-06 with K5 KEEP_LFS2_LFS4_BLOCKED) ran on
Qwen/Qwen2.5-3B-Instruct. Class K (bias/generation decoupling) and Class A.2
kills are Qwen-specific and cannot be merged with OLMoE-based S1.0/S2.0 evidence
without architecture-level qualification.
Per the 2026-05-04 program review:
Single-architecture evidence remains a publication risk. Most major empirical results are OLMoE-only. That is acceptable for internal laddering, but risky for paper claims. Cross-architecture validation should be treated as a separate rung with its own kill rules, not as a vague future wish.
Cross-arch validation rung (deferred — own pre-reg)
S5.0 (cross-arch validation) replicates the strongest steering result on:
- Mixtral 8x7B (different routing density)
- Gemma-4 e2b (different MoE topology)
- DeepSeek-MoE (different architecture and routing)
Pre-reg gates (when S5.0 lands):
- ε_safe for each (component, layer) pair agrees within 1.5× across architectures
- Direction-reversal sign holds (sign only; magnitude can vary per architecture)
- Routing readout (S3.0) classification generalizes at ≥75% on the cross-arch held-out set
- A soft-bias dose-response (S2.1) regime exists at qualitatively similar ε
If any cross-arch test fails, paper claims must be scoped to “OLMoE-class fused-MoE” rather than “MoE in general.”
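The gates and the fallback claim scope can be written down as a small check, which may help when the S5.0 pre-reg is drafted. This is only a sketch under stated assumptions: the `CrossArchResult` record, its field names, and the symmetric reading of "within 1.5×" are hypothetical and not part of the program's tooling. The thresholds (1.5× and 75%) come from the gate list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrossArchResult:
    """Hypothetical per-architecture summary; all field names are illustrative."""
    arch: str
    eps_safe_ratio: float       # worst-case eps_safe(arch) / eps_safe(OLMoE) over (component, layer)
    reversal_sign_holds: bool   # direction-reversal sign matches (magnitude ignored)
    readout_score: float        # S3.0 classification score on cross-arch held-out, in [0, 1]
    dose_response_regime: bool  # S2.1 regime found at qualitatively similar eps


def failed_gates(r, ratio_bound=1.5, min_readout=0.75):
    """Return the names of the pre-reg gates that this architecture fails."""
    failures = []
    # Assumption: "within 1.5x" is symmetric, i.e. the ratio lies in [1/1.5, 1.5].
    if not (1.0 / ratio_bound <= r.eps_safe_ratio <= ratio_bound):
        failures.append("eps_safe")
    if not r.reversal_sign_holds:
        failures.append("reversal_sign")
    if r.readout_score < min_readout:
        failures.append("routing_readout")
    if not r.dose_response_regime:
        failures.append("dose_response")
    return failures


def claim_scope(results):
    """Scope paper claims: any failed gate (or no evidence at all) restricts the claim."""
    if not results or any(failed_gates(r) for r in results):
        return "OLMoE-class fused-MoE"
    return "MoE in general"
```

Treating an empty result list as a restricted scope is a deliberate choice in this sketch: with no cross-arch evidence, the general claim is not earned.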
Drift-only metrics demoted, but still in dataset
S1.0 and S2.0 use Δlogit on baseline-greedy as the primary metric. Per the critique, this penalizes coherent alternatives to the baseline output. S2.1 and S0.0 promote quality and coherence to primary metrics; drift is kept for reference only.
When citing S1.0 or S2.0 in paper prose, explicitly note this metric limitation.
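A toy calculation shows why a drift-only metric penalizes coherent alternatives. The sketch below assumes that Δlogit on baseline-greedy means the change in the logit of the token the unmodified model would pick greedily, averaged over positions. The exact definition used in S1.0 / S2.0 may differ, and the function names and numbers are illustrative only.

```python
def greedy_tokens(logits):
    """Argmax token id at each position; logits is a list of per-position vocab rows."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def drift_on_baseline_greedy(baseline_logits, steered_logits):
    """Mean (steered - baseline) logit of the baseline-greedy token at each position."""
    tokens = greedy_tokens(baseline_logits)
    deltas = [
        steered[t] - base[t]
        for base, steered, t in zip(baseline_logits, steered_logits, tokens)
    ]
    return sum(deltas) / len(deltas)


# Toy one-position, three-token example (made-up numbers, not experimental data).
baseline = [[2.0, 1.0, 0.0]]   # baseline greedy token is id 0
steered = [[1.0, 3.0, 0.0]]    # steered model now prefers token id 1

drift = drift_on_baseline_greedy(baseline, steered)
```

Here `drift` is negative because the baseline token lost logit mass, but the number says nothing about whether token 1 is a coherent continuation. That gap is what the quality and coherence metrics in S2.1 / S0.0 are meant to cover.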
Hard ablation ≠ steering claim
Per critique:
Hard ablation answers “is this component causal?” Useful steering asks “can we move behavior smoothly while preserving quality?” S1.0 and S2.0 are causal screens. S2.1 is closer to steering.
Paper sections explicitly partition the evidence into causal_probe (S1.0, S2.0), soft_control (S2.1), and practical_steering (S3.0, S4.0). Do not conflate them.
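One way to enforce the partition mechanically is a lookup that flags citations landing in the wrong section class. This is a minimal sketch: the class labels and experiment IDs are taken from the partition above, while `EVIDENCE_CLASS` and `misplaced_citations` are hypothetical names, not existing tooling.

```python
# Evidence class for each experiment, as partitioned above.
EVIDENCE_CLASS = {
    "S1.0": "causal_probe",
    "S2.0": "causal_probe",
    "S2.1": "soft_control",
    "S3.0": "practical_steering",
    "S4.0": "practical_steering",
}


def misplaced_citations(section_class, cited_experiments):
    """Return cited experiments whose evidence class differs from the section's class.

    Unknown experiment IDs are also returned, since their class cannot be verified.
    """
    return [
        exp for exp in cited_experiments
        if EVIDENCE_CLASS.get(exp) != section_class
    ]
```

For example, `misplaced_citations("practical_steering", ["S3.0", "S2.0"])` flags `S2.0`, a causal screen cited as if it were steering evidence.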