Activation Steering for Masked Diffusion Language Models

arXiv, March 31, 2026

arXiv:2512.24143v3, announce type: replace

Authors: Adi Shnaidman, Erin Feiglin, Osher Yaari, Efrat Mentel, Amit Levi, Raz Lapid


Abstract: Masked diffusion language models (MDLMs) generate text via iterative masked-token denoising, enabling mask-parallel decoding and distinct controllability and efficiency tradeoffs from autoregressive LLMs. Yet, efficient representation-level mechanisms for inference-time control in MDLMs remain largely unexplored. To address this gap, we introduce an activation steering primitive for MDLMs: we extract a single low-dimensional direction from contrastive prompt sets using one prompt-only forward pass, and apply a global intervention on residual-stream activations throughout reverse diffusion, without performing optimization or altering the diffusion sampling procedure. Using safety refusal as a deployment-relevant case study, we find that refusal behavior in multiple MDLMs is governed by a consistent, approximately one-dimensional activation subspace. Applying the corresponding direction yields large and systematic behavioral shifts and is substantially more effective than prompt-based and optimization-based baselines. We further uncover diffusion-specific accessibility: effective directions can be extracted not only from post-instruction tokens, but also from pre-instruction tokens that are typically ineffective in autoregressive models due to causal attention. Ablations localize maximal leverage to early denoising steps and mid-to-late transformer layers, with early diffusion blocks contributing disproportionately. Finally, in an MDLM trained on English and Chinese, extracted directions transfer strongly between English and Chinese, but do not reliably generalize to an autoregressive architecture, highlighting architecture-dependent representations of safety constraints.
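The extract-then-intervene idea in the abstract can be illustrated with a minimal sketch. The abstract does not say how the direction is computed or what form the intervention takes, so everything below is an assumption: the direction is taken as a unit-norm difference of means between two contrastive activation sets (a common choice in activation-steering work), and the intervention either projects that direction out of each residual-stream vector or adds a scaled copy of it. All function names are hypothetical, and plain Python lists stand in for real model activations.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_direction(pos_acts, neg_acts):
    """Unit-norm difference of means between two contrastive sets.

    Assumption: difference of means is used here for illustration only;
    the abstract does not name the extraction method.
    """
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("contrastive sets have identical means")
    return [x / norm for x in diff]

def project_out(hidden, direction):
    """Remove the component of `hidden` along the unit vector `direction`."""
    coeff = sum(h * d for h, d in zip(hidden, direction))
    return [h - coeff * d for h, d in zip(hidden, direction)]

def add_direction(hidden, direction, alpha):
    """Shift `hidden` by `alpha` times the unit vector `direction`."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def steer_all_positions(hidden_states, direction):
    """Apply the same projection at every token position.

    In a real MDLM this would be repeated at each reverse-diffusion step
    and at the chosen layers; here it is a single pass over toy vectors.
    """
    return [project_out(h, direction) for h in hidden_states]

# Toy contrastive activations (hypothetical 3-dimensional residual stream).
pos = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
neg = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
direction = steering_direction(pos, neg)
steered = steer_all_positions([[5.0, 2.0, -1.0], [1.0, 1.0, 1.0]], direction)
```

In this toy setup the two sets differ only along the first coordinate, so the extracted direction is that axis and the projection zeroes the first component of every position while leaving the others unchanged. No optimization is involved, which matches the abstract's description of the primitive; the specific formulas remain illustrative.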

Comments: Accepted at ReALM-GEN @ ICLR 2026

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2512.24143 [cs.CL] (or arXiv:2512.24143v3 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2512.24143

arXiv-issued DOI via DataCite

Submission history

From: Raz Lapid
[v1] Tue, 30 Dec 2025 11:10:52 UTC (442 KB)
[v2] Mon, 2 Mar 2026 19:32:58 UTC (233 KB)
[v3] Sun, 29 Mar 2026 11:09:51 UTC (233 KB)
