
Making Training-Free Diffusion Segmentors Scale with the Generative Power

arXiv, March 30, 2026

arXiv:2603.06178v3 (announce type: replace)

Authors: Benyuan Meng, Qianqian Xu, Zitai Wang, Xiaochun Cao, Longtao Huang, Qingming Huang


Abstract: As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there is a discrepancy between these individual attention maps and a unified global representation; (ii) even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques, auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Code is available at this https URL.
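The two gaps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how auto aggregation or per-pixel rescaling work, so uniform averaging over heads and layers stands in for the first, and a simple per-token min-max normalization stands in for the second. All function names and the toy data are hypothetical.

```python
def aggregate(maps, weights=None):
    """Weighted mean of attention maps, each given as nested lists [H][W][T].

    Assumption: uniform weights by default; this is a stand-in, not the
    paper's auto aggregation.
    """
    if weights is None:
        weights = [1.0] * len(maps)
    total = sum(weights)
    H, W, T = len(maps[0]), len(maps[0][0]), len(maps[0][0][0])
    return [[[sum(w * m[y][x][t] for w, m in zip(weights, maps)) / total
              for t in range(T)] for x in range(W)] for y in range(H)]


def normalize_tokens(gmap, eps=1e-8):
    """Min-max normalize each token's map over all pixels.

    Assumption: a simple way to counter score imbalance among tokens;
    this is a stand-in, not the paper's per-pixel rescaling.
    """
    H, W, T = len(gmap), len(gmap[0]), len(gmap[0][0])
    out = [[[0.0] * T for _ in range(W)] for _ in range(H)]
    for t in range(T):
        vals = [gmap[y][x][t] for y in range(H) for x in range(W)]
        lo, hi = min(vals), max(vals)
        for y in range(H):
            for x in range(W):
                out[y][x][t] = (gmap[y][x][t] - lo) / (hi - lo + eps)
    return out


def argmax_labels(gmap):
    """Assign each pixel the index of its highest-scoring token."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in gmap]


# Toy 2x2 image with two tokens. Token 0 scores higher everywhere,
# but token 1 is relatively stronger in the right column.
head = [[[0.9, 0.1], [0.7, 0.3]],
        [[0.9, 0.1], [0.7, 0.3]]]
maps = [head, head]  # two identical heads, for simplicity

global_map = aggregate(maps)
raw_labels = argmax_labels(global_map)                         # [[0, 0], [0, 0]]
balanced_labels = argmax_labels(normalize_tokens(global_map))  # [[0, 1], [0, 1]]
```

In this toy case the raw argmax assigns every pixel to token 0 because its scores dominate globally, while the normalized map recovers the right column for token 1. That is the kind of token score imbalance the abstract describes.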

Comments: Accepted to CVPR 2026

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.06178 [cs.CV] (or arXiv:2603.06178v3 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.06178

arXiv-issued DOI via DataCite

Submission history

From: Benyuan Meng
[v1] Fri, 6 Mar 2026 11:35:37 UTC (6,506 KB)
[v2] Tue, 10 Mar 2026 12:51:00 UTC (6,507 KB)
[v3] Fri, 27 Mar 2026 03:50:08 UTC (6,506 KB)

Original source

arXiv

https://arxiv.org/abs/2603.06178

