Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation
Authors: Jiachen Li, Hongyun Wang, Jinyu Xu, Wenbo Jiang, Yanchun Ma, Yongjian Liu, Qing Xie, Bolong Zheng
Abstract: Referring image segmentation aims to localize and segment a target object in an image based on a free-form referring expression. The core challenge lies in effectively bridging linguistic descriptions with object-level visual representations, especially when referring expressions involve detailed attributes and complex inter-object relationships. Existing methods either rely on cross-modal alignment or employ Semantic Segmentation Prompts, but they often lack explicit reasoning mechanisms for grounding language descriptions to target regions in the image. To address these limitations, we propose PPCR, a Progressive Prompt-guided Cross-modal Reasoning framework for referring image segmentation. PPCR explicitly structures the reasoning process as a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline. Specifically, PPCR first employs multimodal large language models (MLLMs) to generate Semantic Segmentation Prompts that capture key semantic cues of the target object. Based on this semantic context, Spatial Segmentation Prompts are then generated to reason about object location and spatial extent, enabling a progressive transition from semantic understanding to spatial grounding. The Semantic and Spatial Segmentation Prompts are then jointly integrated into the segmentation module to guide accurate target localization and segmentation. Extensive experiments on standard referring image segmentation benchmarks demonstrate that PPCR consistently outperforms existing methods. The code will be publicly released to facilitate reproducibility.
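The three-stage flow described in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation, which has not been released at the time of this listing. Every name here (`SemanticPrompt`, `SpatialPrompt`, `run_pipeline`, the `toy_*` stages, the box format) is hypothetical, and the stages are trivial stand-ins for an MLLM and a segmentation module. The sketch only shows the ordering: the spatial stage consumes the semantic prompt, and the segmenter receives both prompts.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy image: rows of pixel values (assumption)
Mask = List[List[int]]           # binary mask with the same shape as the image
Box = Tuple[int, int, int, int]  # hypothetical box format: (x0, y0, x1, y1)


@dataclass
class SemanticPrompt:
    cues: str  # key semantic cues of the target object


@dataclass
class SpatialPrompt:
    box: Box  # coarse location and extent of the target object


def run_pipeline(
    image: Image,
    expression: str,
    semantic_stage: Callable[[Image, str], SemanticPrompt],
    spatial_stage: Callable[[Image, str, SemanticPrompt], SpatialPrompt],
    segmenter: Callable[[Image, SemanticPrompt, SpatialPrompt], Mask],
) -> Mask:
    # Stage 1: semantic understanding.
    semantic = semantic_stage(image, expression)
    # Stage 2: spatial grounding, conditioned on the semantic prompt.
    spatial = spatial_stage(image, expression, semantic)
    # Stage 3: segmentation guided by both prompts.
    return segmenter(image, semantic, spatial)


def toy_semantic_stage(image: Image, expression: str) -> SemanticPrompt:
    # Stand-in for an MLLM call: just normalizes the expression.
    return SemanticPrompt(cues=expression.lower())


def toy_spatial_stage(
    image: Image, expression: str, semantic: SemanticPrompt
) -> SpatialPrompt:
    # Stand-in: pretends the target occupies the left half of the image.
    height, width = len(image), len(image[0])
    return SpatialPrompt(box=(0, 0, width // 2, height))


def toy_segmenter(
    image: Image, semantic: SemanticPrompt, spatial: SpatialPrompt
) -> Mask:
    # Stand-in: marks every pixel inside the box as foreground.
    x0, y0, x1, y1 = spatial.box
    height, width = len(image), len(image[0])
    return [
        [1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    image = [[0] * 4 for _ in range(4)]
    mask = run_pipeline(
        image, "The red mug on the left",
        toy_semantic_stage, toy_spatial_stage, toy_segmenter,
    )
    for row in mask:
        print(row)
```

Passing the stages in as callables keeps the ordering explicit and would let each stand-in be swapped for a real model without changing `run_pipeline`.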
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.27993 [cs.CV]
(or arXiv:2603.27993v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.27993
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Jiachen Li
[v1] Mon, 30 Mar 2026 03:33:10 UTC (15,837 KB)