Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation

arXiv, March 31, 2026

arXiv:2603.27993v1 (announce type: new)
Authors: Jiachen Li, Hongyun Wang, Jinyu Xu, Wenbo Jiang, Yanchun Ma, Yongjian Liu, Qing Xie, Bolong Zheng


Abstract: Referring image segmentation aims to localize and segment a target object in an image based on a free-form referring expression. The core challenge lies in effectively bridging linguistic descriptions with object-level visual representations, especially when referring expressions involve detailed attributes and complex inter-object relationships. Existing methods either rely on cross-modal alignment or employ Semantic Segmentation Prompts, but they often lack explicit reasoning mechanisms for grounding language descriptions to target regions in the image. To address these limitations, we propose PPCR, a Progressive Prompt-guided Cross-modal Reasoning framework for referring image segmentation. PPCR explicitly structures the reasoning process as a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline. Specifically, PPCR first employs multimodal large language models (MLLMs) to generate Semantic Segmentation Prompts that capture key semantic cues of the target object. Based on this semantic context, Spatial Segmentation Prompts are then generated to reason about object location and spatial extent, enabling a progressive transition from semantic understanding to spatial grounding. The Semantic and Spatial Segmentation Prompts are then jointly integrated into the segmentation module to guide accurate target localization and segmentation. Extensive experiments on standard referring image segmentation benchmarks demonstrate that PPCR consistently outperforms existing methods. The code will be publicly released to facilitate reproducibility.
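The three-stage flow described in the abstract (semantic understanding, then spatial grounding conditioned on the semantic result, then segmentation guided by both prompts) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify prompt formats, model interfaces, or the segmentation module, so every name below (`refer_and_segment`, `SegmentationPrompts`, the `toy_*` stages, the label-grid "image") is a hypothetical stand-in, and simple keyword rules take the place of the MLLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy "image": a grid of integer class labels
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive
Mask = List[List[int]]

CLASS_IDS = {"cat": 1, "dog": 2}  # hypothetical label vocabulary


@dataclass
class SegmentationPrompts:
    semantic: str  # stand-in for a Semantic Segmentation Prompt
    spatial: Box   # stand-in for a Spatial Segmentation Prompt


def toy_semantic_stage(image: Image, expression: str) -> str:
    """Stage 1 stand-in: pick the class name mentioned in the expression."""
    for name in CLASS_IDS:
        if name in expression:
            return name
    raise ValueError("no known class mentioned in the expression")


def toy_spatial_stage(image: Image, expression: str, semantic: str) -> Box:
    """Stage 2 stand-in: locate the target, conditioned on the semantic prompt."""
    label = CLASS_IDS[semantic]
    width = len(image[0])
    points = [(x, y) for y, row in enumerate(image)
              for x, value in enumerate(row) if value == label]
    # Crude spatial reasoning: keep only the half named in the expression.
    if "right" in expression:
        points = [(x, y) for x, y in points if x >= width / 2]
    elif "left" in expression:
        points = [(x, y) for x, y in points if x < width / 2]
    if not points:
        raise ValueError("no candidate region matches the expression")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def toy_segmenter(image: Image, prompts: SegmentationPrompts) -> Mask:
    """Stage 3 stand-in: use both prompts jointly to produce a binary mask."""
    label = CLASS_IDS[prompts.semantic]
    x0, y0, x1, y1 = prompts.spatial
    return [[1 if (x0 <= x <= x1 and y0 <= y <= y1 and value == label) else 0
             for x, value in enumerate(row)]
            for y, row in enumerate(image)]


def refer_and_segment(
    image: Image,
    expression: str,
    semantic_stage: Callable[[Image, str], str],
    spatial_stage: Callable[[Image, str, str], Box],
    segmenter: Callable[[Image, SegmentationPrompts], Mask],
) -> Mask:
    semantic = semantic_stage(image, expression)          # semantic understanding
    spatial = spatial_stage(image, expression, semantic)  # spatial grounding
    return segmenter(image, SegmentationPrompts(semantic, spatial))


# Two "dogs" (label 2), one on each side, and one "cat" (label 1).
IMAGE = [
    [2, 0, 0, 0, 0, 0],
    [2, 0, 0, 0, 2, 2],
    [0, 0, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]

mask = refer_and_segment(IMAGE, "the dog on the right",
                         toy_semantic_stage, toy_spatial_stage, toy_segmenter)
for row in mask:
    print(row)
```

Passing the stages in as callables keeps the progressive ordering explicit: the spatial stage cannot run without the semantic output, and the segmenter receives both prompts together.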

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.27993 [cs.CV]

(or arXiv:2603.27993v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.27993

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jiachen Li
[v1] Mon, 30 Mar 2026 03:33:10 UTC (15,837 KB)

Original source: arXiv, https://arxiv.org/abs/2603.27993

