
VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation

arXiv, March 31, 2026

arXiv:2603.27060v1. Announce type: new.

Authors: Jihwan Hong, Jaeyoung Do


Abstract: Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe-based approaches that couple a vision-language model with a separate propagation module often fail to capture rapidly changing spatiotemporal dynamics and to handle queries requiring multi-step reasoning. As a result, their performance drops sharply on motion-intensive and reasoning-oriented videos that go beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single model. VIRST bridges semantic and segmentation representations through Spatio-Temporal Fusion (STF), which fuses segmentation-aware video features into the vision-language backbone, and employs the Temporal Dynamic Anchor Updater to maintain temporally adjacent anchor frames that provide stable temporal cues under large motion, occlusion, and reappearance. This unified design achieves state-of-the-art results across diverse RVOS benchmarks under realistic and challenging conditions, demonstrating strong generalization to both referring and reasoning-oriented settings. The code and checkpoints are available at this https URL.
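
The abstract does not spell out how the Temporal Dynamic Anchor Updater works, so the following is only a minimal sketch of the general idea of keeping anchor frames temporally close to the current frame while skipping frames in which the target is occluded. The class name `AnchorBuffer`, the `visible_fraction` input, and the threshold value are all hypothetical and are not taken from the paper or its code.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AnchorBuffer:
    """Keeps the most recent frames in which the target was visible.

    Hypothetical illustration only; this is NOT the paper's
    Temporal Dynamic Anchor Updater.
    """
    max_anchors: int = 3
    min_visible_fraction: float = 0.01  # assumed visibility threshold
    _anchors: deque = field(default_factory=deque)

    def update(self, frame_index: int, visible_fraction: float) -> None:
        # Skip frames where the predicted mask is (nearly) empty, so that
        # occluded frames never replace useful anchors.
        if visible_fraction < self.min_visible_fraction:
            return
        self._anchors.append(frame_index)
        # Drop the oldest anchor once the buffer is over capacity.
        if len(self._anchors) > self.max_anchors:
            self._anchors.popleft()

    def anchors(self) -> list:
        return list(self._anchors)


def run_demo() -> list:
    """Simulate a target that is visible, occluded for two frames, then reappears."""
    buf = AnchorBuffer(max_anchors=2)
    fractions = [0.20, 0.15, 0.0, 0.0, 0.10]  # made-up mask coverage per frame
    history = []
    for index, fraction in enumerate(fractions):
        buf.update(index, fraction)
        history.append(buf.anchors())
    return history
```

In this toy run the anchors stay at frames 0 and 1 throughout the occlusion and move forward to frames 1 and 4 once the target reappears, which is the kind of stable temporal cue the abstract describes. The actual update rule in VIRST may differ substantially.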

Comments: CVPR 2026

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.27060 [cs.CV]

(or arXiv:2603.27060v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.27060

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jihwan Hong
[v1] Sat, 28 Mar 2026 00:34:15 UTC (27,536 KB)
