VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation
Authors: Jihwan Hong, Jaeyoung Do
Abstract: Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe-based approaches that couple a vision-language model with a separate propagation module often fail to capture rapidly changing spatiotemporal dynamics and to handle queries requiring multi-step reasoning, leading to sharp performance drops on motion-intensive and reasoning-oriented videos beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single model. VIRST bridges semantic and segmentation representations through Spatio-Temporal Fusion (STF), which fuses segmentation-aware video features into the vision-language backbone, and employs the Temporal Dynamic Anchor Updater to maintain temporally adjacent anchor frames that provide stable temporal cues under large motion, occlusion, and reappearance. This unified design achieves state-of-the-art results across diverse RVOS benchmarks under realistic and challenging conditions, demonstrating strong generalization to both referring and reasoning-oriented settings. The code and checkpoints are available at this https URL.
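The abstract does not specify how the Temporal Dynamic Anchor Updater chooses or replaces anchor frames. As a purely illustrative sketch of the general idea of keeping anchors temporally close to the current frame, the hypothetical `AnchorBuffer` below keeps the most recent frames whose mask confidence passes a threshold. The class name, the capacity `max_anchors`, the threshold `min_confidence`, and the confidence-based rule are all assumptions made for this example and are not taken from the paper.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AnchorBuffer:
    """Hypothetical sliding set of anchor frames (not the paper's module)."""

    max_anchors: int = 3         # assumed capacity
    min_confidence: float = 0.5  # assumed threshold for accepting an anchor
    anchors: deque = field(default_factory=deque)

    def update(self, frame_idx: int, confidence: float) -> None:
        # Skip frames where the target is likely occluded or absent,
        # so earlier reliable anchors survive until it reappears.
        if confidence < self.min_confidence:
            return
        self.anchors.append(frame_idx)
        # Drop the oldest anchors so the set stays close to the current frame.
        while len(self.anchors) > self.max_anchors:
            self.anchors.popleft()

    def current(self) -> list:
        return list(self.anchors)


def run(confidences):
    """Return the anchors available when each frame is processed."""
    buf = AnchorBuffer()
    history = []
    for t, conf in enumerate(confidences):
        history.append(buf.current())
        buf.update(t, conf)
    return history


if __name__ == "__main__":
    # Frames 2 and 3 simulate an occlusion (low confidence).
    for t, anchors in enumerate(run([0.9, 0.8, 0.1, 0.2, 0.9, 0.7])):
        print(t, anchors)
```

In this toy run the anchors stay at frames 0 and 1 throughout the simulated occlusion, then slide forward once confident frames return. A real system would store features or masks rather than frame indices.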
Comments: CVPR 2026
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2603.27060 [cs.CV]
(or arXiv:2603.27060v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2603.27060
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Jihwan Hong
[v1] Sat, 28 Mar 2026 00:34:15 UTC (27,536 KB)