StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation

arXiv, March 31, 2026

Authors: Yiran Shi, Dongqi Guo, Tianchen Zhao, Feng Gao, Liangzhi Shi, Chao Yu, ZhiJian Mo, Qihua Xiao, XiaoShuai Peng, Qingmin Liao, Yu Wang

Abstract: Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, the high computational cost of VLA models poses significant efficiency challenges, particularly for resource-constrained edge platforms in real-world deployments. Moreover, because the stages of a VLA pipeline (observation, action generation, and execution) must proceed sequentially, each waiting for the preceding stage to complete, the system suffers from frequent halting and high latency. To address this, we conduct a systematic analysis to identify the challenges for fast and fluent generation, and propose enabling VLAs to asynchronously parallelize across stages in a "streaming" manner. First, we eliminate the reliance on action chunking and adopt action flow matching, which learns the trajectory of action flows rather than denoising chunk-wise actions; this overlaps the latency of action generation and execution. Second, we design an action saliency-aware adaptive observation mechanism, which overlaps the latency of execution and observation. Without sacrificing performance, StreamingVLA achieves a $2.4\times$ latency speedup and reduces execution halting by $6.5\times$, improving the fluency of execution.
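To make the latency argument in the abstract concrete, the following is a minimal timing sketch, not the authors' method. It compares a strictly sequential observe/generate/execute loop with an idealized pipeline in which the three stages overlap. The `StageTimes` class, the function names, and the stage durations used below are all hypothetical and chosen only for illustration; the paper's reported $2.4\times$ and $6.5\times$ figures come from its own experiments, not from this model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimes:
    """Hypothetical per-step stage durations in milliseconds (not from the paper)."""
    observe: float
    generate: float
    execute: float

    def total(self) -> float:
        return self.observe + self.generate + self.execute


def sequential_latency(t: StageTimes, steps: int) -> float:
    """Each stage waits for the previous one, so all durations add up."""
    return steps * t.total()


def streaming_latency(t: StageTimes, steps: int) -> float:
    """Idealized pipeline: the first step pays the full cost, after which
    one step completes every max(stage) milliseconds (the bottleneck)."""
    bottleneck = max(t.observe, t.generate, t.execute)
    return t.total() + (steps - 1) * bottleneck


def halting_time(total_latency: float, t: StageTimes, steps: int) -> float:
    """Time during which the robot is not executing any action."""
    return total_latency - steps * t.execute


# Illustrative numbers only (assumed, not measured).
times = StageTimes(observe=20.0, generate=60.0, execute=40.0)
steps = 10

seq = sequential_latency(times, steps)
stream = streaming_latency(times, steps)

print(f"sequential: {seq:.0f} ms, halting {halting_time(seq, times, steps):.0f} ms")
print(f"streaming:  {stream:.0f} ms, halting {halting_time(stream, times, steps):.0f} ms")
```

In this toy model the streaming schedule is bounded by the slowest stage rather than by the sum of all stages, which is why overlapping generation with execution, and execution with observation, reduces both total latency and idle time. A real system is more constrained than this sketch: each new observation depends on the robot's state during execution, which is the dependency the adaptive early observation mechanism is described as addressing.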

Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.28565 [cs.RO]

(or arXiv:2603.28565v1 [cs.RO] for this version)

https://doi.org/10.48550/arXiv.2603.28565

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yiran Shi
[v1] Mon, 30 Mar 2026 15:23:27 UTC (4,298 KB)

Original source

arXiv

https://arxiv.org/abs/2603.28565
