Research Papers research paper arxiv computer-vision image-recognition

3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks

arXivMarch 31, 20262 min read0 views

arXiv:2505.05800v2 Announce Type: replace-cross Abstract: Robotic manipulation in 3D requires effective computation of N degree-of-freedom joint-space trajectories that enable precise and robust control. To achieve this, robots must integrate semantic understanding with visual perception to transform real-world observations into low-level control for object interaction. Recent advances in Vision-Language-Action (VLA) models have shown promise by mapping RGB images and language instructions to task space velocities, typically trained on large datasets of teleoperated demonstrations. However, th — Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami

View PDF HTML (experimental)

Abstract:Robotic manipulation in 3D requires effective computation of N degree-of-freedom joint-space trajectories that enable precise and robust control. To achieve this, robots must integrate semantic understanding with visual perception to transform real-world observations into low-level control for object interaction. Recent advances in Vision-Language-Action (VLA) models have shown promise by mapping RGB images and language instructions to task space velocities, typically trained on large datasets of teleoperated demonstrations. However, these models often struggle with generalization beyond their training distributions. In this work, we introduce 3D-CAVLA, a novel finetuning framework that enhances task generalization of VLA policies by incorporating three key components: (i) chain-of-thought reasoning for structured decision-making, (ii) depth-aware perception for 3D spatial understanding, and (iii) task-oriented region-of-interest detection for focused manipulation. Extensive experiments in the LIBERO simulation environment demonstrate that 3D-CAVLA achieves an average success rate of 98.1% across diverse in-domain task suites. On unseen tasks, 3D-CAVLA delivers an absolute improvement of 8.8% in success rate, underscoring the benefits of 3D scene awareness for robust generalization. We validate our approach on real-world tabletop experiments demonstrating that the proposed model translates effectively from simulation to physical robots. 3D-CAVLA achieves over a 3X faster training convergence and delivers a 25% gain in success rate on unseen real world tasks. We will open-source our code and the unseen tasks dataset to promote community-driven research here: this https URL

Comments: Accepted at the 1st Workshop on 3D LLM/VLA, CVPR 2025. This work has been submitted to the IEEE for possible publication

Subjects:

Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2505.05800 [cs.RO]

(or arXiv:2505.05800v2 [cs.RO] for this version)

https://doi.org/10.48550/arXiv.2505.05800

arXiv-issued DOI via DataCite

Submission history

From: Vineet Bhat [view email] [v1] Fri, 9 May 2025 05:32:40 UTC (10,460 KB) [v2] Mon, 30 Mar 2026 16:30:59 UTC (10,373 KB)

Original source

arXiv

https://arxiv.org/abs/2505.05800

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

ModelsRecent

Exclusive | Caltech Researchers Claim Radical Compression of High-Fidelity AI Models - wsj.com

Exclusive | Caltech Researchers Claim Radical Compression of High-Fidelity AI Models wsj.com

Google News: LLM

1m2 days ago

Analyst NewsRecent

An AI-generated knowledge network of technical terms illustrates trends and reveals new ideas for research in the materials sciences. (Illustration: Thomas Marwitz, KIT) - EurekAlert!

An AI-generated knowledge network of technical terms illustrates trends and reveals new ideas for research in the materials sciences. (Illustration: Thomas Marwitz, KIT) EurekAlert!

GNews AI materials

1mabout 17 hours ago

AI ToolsFresh

AI’s fluency in other languages hides a Western worldview that can mislead users − a scholar of Indonesian society explains

Research shows that a hidden American worldview can shape AI advice in ways that are culturally misleading.

The Conversation AI

1mabout 2 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 137 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research Papers

Exclusive | OpenAI’s Former Research Chief Aims to Automate Manufacturing With AI - WSJ

Exclusive | OpenAI’s Former Research Chief Aims to Automate Manufacturing With AI WSJ

GNews AI manufacturing

1m29 days ago

Research PapersFresh

The Indirect Method for Generating Libraries of Optimal Periodic Trajectories and Its Application to Economical Bipedal Walking

arXiv:2410.09512v2 Announce Type: replace Abstract: Trajectory optimization is an essential tool for generating efficient, dynamically consistent gaits in legged locomotion. This paper explores the indirect method of trajectory optimization, emphasizing its application in creating optimal periodic gaits for legged systems and contrasting it with the more common direct method. While the direct method provides flexibility in implementation, it is limited by its need for an input space parameterization. In contrast, the indirect method improves accuracy by computing the control input from states and costates obtained along the optimal trajectory. In this work, we tackle the convergence challenges associated with indirect shooting methods by utilizing numerical continuation methods. This is pa

arXiv cs.RO

2mabout 11 hours ago

Research PapersFresh

Bistable Quad-Nets Composed of Four-Bar Linkages

arXiv:2604.00527v1 Announce Type: cross Abstract: We study mechanical structures composed of spatial four-bar linkages that are bistable, that is, they allow for two distinct configurations. They have an interpretation as quad nets in the Study quadric which can be used to prove existence of arbitrarily large structures of this type. We propose a purely geometric construction of such examples, starting from infinitesimally flexible quad nets in Euclidean space and applying Whiteley de-averaging. This point of view situates the problem within the broader framework of discrete differential geometry and enables the construction of bistable structures from well-known classes of quad nets, such as discrete minimal surfaces. The proposed construction does not rely on numerical optimization and a

arXiv cs.RO

1mabout 11 hours ago

Research PapersFresh

Implicit Primal-Dual Interior-Point Methods for Quadratic Programming

arXiv:2604.00364v1 Announce Type: cross Abstract: This paper introduces a new method for solving quadratic programs using primal-dual interior-point methods. Instead of handling complementarity as an explicit equation in the Karush-Kuhn-Tucker (KKT) conditions, we ensure that complementarity is implicitly satisfied by construction. This is achieved by introducing an auxiliary variable and relating it to the duals and slacks via a retraction map. Specifically, we prove that the softplus function has favorable numerical properties compared to the commonly used exponential map. The resulting KKT system is guaranteed to be spectrally bounded, thereby eliminating the most pressing limitation of primal-dual methods: ill-conditioning near the solution. These attributes facilitate the solution of

arXiv cs.RO

1mabout 11 hours ago