LIBERO-Para: A Diagnostic Benchmark and Metrics for Paraphrase Robustness in VLA Models

arXiv · March 31, 2026

arXiv:2603.28301v1 · Announce Type: new

Authors: Chanyoung Kim, Minwoo Kim, Minseok Kang, Hyunwoo Kim, Dahuin Jung


Abstract: Vision-Language-Action (VLA) models achieve strong performance in robotic manipulation by leveraging pre-trained vision-language backbones. However, in downstream robotic settings, they are typically fine-tuned with limited data, leading to overfitting to specific instruction formulations and leaving robustness to paraphrased instructions underexplored. To study this gap, we introduce LIBERO-Para, a controlled benchmark that independently varies action expressions and object references for fine-grained analysis of linguistic generalization. Across seven VLA configurations (0.6B-7.5B), we observe consistent performance degradation of 22-52 pp under paraphrasing. This degradation is primarily driven by object-level lexical variation: even simple synonym substitutions cause large drops, indicating reliance on surface-level matching rather than semantic grounding. Moreover, 80-96% of failures arise from planning-level trajectory divergence rather than execution errors, showing that paraphrasing disrupts task identification. Binary success rate treats all paraphrases equally, obscuring whether models perform consistently across difficulty levels or rely on easier cases. To address this, we propose PRIDE, a metric that quantifies paraphrase difficulty using semantic and syntactic factors. Our benchmark and corresponding code are available at: this https URL
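To make the two measurement ideas in the abstract concrete, here is a minimal Python sketch of (a) a success-rate drop expressed in percentage points and (b) why one pooled binary success rate can hide differences across difficulty levels. This is not the authors' code and it is not PRIDE, whose formula the abstract does not give. The function names, the "easy"/"hard" buckets, and all episode outcomes below are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def success_rate(outcomes):
    """Return the share of successful episodes as a percentage (0-100)."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one episode")
    return 100.0 * sum(outcomes) / len(outcomes)


def degradation_pp(original, paraphrased):
    """Drop in success rate, in percentage points (pp), under paraphrasing."""
    return success_rate(original) - success_rate(paraphrased)


def success_by_bucket(episodes):
    """Group (bucket, success) pairs and return a success rate per bucket."""
    grouped = defaultdict(list)
    for bucket, ok in episodes:
        grouped[bucket].append(ok)
    return {bucket: success_rate(oks) for bucket, oks in grouped.items()}


# Hypothetical outcomes: True means the episode succeeded.
original = [True] * 9 + [False] * 1

# Hypothetical paraphrased episodes, tagged with an assumed difficulty bucket.
paraphrased_episodes = (
    [("easy", True)] * 4 + [("easy", False)] * 1
    + [("hard", True)] * 1 + [("hard", False)] * 4
)
paraphrased = [ok for _, ok in paraphrased_episodes]

drop = degradation_pp(original, paraphrased)
by_bucket = success_by_bucket(paraphrased_episodes)

print(f"pooled drop: {drop:.1f} pp")
for bucket, rate in sorted(by_bucket.items()):
    print(f"{bucket}: {rate:.1f}% success")
```

In this made-up data the pooled success rate under paraphrasing sits halfway between the two buckets, so the single number alone would not reveal that most successes come from the easier paraphrases. That is the kind of gap a difficulty-aware metric is meant to expose.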

Comments: 32 pages, 28 figures

Subjects: Machine Learning (cs.LG)

Cite as: arXiv:2603.28301 [cs.LG]

(or arXiv:2603.28301v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.28301

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Chanyoung Kim
[v1] Mon, 30 Mar 2026 11:27:34 UTC (6,679 KB)
