
XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs

arXiv · March 31, 2026

arXiv:2603.28568v1, Announce Type: new

Authors: Chengyin Hu, Jiaju Han, Xuemeng Sun, Qike Zhang, Yiwei Wei, Ang Li, Chunlei Meng, Xiang Chen, Jiahuan Long


Abstract: Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). While this shared space enables strong cross-task generalization, it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. This risk is particularly important in interactive and decision-support settings, yet it remains unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks. Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.
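To make the "two intersecting diagonal lines" constraint concrete, the sketch below builds an X-shaped pixel mask and reports the share of pixels it covers. It is an illustration only, not the authors' implementation: the names `x_mask`, `sparsity`, `magnitude_penalty`, and `smoothness_penalty`, the `thickness` parameter, the square single-channel image, and the squared-difference form of both regularizers are all assumptions, since the abstract gives none of these details. The classification and cross-task semantic terms are left out because they require a VLM.

```python
def x_mask(size, thickness=1):
    """Return a size x size 0/1 mask that is 1 on the two image diagonals.

    `thickness` is a hypothetical knob: a pixel is selected when its column
    lies within `thickness` pixels of either diagonal on that row.
    """
    mask = [[0] * size for _ in range(size)]
    for row in range(size):
        for col in range(size):
            on_main = abs(col - row) < thickness
            on_anti = abs(col - (size - 1 - row)) < thickness
            if on_main or on_anti:
                mask[row][col] = 1
    return mask


def sparsity(mask):
    """Fraction of pixels the mask would allow a perturbation to modify."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def magnitude_penalty(delta, mask):
    """Assumed magnitude regularizer: squared perturbation summed over the mask."""
    return sum(
        m * d * d
        for mask_row, delta_row in zip(mask, delta)
        for m, d in zip(mask_row, delta_row)
    )


def smoothness_penalty(delta):
    """Assumed along-line smoothness: squared differences between neighbouring
    pixels on the centre line of each diagonal."""
    n = len(delta)
    main = [delta[i][i] for i in range(n)]
    anti = [delta[i][n - 1 - i] for i in range(n)]
    return sum(
        (b - a) ** 2
        for line in (main, anti)
        for a, b in zip(line, line[1:])
    )


if __name__ == "__main__":
    # 224 x 224 is a common VLM input size, used here only as an example.
    mask = x_mask(224, thickness=1)
    print(f"{sparsity(mask):.2%} of pixels lie on the X")
```

The covered fraction depends on image size and line thickness, so this sketch is not expected to reproduce the paper's reported 1.76% without knowing the default setting.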

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.28568 [cs.CV] (or arXiv:2603.28568v1 [cs.CV] for this version)

DOI: https://doi.org/10.48550/arXiv.2603.28568 (arXiv-issued DOI via DataCite, pending registration)

Submission history

From: Xiang Chen
[v1] Mon, 30 Mar 2026 15:24:34 UTC (1,362 KB)

Original source: arXiv, https://arxiv.org/abs/2603.28568



