XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs

arXiv, March 31, 2026

arXiv:2603.28568v1 (announce type: new)

Authors: Chengyin Hu, Jiaju Han, Xuemeng Sun, Qike Zhang, Yiwei Wei, Ang Li, Chunlei Meng, Xiang Chen, Jiahuan Long

Abstract: Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). While this shared space enables strong cross-task generalization, it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. This risk is particularly important in interactive and decision-support settings, yet it remains unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks. Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.
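To make the "two intersecting diagonal lines" constraint concrete, here is a minimal NumPy sketch of an X-shaped support mask and of a perturbation confined to it. It is an illustration only, not the authors' implementation: the function names `x_mask` and `apply_sparse_perturbation`, the `half_width` and `eps` parameters, the square-image assumption, and the 224-pixel size are all hypothetical. The abstract does not state the image size or line thickness behind the 1.76% figure, so the fraction printed here is not expected to match it.

```python
import numpy as np


def x_mask(size: int, half_width: int = 0) -> np.ndarray:
    """Boolean mask that is True on both diagonals of a size x size image.

    half_width (hypothetical parameter) thickens each line by that many
    pixels on either side; 0 gives one-pixel-wide diagonals.
    """
    rows, cols = np.indices((size, size))
    main = np.abs(rows - cols) <= half_width                # top-left to bottom-right
    anti = np.abs(rows + cols - (size - 1)) <= half_width   # top-right to bottom-left
    return main | anti


def apply_sparse_perturbation(image, delta, mask, eps):
    """Add delta to image only on the masked pixels.

    image: float array of shape (H, W, C) with values in [0, 1]
    delta: float array of the same shape as image
    mask:  boolean array of shape (H, W)
    eps:   per-pixel magnitude bound (an assumed L-infinity budget)
    """
    bounded = np.clip(delta, -eps, eps) * mask[..., None]   # zero outside the X
    return np.clip(image + bounded, 0.0, 1.0)               # keep a valid image


mask = x_mask(224)
print("perturbed pixels:", int(mask.sum()))
print("fraction of image:", float(mask.mean()))
```

In an actual attack, `delta` would be the variable being optimized against the model, while the mask stays fixed throughout; that fixed geometry is what distinguishes this setting from dense perturbations or freely placed patches.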

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.28568 [cs.CV] (or arXiv:2603.28568v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.28568

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Xiang Chen
[v1] Mon, 30 Mar 2026 15:24:34 UTC (1,362 KB)
