
See No Evil: Adversarial Attacks Against Linguistic-Visual Association in Referring Multi-Object Tracking Systems

arXiv, March 31, 2026

Authors: Halima Bouzidi, Haoyu Liu, Mohammad Abdullah Al Faruque


Abstract: Language-vision understanding has driven the development of advanced perception systems, most notably the emerging paradigm of Referring Multi-Object Tracking (RMOT). By leveraging natural-language queries, RMOT systems can selectively track objects that satisfy a given semantic description, guided by Transformer-based spatial-temporal reasoning modules. End-to-End (E2E) RMOT models further unify feature extraction, temporal memory, and spatial reasoning within a Transformer backbone, enabling long-range spatial-temporal modeling over fused textual-visual representations. Despite these advances, the reliability and robustness of RMOT remain underexplored. In this paper, we examine the security implications of RMOT systems from a design-logic perspective, identifying adversarial vulnerabilities that compromise both the linguistic-visual referring and track-object matching components. Additionally, we uncover a novel vulnerability in advanced RMOT models employing FIFO-based memory, whereby targeted and consistent attacks on their spatial-temporal reasoning introduce errors that persist within the history buffer over multiple subsequent frames. We present VEIL, a novel adversarial framework designed to disrupt the unified referring-matching mechanisms of RMOT models. We show that carefully crafted digital and physical perturbations can undermine the reliability of the tracking logic, inducing track ID switches and terminations. We conduct comprehensive evaluations using the Refer-KITTI dataset to validate the effectiveness of VEIL and demonstrate the urgent need for security-aware RMOT designs for critical large-scale applications.
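The persistence effect described for FIFO-based memory can be illustrated with a toy sketch. This is not VEIL and not any RMOT model's code: the function name `frames_affected`, its parameters, and the use of a boolean flag to stand in for a corrupted feature are all assumptions made for illustration. It only shows why a single corrupted entry in a fixed-length first-in, first-out buffer stays visible to later frames until it is evicted.

```python
from collections import deque


def frames_affected(buffer_len, num_frames, attacked_frame):
    """Return the frame indices whose history buffer holds the corrupted entry.

    Hypothetical toy model: each frame appends one entry to a FIFO history
    buffer, and True marks an entry corrupted by a perturbation.
    """
    # deque with maxlen evicts the oldest entry first (FIFO behaviour).
    history = deque(maxlen=buffer_len)
    affected = []
    for t in range(num_frames):
        history.append(t == attacked_frame)
        # Any reasoning step that reads the buffer at frame t sees the
        # corrupted entry for as long as it has not been evicted.
        if any(history):
            affected.append(t)
    return affected


print(frames_affected(buffer_len=4, num_frames=10, attacked_frame=2))
```

In this toy setting, one perturbed frame influences as many frames as the buffer is long; a perturbation repeated on consecutive frames would keep the buffer contaminated for correspondingly longer.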

Comments: Accepted to the NeurIPS 2025 Workshop on Reliable ML from Unreliable Data

Subjects: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)

Cite as: arXiv:2509.02028 [cs.CV]

(or arXiv:2509.02028v3 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2509.02028

arXiv-issued DOI via DataCite

Submission history

From: Halima Bouzidi
[v1] Tue, 2 Sep 2025 07:17:32 UTC (1,322 KB)
[v2] Wed, 3 Sep 2025 02:28:19 UTC (1,322 KB)
[v3] Sat, 28 Mar 2026 03:31:51 UTC (1,321 KB)
