See No Evil: Adversarial Attacks Against Linguistic-Visual Association in Referring Multi-Object Tracking Systems
arXiv:2509.02028v3 Announce Type: replace Abstract: Language-vision understanding has driven the development of advanced perception systems, most notably the emerging paradigm of Referring Multi-Object Tracking (RMOT). By leveraging natural-language queries, RMOT systems can selectively track objects that satisfy a given semantic description, guided through Transformer-based spatial-temporal reasoning modules. End-to-End (E2E) RMOT models further unify feature extraction, temporal memory, and spatial reasoning within a Transformer backbone, enabling long-range spatial-temporal modeling over fu — Halima Bouzidi, Haoyu Liu, Mohammad Abdullah Al Faruque
View PDF HTML (experimental)
Abstract:Language-vision understanding has driven the development of advanced perception systems, most notably the emerging paradigm of Referring Multi-Object Tracking (RMOT). By leveraging natural-language queries, RMOT systems can selectively track objects that satisfy a given semantic description, guided through Transformer-based spatial-temporal reasoning modules. End-to-End (E2E) RMOT models further unify feature extraction, temporal memory, and spatial reasoning within a Transformer backbone, enabling long-range spatial-temporal modeling over fused textual-visual representations. Despite these advances, the reliability and robustness of RMOT remain underexplored. In this paper, we examine the security implications of RMOT systems from a design-logic perspective, identifying adversarial vulnerabilities that compromise both the linguistic-visual referring and track-object matching components. Additionally, we uncover a novel vulnerability in advanced RMOT models employing FIFO-based memory, whereby targeted and consistent attacks on their spatial-temporal reasoning introduce errors that persist within the history buffer over multiple subsequent frames. We present VEIL, a novel adversarial framework designed to disrupt the unified referring-matching mechanisms of RMOT models. We show that carefully crafted digital and physical perturbations can corrupt the tracking logic reliability, inducing track ID switches and terminations. We conduct comprehensive evaluations using the Refer-KITTI dataset to validate the effectiveness of VEIL and demonstrate the urgent need for security-aware RMOT designs for critical large-scale applications.
Comments: Accepted to the NeurIPS 2025 Workshop on Reliable ML from Unreliable Data
Subjects:
Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
Cite as: arXiv:2509.02028 [cs.CV]
(or arXiv:2509.02028v3 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2509.02028
arXiv-issued DOI via DataCite
Submission history
From: Halima Bouzidi [view email] [v1] Tue, 2 Sep 2025 07:17:32 UTC (1,322 KB) [v2] Wed, 3 Sep 2025 02:28:19 UTC (1,322 KB) [v3] Sat, 28 Mar 2026 03:31:51 UTC (1,321 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
researchpaperarxiv
How AI-powered echolocation is giving small drones night vision
To help small aerial robots navigate in the dark and other low-visibility environments, my colleagues and I developed an ultrasound-based perception system inspired by bat echolocation. Current robots rely heavily on cameras or light detection and ranging , known as lidar, or both. But these sensors fail in visually challenging conditions, such as smoke, fog, dust, snow, or complete darkness. I’m a scientific engineer who develops bio-inspired microrobots. To solve this challenge, my research team looked at nature’s experts at navigating in poor visibility: bats. They thrive in dark, damp, and dusty caves and can detect obstacles as thin as a human hair using echolocation while weighing as little as two paper clips. They emit sound waves and listen to weak echoes reflected from objects. Ho

Inside CMU’s Push To Transform Treatment for Cancer, Organ Failure and Chronic Disease
<p> <img loading="lazy" src="https://www.cmu.edu/news/sites/default/files/styles/listings_desktop_1x_/public/2026-01/250716A_3D_Bio_Lab234.jpg.webp?itok=f-g_ECey" width="900" height="508" alt="Tissue lab"> </p> Researchers at Carnegie Mellon University are revolutionizing medical care for diseases that impact millions of Americans and the treatments they develop could alleviate major public health challenges.

The At-Home Test That Could Catch Cancer Earlier
<p> <img loading="lazy" src="https://www.cmu.edu/news/sites/default/files/styles/listings_desktop_1x_/public/2026-01/MC-200709A-Nanolab-0658.jpeg.webp?itok=tnAFG-Hk" width="900" height="508" alt="Nanotechnology laboratory"> </p> Researchers at Carnegie Mellon University are developing ways to catch cancer earlier than ever before. The project showed such promise that it was awarded up to $26.7 million in federal funding from the Advanced Research Projects Agency for Health (ARPA-H).
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Research Papers

How AI-powered echolocation is giving small drones night vision
To help small aerial robots navigate in the dark and other low-visibility environments, my colleagues and I developed an ultrasound-based perception system inspired by bat echolocation. Current robots rely heavily on cameras or light detection and ranging , known as lidar, or both. But these sensors fail in visually challenging conditions, such as smoke, fog, dust, snow, or complete darkness. I’m a scientific engineer who develops bio-inspired microrobots. To solve this challenge, my research team looked at nature’s experts at navigating in poor visibility: bats. They thrive in dark, damp, and dusty caves and can detect obstacles as thin as a human hair using echolocation while weighing as little as two paper clips. They emit sound waves and listen to weak echoes reflected from objects. Ho
"You've got a friend in me": Co-Designing a Peer Social Robot for Young Newcomers' Language and Cultural Learning
arXiv:2603.18804v3 Announce Type: replace-cross Abstract: Community literacy programs supporting young newcomer children in Canada face limited staffing and scarce one-to-one time, which constrains personalized English and cultural learning support. This paper reports on a co-design study with United for Literacy tutors that informed Maple, a table-top, peer-like Socially Assistive Robot (SAR) designed as a practice partner within tutor-mediated sessions. From shadowing and co-design interviews, we derived newcomer-specific requirements and added them in an integrated prototype that uses short story-based activities, multi-modal scaffolding and embedded quizzes that support attention while producing tutor-actionable formative signals. We contribute system design implications for tutor-in-t
Exploring Sidewalk Sheds in New York City through Chatbot Surveys and Human Computer Interaction
arXiv:2601.23095v2 Announce Type: replace Abstract: Sidewalk sheds are a common feature of the streetscape in New York City, reflecting ongoing construction and maintenance activities. However, policymakers and local business owners have raised concerns about reduced storefront visibility and altered pedestrian navigation. Although sidewalk sheds are widely used for safety, their effects on pedestrian visibility and movement are not directly measured in current planning practices. To address this, we developed an AI-based chatbot survey that collects image-based annotations and route choices from pedestrians, linking these responses to specific shed design features, including clearance height, post spacing, and color. This AI chatbot survey integrates a large language model (e.g., Google's
Structured identification of multivariable modal systems
arXiv:2510.10820v2 Announce Type: replace-cross Abstract: Physically interpretable models are essential for next-generation industrial systems, as these representations enable effective control, support design validation, and provide a foundation for monitoring strategies. The aim of this paper is to develop a system identification framework for estimating modal models of complex multivariable mechanical systems from frequency response data. To achieve this, a two-step structured identification algorithm is presented, where an additive model is first estimated using a refined instrumental variable method and subsequently projected onto a modal form. The developed identification method provides accurate, physically-relevant, minimal-order models, for both generally-damped and proportionally

Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!