
Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning

arXiv · March 31, 2026 · 2 min read

arXiv:2603.24257v2 Announce Type: replace

Authors: Tommaso Galliena, Stefano Rosa, Tommaso Apicella, Pietro Morerio, Alessio Del Bue, Lorenzo Natale


Abstract: Vision-Language Models (VLMs) often yield inconsistent descriptions of the same object across viewpoints, hindering the ability of embodied agents to construct consistent semantic representations over time. Previous methods resolved inconsistencies using offline multi-view aggregation or multi-stage pipelines that decouple exploration, data association, and caption learning, with limited capacity to reason over previously observed objects. In this paper, we introduce a unified, memory-augmented Vision-Language agent that simultaneously handles data association, object captioning, and exploration policy within a single autoregressive framework. The model processes the current RGB observation, a top-down explored map, and an object-level episodic memory serialized into object-level tokens, ensuring persistent object identity and semantic consistency across extended sequences. To train the model in a self-supervised manner, we collect a dataset in photorealistic 3D environments using a disagreement-based policy and a pseudo-captioning model that enforces consistency across multi-view caption histories. Extensive evaluation on a manually annotated object-level test set demonstrates improvements of up to +11.86% in standard captioning scores and +7.39% in caption self-similarity over baseline models, while enabling scalable performance through a compact scene representation. Code, model weights, and data are available at this https URL.
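The idea of an object-level episodic memory serialized into object-level tokens can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ObjectEntry` and `EpisodicMemory` names, the `<obj ...>` text format, and the choice to expose only the latest caption are hypothetical and are not taken from the paper or its released code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ObjectEntry:
    """One remembered object (hypothetical structure, not the paper's)."""
    object_id: int
    position: tuple[float, float]  # assumed top-down map coordinates
    captions: list[str] = field(default_factory=list)  # caption history


class EpisodicMemory:
    """Toy object-level memory keyed by a persistent object id."""

    def __init__(self) -> None:
        self.entries: dict[int, ObjectEntry] = {}

    def update(self, object_id: int, position: tuple[float, float], caption: str) -> None:
        # Re-observing a known id extends its history instead of creating a new object.
        entry = self.entries.setdefault(object_id, ObjectEntry(object_id, position))
        entry.position = position
        entry.captions.append(caption)

    def serialize(self) -> str:
        # One line per object, in id order, so the text is stable across steps.
        lines = []
        for oid in sorted(self.entries):
            e = self.entries[oid]
            x, y = e.position
            lines.append(f"<obj id={oid} x={x:.1f} y={y:.1f}> {e.captions[-1]} </obj>")
        return "\n".join(lines)


memory = EpisodicMemory()
memory.update(1, (0.5, 2.0), "a red chair")
memory.update(0, (3.0, 1.0), "a wooden table")
memory.update(1, (0.5, 2.0), "a red armchair")  # same object, new viewpoint
print(memory.serialize())
```

In this sketch the serialized string would be placed alongside the current observation in the model input, so the agent can refer back to objects it has already seen. How the actual model tokenizes its memory is not specified in the abstract.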

Comments: 24 pages, 7 figures, 7 tables (including Supplementary Materials)

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.24257 [cs.CV]

(or arXiv:2603.24257v2 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.24257

arXiv-issued DOI via DataCite

Submission history

From: Tommaso Galliena
[v1] Wed, 25 Mar 2026 12:52:32 UTC (5,798 KB)
[v2] Mon, 30 Mar 2026 09:01:07 UTC (5,798 KB)
