
One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework

arXiv, March 30, 2026

arXiv:2510.02898v4 (announce type: replace)

Authors: Lorenzo Bianchi, Giacomo Pacini, Fabio Carrara, Nicola Messina, Giuseppe Amato, Fabrizio Falchi


Abstract: Zero-shot captioners are recently proposed models that use common-space vision-language representations to caption images without relying on paired image-text data. To caption an image, they textually decode a text-aligned image feature, but they limit their scope to global representations and whole-image captions. We present a unified framework for zero-shot captioning that shifts from an image-centric to a patch-centric paradigm, enabling the captioning of arbitrary regions without the need for region-level supervision. Instead of relying on global image representations, we treat individual patches as atomic captioning units and aggregate them to describe arbitrary regions, from single patches to non-contiguous areas and entire images. We analyze the key ingredients that enable current latent captioners to work in our proposed framework. Experiments demonstrate that backbones producing meaningful, dense visual features, such as DINO, are key to achieving state-of-the-art performance in multiple region-based captioning tasks. Compared to other baselines and state-of-the-art competitors, our models achieve better performance on zero-shot dense captioning and region-set captioning. We also introduce a new trace captioning task that further demonstrates the effectiveness of patch-wise semantic representations for flexible caption generation. Project page at this https URL.
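The patch-centric idea in the abstract (patches as atomic units, aggregated to describe a single patch, a non-contiguous area, or the whole image) can be illustrated with a minimal sketch. The abstract does not say how patches are aggregated, so the mean pooling used here is an assumption, and the function name `aggregate_patches`, the toy 2x2 grid, and its 2-dimensional features are all hypothetical. The text decoder that would turn the pooled feature into a caption is not shown.

```python
from typing import Iterable, List, Tuple

Vector = List[float]
PatchGrid = List[List[Vector]]  # grid[row][col] -> feature vector of that patch


def aggregate_patches(grid: PatchGrid, region: Iterable[Tuple[int, int]]) -> Vector:
    """Pool the features of the selected patches into one region feature.

    Assumption: plain mean pooling. The source abstract only says that
    patches are aggregated; it does not specify the aggregation rule.
    """
    cells = sorted(set(region))  # drop duplicate coordinates
    if not cells:
        raise ValueError("region must contain at least one patch")
    dim = len(grid[0][0])
    total = [0.0] * dim
    for row, col in cells:
        for i, value in enumerate(grid[row][col]):
            total[i] += value
    return [value / len(cells) for value in total]


# Hypothetical 2x2 patch grid with 2-dimensional features.
grid: PatchGrid = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]

single_patch = aggregate_patches(grid, [(0, 1)])
non_contiguous = aggregate_patches(grid, [(0, 0), (1, 1)])
whole_image = aggregate_patches(grid, [(r, c) for r in range(2) for c in range(2)])
```

The same function covers all three region types named in the abstract, which is the point of treating the patch, not the image, as the unit: only the set of selected coordinates changes.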

Comments: IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026. Project page with code, models and examples: this https URL

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2510.02898 [cs.CV] (or arXiv:2510.02898v4 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2510.02898

arXiv-issued DOI via DataCite

Submission history

From: Lorenzo Bianchi
[v1] Fri, 3 Oct 2025 11:05:56 UTC (14,059 KB)
[v2] Mon, 6 Oct 2025 08:43:27 UTC (14,059 KB)
[v3] Wed, 25 Mar 2026 09:26:00 UTC (15,704 KB)
[v4] Fri, 27 Mar 2026 10:43:17 UTC (15,704 KB)
