Live
Black Hat USADark ReadingBlack Hat AsiaAI BusinessDryft: What if AI memory worked like an ecosystem instead of a filing cabinet?DEV CommunityWeb Scraping Tools Comparison 2026: requests vs curl_cffi vs Playwright vs ScrapyDEV CommunityQualcomm Joins Korea's 'Challenge AX' Program to Support AI Startups - thelec.netGNews AI KoreaAI Is Turning Film Pitches into Proof—But Korea’s Financing Model Still Lags - KoreaTechDeskGNews AI KoreaFrom Next.js to Pareto: What Changes and What Stays the SameDEV CommunityA Quick Note on Gemma 4 Image Settings in Llama.cppDEV CommunityDoes consciousness and suffering even matter: LLMs and moral relevancelesswrong.comHow to Parse HL7 Messages with AI — Free MCP ServerDEV CommunityGHSA-QCC3-JQWP-5VH2: GHSA-qcc3-jqwp-5vh2: Unauthenticated Resource Exhaustion via LINE Webhook Handler in OpenClawDEV CommunityHow to Hyper-Personalization in Action: From Story Angle to Ranked Media List in MinutesDEV CommunityHow to Scrape DoorDash, Uber Eats, and Grubhub Menu Data in 2026DEV CommunityReverse Engineering Cloudflare's React-Based Bot Detection in 2026DEV CommunityBlack Hat USADark ReadingBlack Hat AsiaAI BusinessDryft: What if AI memory worked like an ecosystem instead of a filing cabinet?DEV CommunityWeb Scraping Tools Comparison 2026: requests vs curl_cffi vs Playwright vs ScrapyDEV CommunityQualcomm Joins Korea's 'Challenge AX' Program to Support AI Startups - thelec.netGNews AI KoreaAI Is Turning Film Pitches into Proof—But Korea’s Financing Model Still Lags - KoreaTechDeskGNews AI KoreaFrom Next.js to Pareto: What Changes and What Stays the SameDEV CommunityA Quick Note on Gemma 4 Image Settings in Llama.cppDEV CommunityDoes consciousness and suffering even matter: LLMs and moral relevancelesswrong.comHow to Parse HL7 Messages with AI — Free MCP ServerDEV CommunityGHSA-QCC3-JQWP-5VH2: GHSA-qcc3-jqwp-5vh2: Unauthenticated Resource Exhaustion via LINE Webhook Handler in OpenClawDEV CommunityHow to Hyper-Personalization in Action: From Story Angle to Ranked Media List in MinutesDEV CommunityHow to Scrape DoorDash, Uber Eats, and Grubhub Menu Data in 2026DEV CommunityReverse Engineering Cloudflare's React-Based Bot Detection in 2026DEV Community
AI NEWS HUBbyEIGENVECTOREigenvector

Finding Distributed Object-Centric Properties in Self-Supervised Transformers

arXivMarch 30, 202610 min read0 views
Source Quiz

arXiv:2603.26127v1 Announce Type: cross Abstract: Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similari — Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja

View PDF HTML (experimental)

Abstract:Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in [CLS] token attention maps of the final layer. However, these maps often contain spurious activations resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information existing in the local, patch-level interactions. We analyze this by computing inter-patch similarity using patch-level attention components (query, key, and value) across all layers. We find that: (1) Object-centric properties are encoded in the similarity maps derived from all three components ($q, k, v$), unlike prior work that uses only key features or the [CLS] token. (2) This object-centric information is distributed across the network, not just confined to the final layer. Based on these insights, we introduce Object-DINO, a training-free method that extracts this distributed object-centric information. Object-DINO clusters attention heads across all layers based on the similarities of their patches and automatically identifies the object-centric cluster corresponding to all objects. We demonstrate Object-DINO's effectiveness on two applications: enhancing unsupervised object discovery (+3.6 to +12.4 CorLoc gains) and mitigating object hallucination in Multimodal Large Language Models by providing visual grounding. Our results demonstrate that using this distributed object-centric information improves downstream tasks without additional training.

Comments: Computer Vision and Pattern Recognition (CVPR) 2026

Subjects:

Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)

Cite as: arXiv:2603.26127 [cs.CV]

(or arXiv:2603.26127v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.26127

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Samyak Rawlekar [view email] [v1] Fri, 27 Mar 2026 07:22:04 UTC (36,829 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

Knowledge Map

Knowledge Map
TopicsEntitiesSource
Finding Dis…researchpaperarxivaiartificial-…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 143 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Research Papers