PowerCLIP: Powerset Alignment for Contrastive Pre-Training

arXiv · March 31, 2026 · 2 min read

arXiv:2511.23170v4 (announce type: replace)
Authors: Masaki Kawamura, Nakamasa Inoue, Rintaro Yanagi, Hirokatsu Kataoka, Rio Yokota

Abstract: Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. Since the naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Code is available at this https URL.
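The abstract does not give the form of the NLAs, so the sketch below is not the paper's aggregator. It only illustrates, on a toy quantity chosen for this note, how a sum over all 2^M - 1 non-empty region subsets can sometimes be computed in O(M). The assumption (ours, not the authors') is that each subset S is scored by the sum of hypothetical per-region scores s_i, in which case the sum over S of exp(sum of s_i for i in S) factorizes as the product over i of (1 + exp(s_i)), minus 1 for the empty subset. The function names and the scores are illustrative only.

```python
import itertools
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def brute_force_subset_logsumexp(scores):
    """log of the sum over all non-empty subsets S of exp(sum of scores in S).

    Enumerates every subset explicitly, so the cost is O(2^M).
    """
    total = 0.0
    for size in range(1, len(scores) + 1):
        for subset in itertools.combinations(scores, size):
            total += math.exp(sum(subset))
    return math.log(total)


def factorized_subset_logsumexp(scores):
    """Same quantity in O(M), using prod_i (1 + exp(s_i)) - 1.

    log_prod is the log of the product over all regions, which counts the
    empty subset once (it contributes exp(0) = 1); the log1p term removes it.
    Assumes the product is not so close to 1 that it rounds to exactly 1.0.
    """
    log_prod = sum(softplus(s) for s in scores)
    return log_prod + math.log1p(-math.exp(-log_prod))


# Hypothetical per-region scores for M = 4 regions.
region_scores = [0.3, -1.2, 0.8, 0.1]
exact = brute_force_subset_logsumexp(region_scores)
fast = factorized_subset_logsumexp(region_scores)
print(abs(exact - fast) < 1e-9)
```

The two functions agree up to floating-point error, but the second touches each region once. This identity is exact only for the additive subset score assumed here; the abstract states that the NLAs approximate the exact loss to arbitrary precision, which suggests a more general construction.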

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2511.23170 [cs.CV]

(or arXiv:2511.23170v4 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2511.23170

arXiv-issued DOI via DataCite

Submission history

From: Masaki Kawamura
[v1] Fri, 28 Nov 2025 13:28:18 UTC (4,668 KB)
[v2] Mon, 1 Dec 2025 07:34:37 UTC (4,668 KB)
[v3] Thu, 5 Mar 2026 09:44:46 UTC (4,844 KB)
[v4] Sat, 28 Mar 2026 06:36:30 UTC (4,849 KB)
