Research Papers research paper arxiv computer-vision image-recognition

PowerCLIP: Powerset Alignment for Contrastive Pre-Training

arXivMarch 31, 20262 min read0 views

arXiv:2511.23170v4 Announce Type: replace Abstract: Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaus — Masaki Kawamura, Nakamasa Inoue, Rintaro Yanagi, Hirokatsu Kataoka, Rio Yokota

View PDF HTML (experimental)

Abstract:Contrastive vision-language pre-training frameworks such as CLIP have demonstrated impressive zero-shot performance across a range of vision-language tasks. Recent studies have shown that aligning individual text tokens with specific image patches or regions enhances fine-grained compositional understanding. However, it remains challenging to capture compositional semantics that span multiple image regions. To address this limitation, we propose PowerCLIP, a novel contrastive pre-training framework enhanced by powerset alignment, which exhaustively optimizes region-to-phrase alignments by minimizing the loss defined between powersets of image regions and textual parse trees. Since the naive powerset construction incurs exponential computational cost due to the combinatorial explosion in the number of region subsets, we introduce efficient non-linear aggregators (NLAs) that reduce complexity from O(2^M) to O(M) with respect to the number of regions M, while approximating the exact loss value with arbitrary precision. Our extensive experiments demonstrate that PowerCLIP outperforms state-of-the-art methods in zero-shot classification and retrieval tasks, underscoring the compositionality and robustness of our approach. Code is available at this https URL.

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2511.23170 [cs.CV]

(or arXiv:2511.23170v4 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2511.23170

arXiv-issued DOI via DataCite

Submission history

From: Masaki Kawamura [view email] [v1] Fri, 28 Nov 2025 13:28:18 UTC (4,668 KB) [v2] Mon, 1 Dec 2025 07:34:37 UTC (4,668 KB) [v3] Thu, 5 Mar 2026 09:44:46 UTC (4,844 KB) [v4] Sat, 28 Mar 2026 06:36:30 UTC (4,849 KB)

Original source

arXiv

https://arxiv.org/abs/2511.23170

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

ProductsLive

From Elevation Maps To Contour Lines: SVM and Decision Trees to Detect Violin Width Reduction

arXiv:2604.02446v1 Announce Type: new Abstract: We explore the automatic detection of violin width reduction using 3D photogrammetric meshes. We compare SVM and Decision Trees applied to a geometry-based raw representation built from elevation maps with a more targeted, feature-engineered approach relying on parametric contour lines fitting. Although elevation maps occasionally achieve strong results, their performance does not surpass that of the contour-based inputs.

arXiv cs.CV

1mabout 1 hour ago

ModelsLive

Beyond Fixed Inference: Quantitative Flow Matching for Adaptive Image Denoising

arXiv:2604.02392v1 Announce Type: new Abstract: Diffusion and flow-based generative models have shown strong potential for image restoration. However, image denoising under unknown and varying noise conditions remains challenging, because the learned vector fields may become inconsistent across different noise levels, leading to degraded restoration quality under mismatch between training and inference. To address this issue, we propose a quantitative flow matching framework for adaptive image denoising. The method first estimates the input noise level from local pixel statistics, and then uses this quantitative estimate to adapt the inference trajectory, including the starting point, the number of integration steps, and the step-size schedule. In this way, the denoising process is better

arXiv cs.CV

1mabout 1 hour ago

ModelsLive

Dependency-Guided Parallel Decoding in Discrete Diffusion Language Models

arXiv:2604.02560v1 Announce Type: new Abstract: Discrete diffusion language models (dLLMs) accelerate text generation by unmasking multiple tokens in parallel. However, parallel decoding introduces a distributional mismatch: it approximates the joint conditional using a fully factorized product of per-token marginals, which degrades output quality when selected tokens are strongly dependent. We propose DEMASK (DEpendency-guided unMASKing), a lightweight dependency predictor that attaches to the final hidden states of a dLLM. In a single forward pass, it estimates pairwise conditional influences between masked positions. Using these predictions, a greedy selection algorithm identifies positions with bounded cumulative dependency for simultaneous unmasking. Under a sub-additivity assumption,

arXiv cs.CL

1mabout 1 hour ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 256 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research PapersLive

Pragmatics Meets Culture: Culturally-adapted Artwork Description Generation and Evaluation

arXiv:2604.02557v1 Announce Type: new Abstract: Language models are known to exhibit various forms of cultural bias in decision-making tasks, yet much less is known about their degree of cultural familiarity in open-ended text generation tasks. In this paper, we introduce the task of culturally-adapted art description generation, where models describe artworks for audiences from different cultural groups who vary in their familiarity with the cultural symbols and narratives embedded in the artwork. To evaluate cultural competence in this pragmatic generation task, we propose a framework based on culturally grounded question answering. We find that base models are only marginally adequate for this task, but, through a pragmatic speaker model, we can improve simulated listener comprehension

arXiv cs.CL

1mabout 1 hour ago

Research PapersLive

Skeleton-based Coherence Modeling in Narratives

arXiv:2604.02451v1 Announce Type: new Abstract: Modeling coherence in text has been a task that has excited NLP researchers since a long time. It has applications in detecting incoherent structures and helping the author fix them. There has been recent work in using neural networks to extract a skeleton from one sentence, and then use that skeleton to generate the next sentence for coherent narrative story generation. In this project, we aim to study if the consistency of skeletons across subsequent sentences is a good metric to characterize the coherence of a given body of text. We propose a new Sentence/Skeleton Similarity Network (SSN) for modeling coherence across pairs of sentences, and show that this network performs much better than baseline similarity techniques like cosine similar

arXiv cs.CL

1mabout 1 hour ago

Research PapersLive

Lipschitz bounds for integral kernels

arXiv:2604.02887v1 Announce Type: new Abstract: Feature maps associated with positive definite kernels play a central role in kernel methods and learning theory, where regularity properties such as Lipschitz continuity are closely related to robustness and stability guarantees. Despite their importance, explicit characterizations of the Lipschitz constant of kernel feature maps are available only in a limited number of cases. In this paper, we study the Lipschitz regularity of feature maps associated with integral kernels under differentiability assumptions. We first provide sufficient conditions ensuring Lipschitz continuity and derive explicit formulas for the corresponding Lipschitz constants. We then identify a condition under which the feature map fails to be Lipschitz continuous and

arXiv stat.ML

2mabout 1 hour ago

Research PapersLive

State estimations and noise identifications with intermittent corrupted observations via Bayesian variational inference

arXiv:2604.02738v1 Announce Type: new Abstract: This paper focuses on the state estimation problem in distributed sensor networks, where intermittent packet dropouts, corrupted observations, and unknown noise covariances coexist. To tackle this challenge, we formulate the joint estimation of system states, noise parameters, and network reliability as a Bayesian variational inference problem, and propose a novel variational Bayesian adaptive Kalman filter (VB-AKF) to approximate the joint posterior probability densities of the latent parameters. Unlike existing AKF that separately handle missing data and measurement outliers, the proposed VB-AKF adopts a dual-mask generative model with two independent Bernoulli random variables, explicitly characterizing both observable communication losses

arXiv stat.ML

1mabout 1 hour ago