CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
Authors: Marc-Antoine Lavoie, Anas Mahmoud, Aldo Zaimi, Arsene Fansi Tchango, Steven L. Waslander
Abstract: CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.
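The three caption-side steps named in the abstract (dropping the summary sentence, sub-sampling sentences, padding text tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how sentences are sampled or where padding is placed, so the `keep_prob` sampling rule, the random left offset, the `<pad>` token, the whitespace tokenizer, and the function names `split_sentences` and `debias_caption` are all assumptions made for illustration.

```python
import random
import re

PAD = "<pad>"  # hypothetical padding token; a real CLIP tokenizer uses integer ids


def split_sentences(caption):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p.strip() for p in parts if p.strip()]


def debias_caption(caption, context_len=32, keep_prob=0.5, rng=None):
    """Return a list of exactly `context_len` tokens built from a long caption.

    Assumed procedure (illustrative only):
      1. drop the opening summary sentence when detail sentences follow it,
      2. keep each remaining sentence with probability `keep_prob`
         (at least one sentence is always kept, original order preserved),
      3. shift the kept tokens by a random left offset and pad both sides,
         so that content lands at varying token positions.
    """
    if rng is None:
        rng = random.Random()

    sentences = split_sentences(caption)
    if not sentences:
        return [PAD] * context_len

    # Step 1: remove the summary sentence (kept only if it is the sole sentence).
    details = sentences[1:] if len(sentences) > 1 else sentences

    # Step 2: sentence sub-sampling.
    kept = [s for s in details if rng.random() < keep_prob]
    if not kept:
        kept = [rng.choice(details)]

    # Whitespace tokenization stands in for a real subword tokenizer.
    tokens = " ".join(kept).split()[:context_len]

    # Step 3: random left offset, then right-pad to the full context length.
    offset = rng.randint(0, context_len - len(tokens))
    right = context_len - offset - len(tokens)
    return [PAD] * offset + tokens + [PAD] * right


if __name__ == "__main__":
    caption = (
        "A dog plays in a park. The dog is brown with a red collar. "
        "Tall oak trees line the path. A child watches from a bench."
    )
    print(debias_caption(caption, rng=random.Random(0)))
```

In an actual fine-tuning pipeline, a transform of this kind would run per sample before the text encoder, with the model's own tokenizer and padding id in place of the stand-ins used here.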
Comments: 20 pages, 15 figures, to be published in the CVPR 2026 proceedings
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2602.22419 [cs.CV]
(or arXiv:2602.22419v2 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2602.22419
arXiv-issued DOI via DataCite
Submission history
From: Marc-Antoine Lavoie
[v1] Wed, 25 Feb 2026 21:27:00 UTC (17,301 KB)
[v2] Sat, 28 Mar 2026 14:15:06 UTC (17,007 KB)