Research Papers research paper arxiv computer-vision image-recognition

CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

arXivMarch 31, 20262 min read1 views

arXiv:2602.22419v2 Announce Type: replace Abstract: CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small- — Marc-Antoine Lavoie, Anas Mahmoud, Aldo Zaimi, Arsene Fansi Tchango, Steven L. Waslander

View PDF HTML (experimental)

Abstract:CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.

Comments: 20 pages, 15 figures, to be published in the CVPR 2026 proceedings

Subjects:

Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2602.22419 [cs.CV]

(or arXiv:2602.22419v2 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.22419

arXiv-issued DOI via DataCite

Submission history

From: Marc-Antoine Lavoie [view email] [v1] Wed, 25 Feb 2026 21:27:00 UTC (17,301 KB) [v2] Sat, 28 Mar 2026 14:15:06 UTC (17,007 KB)

Original source

arXiv

https://arxiv.org/abs/2602.22419

Was this article helpful?

Ask AI about this article

Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Research PapersFresh

Quantum computers might crack today's encryption far sooner than we thought

According to a study by engineers at Caltech and the UC Department of Physics, quantum computers do not need to be nearly as powerful as previously believed to crack the most advanced cryptographic technologies. The research claims that Shor's algorithm could break RSA public-key encryption using quantum computers with just... Read Entire Article

TechSpot

1mabout 3 hours ago

ModelsRecent

Exclusive | Caltech Researchers Claim Radical Compression of High-Fidelity AI Models - wsj.com

Exclusive | Caltech Researchers Claim Radical Compression of High-Fidelity AI Models wsj.com

Google News: LLM

1m2 days ago

Analyst NewsRecent

An AI-generated knowledge network of technical terms illustrates trends and reveals new ideas for research in the materials sciences. (Illustration: Thomas Marwitz, KIT) - EurekAlert!

An AI-generated knowledge network of technical terms illustrates trends and reveals new ideas for research in the materials sciences. (Illustration: Thomas Marwitz, KIT) EurekAlert!

GNews AI materials

1mabout 18 hours ago

Knowledge Map

TopicsEntitiesSource

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 141 connections

Scroll to zoom · drag to pan · click to open

Discussion

No comments yet — be the first to share your thoughts!

More in Research Papers

Research PapersFresh

Quantum computers might crack today's encryption far sooner than we thought

TechSpot

1mabout 3 hours ago

Research Papers

Exclusive | OpenAI’s Former Research Chief Aims to Automate Manufacturing With AI - WSJ

Exclusive | OpenAI’s Former Research Chief Aims to Automate Manufacturing With AI WSJ

GNews AI manufacturing

1m29 days ago

Research PapersFresh

The Indirect Method for Generating Libraries of Optimal Periodic Trajectories and Its Application to Economical Bipedal Walking

arXiv:2410.09512v2 Announce Type: replace Abstract: Trajectory optimization is an essential tool for generating efficient, dynamically consistent gaits in legged locomotion. This paper explores the indirect method of trajectory optimization, emphasizing its application in creating optimal periodic gaits for legged systems and contrasting it with the more common direct method. While the direct method provides flexibility in implementation, it is limited by its need for an input space parameterization. In contrast, the indirect method improves accuracy by computing the control input from states and costates obtained along the optimal trajectory. In this work, we tackle the convergence challenges associated with indirect shooting methods by utilizing numerical continuation methods. This is pa

arXiv cs.RO

2mabout 11 hours ago

Research PapersFresh

Bistable Quad-Nets Composed of Four-Bar Linkages

arXiv:2604.00527v1 Announce Type: cross Abstract: We study mechanical structures composed of spatial four-bar linkages that are bistable, that is, they allow for two distinct configurations. They have an interpretation as quad nets in the Study quadric which can be used to prove existence of arbitrarily large structures of this type. We propose a purely geometric construction of such examples, starting from infinitesimally flexible quad nets in Euclidean space and applying Whiteley de-averaging. This point of view situates the problem within the broader framework of discrete differential geometry and enables the construction of bistable structures from well-known classes of quad nets, such as discrete minimal surfaces. The proposed construction does not rely on numerical optimization and a

arXiv cs.RO

1mabout 11 hours ago