
CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

arXiv, March 31, 2026

Authors: Marc-Antoine Lavoie, Anas Mahmoud, Aldo Zaimi, Arsene Fansi Tchango, Steven L. Waslander

Abstract: CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.
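The three caption-side steps named in the abstract (dropping the summary sentence, sub-sampling sentences, padding text tokens) can be illustrated with a small sketch. This is only one possible reading of the abstract, not the authors' implementation: the function names `split_sentences` and `debias_caption`, the `<pad>` token, the regex sentence splitter, whitespace tokenization, and the random left-padding scheme are all assumptions made for illustration.

```python
import random
import re

PAD = "<pad>"  # hypothetical padding token, not taken from the paper


def split_sentences(caption):
    """Naive sentence splitter (an assumption): break after ., ! or ?"""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def debias_caption(caption, keep=2, context_len=16, rng=None):
    """Sketch of caption preprocessing under the assumptions stated above."""
    rng = rng or random.Random()
    sentences = split_sentences(caption)

    # 1. Remove the opening summary sentence when other sentences remain.
    body = sentences[1:] if len(sentences) > 1 else sentences

    # 2. Sub-sample up to `keep` of the remaining sentences, preserving order.
    k = min(keep, len(body))
    chosen = sorted(rng.sample(range(len(body)), k))
    tokens = " ".join(body[i] for i in chosen).split()[:context_len]

    # 3. Pad to a fixed length with a random left offset, so the text
    #    does not always occupy the earliest token positions.
    free = context_len - len(tokens)
    left = rng.randint(0, free)
    return [PAD] * left + tokens + [PAD] * (free - left)


caption = (
    "A dog in a park. The dog is brown. "
    "It holds a red ball. Trees line the path."
)
example = debias_caption(caption, keep=2, context_len=16, rng=random.Random(0))
```

A real pipeline would use the model's own tokenizer and context length in place of whitespace splitting and the illustrative `context_len=16`.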

Comments: 20 pages, 15 figures, to be published in the CVPR 2026 proceedings

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2602.22419 [cs.CV]

(or arXiv:2602.22419v2 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2602.22419

arXiv-issued DOI via DataCite

Submission history

From: Marc-Antoine Lavoie
[v1] Wed, 25 Feb 2026 21:27:00 UTC (17,301 KB)
[v2] Sat, 28 Mar 2026 14:15:06 UTC (17,007 KB)
