
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

HuggingFace Papers · April 3, 2026 · 8 min read

A vision-language model fusion framework combines contrastive and self-supervised visual encoders using entropy-guided aggregation and RoPE-enhanced attention to improve visual understanding and grounding tasks.

Published on Apr 3


Abstract

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art performance on RefCOCO for detection while improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.
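
To make the two fusion mechanisms concrete, here is a minimal PyTorch sketch of the ideas the abstract describes. It is an illustrative reconstruction, not the authors' code: the entropy heuristic (weighting encoder layers by the entropy of their token-magnitude distribution), the soft orthogonality penalty, the single-head attention, and the shared coordinate range for RoPE are all assumptions made for the sketch.

```python
# Hypothetical sketch of CoME-VL-style fusion (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def entropy_guided_aggregate(layer_feats):
    """Fuse hidden states from several encoder layers, weighting each
    layer by the negative entropy of its token-magnitude distribution:
    layers with more peaked token responses get higher weight.
    layer_feats: list of [B, N, D] tensors."""
    stacked = torch.stack(layer_feats)                      # [L, B, N, D]
    token_dist = F.softmax(stacked.abs().mean(-1), dim=-1)  # [L, B, N]
    ent = -(token_dist * token_dist.clamp_min(1e-9).log()).sum(-1)  # [L, B]
    w = F.softmax(-ent, dim=0)                              # layer weights
    return (w[..., None, None] * stacked).sum(0)            # [B, N, D]


class OrthogonalProjection(nn.Module):
    """Linear map with a soft penalty ||W Wt - I||^2, a common way to
    realize 'orthogonality-constrained projections' that reduce
    redundancy between fused features."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        return self.proj(x)

    def ortho_penalty(self):
        W = self.proj.weight
        eye = torch.eye(W.size(0), device=W.device)
        return ((W @ W.t() - eye) ** 2).mean()  # add to the training loss


def rope_1d(x, pos):
    """Rotary position embedding over the last dimension (D must be even).
    x: [B, N, D]; pos: [N] positions in a shared coordinate range."""
    d = x.size(-1)
    freqs = 10000.0 ** (-torch.arange(0, d, 2, device=x.device) / d)
    ang = pos[:, None] * freqs[None, :]                     # [N, D/2]
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos), dim=-1).flatten(-2)


class RoPECrossAttention(nn.Module):
    """Single-head cross-attention from one encoder's tokens (queries) to
    the other's (keys/values). Both token grids are mapped onto the same
    coordinate range before RoPE, so encoders with different token counts
    share a positional frame."""
    def __init__(self, d):
        super().__init__()
        self.q_proj, self.k_proj = nn.Linear(d, d), nn.Linear(d, d)
        self.v_proj, self.out = nn.Linear(d, d), nn.Linear(d, d)
        self.scale = d ** -0.5

    def forward(self, queries, context):
        pos_q = torch.linspace(0, 100, queries.size(1), device=queries.device)
        pos_k = torch.linspace(0, 100, context.size(1), device=context.device)
        q = rope_1d(self.q_proj(queries), pos_q)
        k = rope_1d(self.k_proj(context), pos_k)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.out(attn @ self.v_proj(context))


# Toy usage: CLIP-style tokens (queries) attend to DINO tokens.
clip_layers = [torch.randn(2, 196, 256) for _ in range(4)]  # 4 encoder layers
dino_tokens = torch.randn(2, 256, 256)
clip_tokens = entropy_guided_aggregate(clip_layers)
fused = RoPECrossAttention(256)(clip_tokens, dino_tokens)   # [2, 196, 256]
```

In the full pipeline, the fused tokens would presumably pass through a small projector into the decoder-only LLM's embedding space, which is what allows the abstract to claim "minimal changes to standard VLM pipelines."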


Get this paper in your agent:

hf papers read 2604.03231

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

