
CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models

arXiv [Submitted on 20 Jan 2026 (v1), last revised 26 Mar 2026 (this version, v3)] March 30, 2026

arXiv:2601.13622v3 Announce Type: replace-cross

Authors: Donghee Lee, Rui Cai, Zhe Zhao


Abstract: Large vision-language models (LVLMs) are typically trained using autoregressive language modeling objectives, which align visual representations with linguistic space. While effective for multimodal reasoning, this alignment can weaken vision-centric capabilities, causing LVLMs to underperform their base vision encoders on tasks such as image classification. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a lightweight framework that integrates raw vision features with aligned LLM representations through vision-integration layers and a context-aware ensemble mechanism. This design enhances the model's ability to adaptively weight visual and textual modalities and enables the model to capture various aspects of image representations. Extensive experiments demonstrate that CARPE improves performance on both image classification and diverse vision-language benchmarks. Our results suggest that modality balancing plays a critical role in multimodal generalization by improving representation utilization within autoregressive LVLMs.
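The abstract does not say how the ensemble weights are computed. As a rough illustration of the general idea of adaptively weighting two prediction branches, here is a minimal sketch in plain Python. It is a generic gated ensemble, not CARPE's actual mechanism: the function names, the scalar `gate_logit`, and the choice to mix class probabilities are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Sigmoid that avoids overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_ensemble(vision_logits, llm_logits, gate_logit):
    """Blend two class distributions with a scalar gate in (0, 1).

    Hypothetical setup: `vision_logits` come from a vision-encoder branch,
    `llm_logits` from the language-aligned branch, and `gate_logit` is a
    context-dependent score (assumed to be produced elsewhere) that decides
    how much weight the vision branch receives.
    """
    if len(vision_logits) != len(llm_logits):
        raise ValueError("both branches must score the same classes")
    w = sigmoid(gate_logit)
    p_vision = softmax(vision_logits)
    p_llm = softmax(llm_logits)
    return [w * pv + (1.0 - w) * pl for pv, pl in zip(p_vision, p_llm)]


# A gate of 0.0 weights both branches equally.
probs = gated_ensemble([2.0, 0.5, -1.0], [0.1, 1.2, 0.3], gate_logit=0.0)
```

Because the output is a convex combination of two probability distributions, it is itself a valid distribution; a large positive gate recovers the vision branch alone, and a large negative gate recovers the language-aligned branch.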

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as: arXiv:2601.13622 [cs.CV] (or arXiv:2601.13622v3 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2601.13622

arXiv-issued DOI via DataCite

Submission history

From: Dong Hee Lee
[v1] Tue, 20 Jan 2026 05:44:33 UTC (283 KB)
[v2] Wed, 18 Mar 2026 04:41:36 UTC (278 KB)
[v3] Thu, 26 Mar 2026 21:38:46 UTC (278 KB)

Original source

arXiv

https://arxiv.org/abs/2601.13622

