
CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models

Source: arXiv, March 30, 2026
[Submitted on 20 Jan 2026 (v1), last revised 26 Mar 2026 (this version, v3)]

arXiv:2601.13622v3 Announce Type: replace-cross
Authors: Donghee Lee, Rui Cai, Zhe Zhao


Abstract: Large vision-language models (LVLMs) are typically trained using autoregressive language modeling objectives, which align visual representations with linguistic space. While effective for multimodal reasoning, this alignment can weaken vision-centric capabilities, causing LVLMs to underperform their base vision encoders on tasks such as image classification. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a lightweight framework that integrates raw vision features with aligned LLM representations through vision-integration layers and a context-aware ensemble mechanism. This design enhances the model's ability to adaptively weight visual and textual modalities and enables the model to capture various aspects of image representations. Extensive experiments demonstrate that CARPE improves performance on both image classification and diverse vision-language benchmarks. Our results suggest that modality balancing plays a critical role in multimodal generalization by improving representation utilization within autoregressive LVLMs.
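The abstract does not spell out how the context-aware ensemble is computed, so the following is only a minimal sketch of the general idea of adaptively weighting two prediction sources, not the authors' implementation. It assumes (hypothetically) that the vision encoder and the LLM each produce class logits, and that a scalar gate in [0, 1] is derived from a context vector by a sigmoid over a dot product. The names `context_gate`, `ensemble_probs`, and `gate_weights` are illustrative and do not come from the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def context_gate(context: Sequence[float],
                 gate_weights: Sequence[float],
                 bias: float = 0.0) -> float:
    """Hypothetical gate: sigmoid of a linear score of the context vector.

    Returns a weight in (0, 1) for the vision branch. This parameterization
    is an assumption made for illustration only.
    """
    score = sum(c * w for c, w in zip(context, gate_weights)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def ensemble_probs(vision_logits: Sequence[float],
                   llm_logits: Sequence[float],
                   gate: float) -> List[float]:
    """Convex combination of the two branches' class probabilities."""
    p_vision = softmax(vision_logits)
    p_llm = softmax(llm_logits)
    return [gate * v + (1.0 - gate) * l for v, l in zip(p_vision, p_llm)]


if __name__ == "__main__":
    # Toy inputs: three classes, a two-dimensional context vector.
    gate = context_gate(context=[1.0, -0.5], gate_weights=[0.8, 0.4])
    probs = ensemble_probs([2.0, 0.5, -1.0], [0.1, 1.5, 0.3], gate)
    print(f"gate={gate:.3f}", [round(p, 3) for p in probs])
```

Because the output is a convex combination of two probability distributions, it stays a valid distribution for any gate value; a gate near 1 defers to the vision branch, and a gate near 0 defers to the language-aligned branch.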

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as: arXiv:2601.13622 [cs.CV] (or arXiv:2601.13622v3 [cs.CV] for this version)

DOI: https://doi.org/10.48550/arXiv.2601.13622 (arXiv-issued DOI via DataCite)

Submission history

From: Dong Hee Lee
[v1] Tue, 20 Jan 2026 05:44:33 UTC (283 KB)
[v2] Wed, 18 Mar 2026 04:41:36 UTC (278 KB)
[v3] Thu, 26 Mar 2026 21:38:46 UTC (278 KB)
