Live
Black Hat USADark ReadingBlack Hat AsiaAI BusinessNvidia Stock Rises. This Issue Could Hamper Its Next-Generation AI Chips. - Barron'sGNews AI NVIDIABroadcom's CEO Has Line of Sight to $100 Billion in AI Chip Revenue. Is the Stock a Buy? - The Motley FoolGoogle News: AII gave Claude Code our entire codebase. Our customers noticed. | Al Chen (Galileo)lennysnewsletter.comGoogle DeepMind and Agile Robotics Combine Robotics Platforms - Automation WorldGoogle News: DeepMindBig Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.Dev.to AIBuilding a Resume & Portfolio Platform with Next.js and ReactDev.to AIWhy AI-Powered Ecommerce Website Development Is the New Competitive Edge in 2026Dev.to AIFAQs on Visionary AI: Transforming the Future of InnovationDev.to AIDid AMD Just Beat Nvidia In AI Performance? - ForbesGNews AI NVIDIANvidia and Google are the safest AI bets in public markets: Intelligent Alpha CEO Doug Clinton - CNBCGNews AI NVIDIAOnly 20% of MCP Servers Are 'A-Grade' Secure — Here's How to Vet Them Before InstallingDev.to AIThe Senior Engineer's Guide to CLAUDE.md: From Generic to ActionableDev.to AIBlack Hat USADark ReadingBlack Hat AsiaAI BusinessNvidia Stock Rises. This Issue Could Hamper Its Next-Generation AI Chips. - Barron'sGNews AI NVIDIABroadcom's CEO Has Line of Sight to $100 Billion in AI Chip Revenue. Is the Stock a Buy? - The Motley FoolGoogle News: AII gave Claude Code our entire codebase. Our customers noticed. | Al Chen (Galileo)lennysnewsletter.comGoogle DeepMind and Agile Robotics Combine Robotics Platforms - Automation WorldGoogle News: DeepMindBig Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.Dev.to AIBuilding a Resume & Portfolio Platform with Next.js and ReactDev.to AIWhy AI-Powered Ecommerce Website Development Is the New Competitive Edge in 2026Dev.to AIFAQs on Visionary AI: Transforming the Future of InnovationDev.to AIDid AMD Just Beat Nvidia In AI Performance? - ForbesGNews AI NVIDIANvidia and Google are the safest AI bets in public markets: Intelligent Alpha CEO Doug Clinton - CNBCGNews AI NVIDIAOnly 20% of MCP Servers Are 'A-Grade' Secure — Here's How to Vet Them Before InstallingDev.to AIThe Senior Engineer's Guide to CLAUDE.md: From Generic to ActionableDev.to AI
AI NEWS HUBbyEIGENVECTOREigenvector

CARV: A Diagnostic Benchmark for Compositional Analogical Reasoning in Multimodal LLMs

arXivby [Submitted on 30 Mar 2026]March 31, 20261 min read1 views
Source Quiz

arXiv:2603.27958v1 Announce Type: new Abstract: Analogical reasoning tests a fundamental aspect of human cognition: mapping the relation from one pair of objects to another. Existing evaluations of this ability in multimodal large language models (MLLMs) overlook the ability to compose rules from multiple sources, a critical component of higher-order intelligence. To close this gap, we introduce CARV (Compositional Analogical Reasoning in Vision), a novel task together with a 5,500-sample dataset as the first diagnostic benchmark. We extend the analogy from a single pair to multiple pairs, whi — Yongkang Du, Xiaohan Zou, Minhao Cheng, Lu Lin

View PDF HTML (experimental)

Abstract:Analogical reasoning tests a fundamental aspect of human cognition: mapping the relation from one pair of objects to another. Existing evaluations of this ability in multimodal large language models (MLLMs) overlook the ability to compose rules from multiple sources, a critical component of higher-order intelligence. To close this gap, we introduce CARV (Compositional Analogical Reasoning in Vision), a novel task together with a 5,500-sample dataset as the first diagnostic benchmark. We extend the analogy from a single pair to multiple pairs, which requires MLLMs to extract symbolic rules from each pair and compose new transformations. Evaluation on the state-of-the-art MLLMs reveals a striking performance gap: even Gemini-2.5 Pro achieving only 40.4% accuracy, far below human-level performance of 100%. Diagnostic analysis shows two consistent failure modes: (1) decomposing visual changes into symbolic rules, and (2) maintaining robustness under diverse or complex settings, highlighting the limitations of current MLLMs on this task.

Subjects:

Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.27958 [cs.AI]

(or arXiv:2603.27958v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.27958

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yongkang Du [view email] [v1] Mon, 30 Mar 2026 02:22:30 UTC (708 KB)

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

researchpaperarxiv

Knowledge Map

Knowledge Map
TopicsEntitiesSource
CARV: A Dia…researchpaperarxivaiartificial-…arXiv

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 204 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!

More in Research Papers