
M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency

arXiv cs.CL, by Abolfazl Ansari, Delvin Ce Zhang, Zhuoyang Zou, Wenpeng Yin, Dongwon Lee. April 4, 2026



Abstract: Evaluating scientific arguments requires assessing the strict consistency between a claim and its underlying multimodal evidence. However, existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate this alignment realistically. To address this gap, we introduce M2-Verify, a large-scale multimodal dataset for checking scientific claim consistency. Sourced from PubMed and arXiv, M2-Verify provides over 469K instances across 16 domains, rigorously validated through expert audits. Extensive baseline experiments show that state-of-the-art models struggle to maintain robust consistency. While top models achieve up to 85.8% Micro-F1 on low-complexity medical perturbations, performance drops to 61.6% on high-complexity challenges like anatomical shifts. Furthermore, expert evaluations expose hallucinations when models generate scientific explanations for their alignment decisions. Finally, we demonstrate our dataset's utility and provide comprehensive usage guidelines.
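The abstract reports results as Micro-F1. As a minimal sketch of how that metric is computed, the snippet below pools true positives, false positives, and false negatives across all classes before computing a single F1 score. The label names ("consistent" / "inconsistent") and the toy predictions are assumptions made for illustration only; the abstract does not specify the benchmark's label set or evaluation script.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1: pool TP/FP/FN over all classes, then compute F1 once."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = fp = fn = 0
    # Treat each label in turn as the "positive" class and accumulate counts.
    for label in set(gold) | set(pred):
        for g, p in zip(gold, pred):
            if p == label and g == label:
                tp += 1
            elif p == label and g != label:
                fp += 1
            elif p != label and g == label:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels and predictions, not taken from M2-Verify.
gold = ["consistent", "inconsistent", "consistent", "inconsistent"]
pred = ["consistent", "consistent", "consistent", "inconsistent"]
score = micro_f1(gold, pred)
print(f"Micro-F1: {score:.3f}")
```

When every instance carries exactly one label, as in this toy example, Micro-F1 coincides with plain accuracy, because each misclassification counts once as a false positive and once as a false negative.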

Comments: Preprint. Under Review

Subjects: Computation and Language (cs.CL)

Cite as: arXiv:2604.01306 [cs.CL]

(or arXiv:2604.01306v1 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2604.01306

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Abolfazl Ansari
[v1] Wed, 1 Apr 2026 18:18:10 UTC (5,260 KB)
