M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency
arXiv:2604.01306v1 Announce Type: new
Abstract: Evaluating scientific arguments requires assessing the strict consistency between a claim and its underlying multimodal evidence. However, existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate this alignment realistically. To address this gap, we introduce M2-Verify, a large-scale multimodal dataset for checking scientific claim consistency. Sourced from PubMed and arXiv, M2-Verify provides over 469K instances across 16 domains, rigorously validated through expert audits. Extensive baseline experiments show that state-of-the-art models struggle to maintain robust consistency. While top models achieve up to 85.8% Micro-F1 on low-complexity medical perturbations, performance drops to 61.6% on high-complexity challenges like anatomical shifts. Furthermore, expert evaluations expose hallucinations when models generate scientific explanations for their alignment decisions. Finally, we demonstrate our dataset's utility and provide comprehensive usage guidelines.
Comments: Preprint. Under Review
Subjects:
Computation and Language (cs.CL)
Cite as: arXiv:2604.01306 [cs.CL]
(or arXiv:2604.01306v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2604.01306
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Abolfazl Ansari [view email] [v1] Wed, 1 Apr 2026 18:18:10 UTC (5,260 KB)

quarkus-chat-ui: A Web Front-End for LLMs, and a Real-World Case for POJO-actor
Note: This article was originally published on SciVicsLab. quarkus-chat-ui is a web UI for LLMs where multiple instances can talk to each other, built as a real-world use case for POJO-actor. Each quarkus-chat-ui instance exposes an HTTP MCP server at /mcp, so Instance A can call tools on Instance B, and Instance B can reply by calling tools back on A. The LLM backend (Claude Code CLI, Codex, or a local model via claw-code-local) acts as an MCP client that can reach these endpoints. The question was how to wire that up over HTTP, and how to handle the fact that LLM responses take tens of seconds and arrive as a stream. quarkus-chat-ui is the bridge that makes this work. Each instance wraps one LLM backend
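The cross-instance pattern described above can be sketched in a few lines. This is a hedged illustration, not quarkus-chat-ui's actual implementation: each "instance" is a plain Python object, and the tool call that would really be an HTTP POST to the peer's /mcp endpoint is simulated as a direct method call, so the loop (A calls a tool on B, B replies by calling a tool back on A) is easy to follow. The `Instance` class and the `ask`/`answer` tool names are hypothetical.

```python
# Minimal sketch of two chat instances calling tools on each other.
# In the real system this hop is an HTTP request to the peer's MCP
# server at /mcp; here it is a direct method call for clarity.

class Instance:
    def __init__(self, name):
        self.name = name
        self.peer = None       # stand-in for the peer's /mcp endpoint
        self.received = []     # tool calls this instance has handled

    def call_tool(self, tool, payload):
        # Stand-in for an HTTP POST to the peer's MCP server.
        return self.peer.handle(tool, payload, caller=self)

    def handle(self, tool, payload, caller):
        self.received.append((tool, payload))
        if tool == "ask":
            # Reply by calling a tool back on the caller,
            # just as Instance B replies to Instance A.
            caller.handle("answer", f"{self.name} got: {payload}", caller=self)
        return "ok"

a, b = Instance("A"), Instance("B")
a.peer, b.peer = b, a

a.call_tool("ask", "hello")
print(b.received)  # the call A made on B
print(a.received)  # the callback B made on A
```

In the real deployment the reply cannot be a synchronous return value, because LLM responses take tens of seconds and stream in; modeling the reply as a separate tool call back to the caller, as above, is what makes the long-running, asynchronous exchange workable.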

I'm under 18, broke, and I just designed an open-source AI chip. Here's the full story.
I don't have a team. I don't have funding. I don't have a lab. I have a laptop, an internet connection, and an obsession with chips. This is the story of T1C — Tier 1 Chip — and why I built it. It started with a frustration. Every time I read about AI hardware, it was the same story. NVIDIA charges $30,000 for an H100. TSMC charges millions for a custom fab run. Apple Silicon is beautiful but completely closed. Intel, Qualcomm, AMD — all of them — locked behind NDAs, closed architectures, and billion-dollar relationships. I kept thinking: why does no one make an open-source AI chip that a real person can actually fabricate? Not a toy. Not a demo. A real architecture with real specs, real physics, and a real path to silicon. So I built one. T1C uses Digital In-Memory Computing — D-IMC. Inst

Qodo vs Cody (Sourcegraph): AI Code Review Compared (2026)
Quick Verdict Qodo and Sourcegraph Cody are both AI tools for software teams, but they solve fundamentally different problems. Qodo is a code quality platform - it reviews pull requests automatically, finds bugs through a multi-agent architecture, and generates tests to fill coverage gaps without being asked. Cody is a codebase-aware AI coding assistant - it understands your entire repository and helps developers navigate, generate, and understand code through conversation and inline completions. Choose Qodo if: your team needs automated PR review that runs on every pull request without prompting, you want proactive test generation that closes coverage gaps systematically, you work on GitLab or Azure DevOps alongside GitHub, or the open-source transparency of PR-Agent matters to your organ

# Understanding Data Modeling in PowerBI: Joins, Relationships, and Schemas
Introduction. Data modeling is the backbone of successful data science and analytics in PowerBI. We are going to look into: data modeling fundamentals, SQL joins, PowerBI relationships, fact vs. dimension tables, and data schemas. The big question: WHAT is data modeling? Data modeling is the process of organizing and structuring data from multiple sources into a logical format for analysis. It is all about: connecting tables, defining relationships, structuring data into fact and dimension tables, and optimizing performance. Think of it as making your data easy and smooth to read and interpret. SQL joins. Joins combine data from multiple tables using a common column. 1. INNER JOIN. This only returns matching records in
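The INNER JOIN behavior described above can be demonstrated with a small, self-contained example using Python's built-in sqlite3 module and an in-memory database. The `sales` and `products` tables and their columns are made-up examples, not from the article.

```python
import sqlite3

# Two small tables sharing a common column, product_id.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (sale_id INTEGER, product_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE products (product_id INTEGER, name TEXT)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, 10, 99.0), (2, 11, 25.0), (3, 99, 5.0)])  # 99 has no match
cur.executemany("INSERT INTO products VALUES (?, ?)",
                [(10, "Laptop"), (11, "Mouse")])

# INNER JOIN keeps only rows whose product_id appears in BOTH tables,
# so sale 3 (product_id 99) is dropped from the result.
rows = cur.execute("""
    SELECT s.sale_id, p.name, s.amount
    FROM sales AS s
    INNER JOIN products AS p ON s.product_id = p.product_id
    ORDER BY s.sale_id
""").fetchall()
print(rows)  # only the two sales with a matching product
```

The same matching logic is what a PowerBI relationship performs between a fact table (sales) and a dimension table (products): the shared key column decides which rows line up.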
[D] Offering licensed Indian language speech datasets (with explicit contributor consent)
Hi everyone, I run a small data initiative where we collect speech datasets in multiple Indian languages directly from contributors who provide explicit consent for their recordings to be used and licensed. We can provide datasets with either exclusive or non-exclusive rights, depending on the use case. The goal is to make ethically sourced speech data available for teams working on ASR, TTS, voice AI, or related research. If anyone here is working on speech models and might be looking for Indian-language audio data, feel free to reach out. Happy to share more details about the datasets and collection process. Divyam, Founder, DataCatalyst (datacatalyst.in)

Unnoticed Gemma-4 Feature: it admits that it does not know
Although Qwen3.5 is a great series of models, it is prone to making very broad assumptions and hallucinating, and it does so with great confidence, so you may believe what it says. In contrast, Gemma-4 (specifically, I tested the E4b Q8 version) admits that it does not know right at the start of a conversation: "Therefore, I cannot confirm familiarity with a single, specific research study by that name. However, I am generally familiar with the factors that researchers and military trainers study regarding attrition in elite training programs..." That is a very important feature, and it may hint at a change in the model training routine, where admitting ignorance is penalized less than guessing and then failing.

