
MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

arXiv, March 31, 2026


Authors: Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, Lidong Bing


Abstract: Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics; agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments; and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking the highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.
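As a rough illustration of how the three dimensions and the text-only versus multimodal gap could be tabulated, the sketch below stores one score per dimension and computes the drop between settings. Everything in it is an assumption made for illustration: the `DimensionScores` type, the 0 to 100 scale, the unweighted mean used as an overall score, and the placeholder numbers. The abstract does not state how MiroEval aggregates dimensions, and none of these values are results from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class DimensionScores:
    """Scores on the three dimensions named in the abstract (assumed 0-100 scale)."""
    synthesis: float   # adaptive synthesis quality
    factuality: float  # agentic factuality verification
    process: float     # process-centric evaluation

    def overall(self) -> float:
        # Assumption: unweighted mean. The abstract does not specify the aggregation.
        return mean([self.synthesis, self.factuality, self.process])


def multimodal_drop(text_only: DimensionScores, multimodal: DimensionScores) -> float:
    """Points lost when moving from text-only tasks to multimodal tasks."""
    return text_only.overall() - multimodal.overall()


# Placeholder numbers for a hypothetical system, NOT results from the paper.
text_scores = DimensionScores(synthesis=70.0, factuality=80.0, process=75.0)
mm_scores = DimensionScores(synthesis=66.0, factuality=72.0, process=69.0)

drop = multimodal_drop(text_scores, mm_scores)
print(f"overall (text-only): {text_scores.overall():.1f}")
print(f"overall (multimodal): {mm_scores.overall():.1f}")
print(f"drop: {drop:.1f} points")
```

Keeping the per-dimension scores separate, rather than storing only an overall number, matches the abstract's point that each dimension reveals different strengths and weaknesses.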

Comments: GitHub: this https URL

Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

Cite as: arXiv:2603.28407 [cs.AI]

(or arXiv:2603.28407v1 [cs.AI] for this version)

https://doi.org/10.48550/arXiv.2603.28407

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Lei Wang
[v1] Mon, 30 Mar 2026 13:16:03 UTC (1,317 KB)
