AIBench: Evaluating Visual-Logical Consistency in Academic Illustration Generation

arXiv, March 31, 2026
Authors: Zhaohe Liao, Kaixun Jiang, Zhihang Liu, Yujie Wei, Junqiu Yu, Quanhao Li, Hong-Tao Yu, Pandeng Li, Yuzheng Wang, Zhen Xing, Shiwei Zhang, Chen-Wei Xie, Yun Zheng, Xihui Liu

Abstract: Although image generation has advanced rapidly and boosted various applications, whether state-of-the-art models can produce ready-to-use academic illustrations for papers is still largely unexplored. Directly comparing or evaluating the illustration with a VLM is straightforward but requires oracle-level multi-modal understanding, which is unreliable for long and complex texts and illustrations. To address this, we propose AIBench, the first benchmark using VQA to evaluate the logical correctness of academic illustrations and VLMs to assess their aesthetics. In detail, we design four levels of questions derived from a logic diagram summarized from the method section of the paper, which query whether the generated illustration aligns with the paper at different scales. Our VQA-based approach yields more accurate and detailed evaluations of visual-logical consistency while relying less on the ability of the judge VLM. With AIBench, we conduct extensive experiments and conclude that the performance gap between models on this task is significantly larger than on general tasks, reflecting their differing abilities in complex reasoning and high-density generation. Further, logic and aesthetics are hard to optimize simultaneously, as in handcrafted illustrations. Additional experiments show that test-time scaling on both abilities significantly boosts performance on this task.
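The VQA-based evaluation described in the abstract could be aggregated roughly as follows. This is only a minimal sketch under stated assumptions: the abstract does not say how answers are matched or how the four levels are combined, so the exact-match comparison, the unweighted mean across levels, and the names `VQAItem`, `per_level_accuracy`, and `overall_score` are all hypothetical and not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VQAItem:
    """One question about a generated illustration (hypothetical structure)."""
    level: int      # question level, 1 to 4; what each level covers is not stated in the abstract
    question: str
    expected: str   # reference answer derived from the paper's logic diagram
    predicted: str  # answer given by the judge VLM when shown the illustration


def per_level_accuracy(items):
    """Return {level: fraction of questions answered as expected}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.level] += 1
        # Assumption: case-insensitive exact match counts as correct.
        if item.predicted.strip().lower() == item.expected.strip().lower():
            correct[item.level] += 1
    return {level: correct[level] / total[level] for level in sorted(total)}


def overall_score(items):
    """Assumption: unweighted mean of the per-level accuracies."""
    acc = per_level_accuracy(items)
    return sum(acc.values()) / len(acc) if acc else 0.0
```

Keeping the per-level accuracies separate, rather than pooling all questions, would make it possible to see at which scale an illustration diverges from the paper, which appears to be the motivation for using multiple question levels.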

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.28068 [cs.CV]

(or arXiv:2603.28068v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.28068

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Kaixun Jiang
[v1] Mon, 30 Mar 2026 06:14:40 UTC (11,070 KB)
