
SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation

arXiv cs.CV · Ryosuke Matsuda, Keito Kudo, Haruto Yoshida, Nobuyuki Shimizu, Jun Suzuki · April 1, 2026 · 1 min read


Abstract: This paper proposes the synthetic long-video meta-evaluation (SLVMEval), a benchmark for meta-evaluating text-to-video (T2V) evaluation systems. The proposed SLVMEval benchmark focuses on assessing these systems on videos of up to 10,486 s (approximately 3 h). The benchmark targets a fundamental requirement, namely, whether the systems can accurately assess video quality in settings that are easy for humans to assess. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video-captioning datasets, we synthetically degrade source videos to create controlled "high-quality versus low-quality" pairs across 10 distinct aspects. Then, we employ crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing an effective final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Experimental results demonstrate that human evaluators can identify the better long video with 84.7%-96.8% accuracy, and in nine of the 10 aspects, the accuracy of these systems falls short of human assessment, revealing weaknesses in text-to-long-video evaluation.
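The pairwise meta-evaluation framework described above can be sketched as follows: given degraded/original pairs grouped by aspect, an evaluation system is scored by how often it ranks the high-quality video above its degraded counterpart. This is a minimal illustration only; the `VideoPair` structure, aspect names, and scoring interface are hypothetical, not the paper's actual pipeline.

```python
# Sketch of pairwise comparison-based meta-evaluation: an evaluator is
# "correct" on a pair when it scores the high-quality video above the
# synthetically degraded one. Names and data layout are illustrative.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VideoPair:
    aspect: str  # one of the benchmark's 10 degradation aspects
    high: str    # identifier of the original (high-quality) video
    low: str     # identifier of the degraded counterpart


def pairwise_accuracy(pairs: List[VideoPair],
                      score: Callable[[str], float]) -> Dict[str, float]:
    """Per-aspect fraction of pairs ranked correctly by the evaluator."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for p in pairs:
        total[p.aspect] = total.get(p.aspect, 0) + 1
        if score(p.high) > score(p.low):
            correct[p.aspect] = correct.get(p.aspect, 0) + 1
    return {a: correct.get(a, 0) / total[a] for a in total}


# Toy usage with a trivial stand-in "evaluator" (string length as score):
pairs = [VideoPair("temporal-consistency", "vid_high_001", "v_low_001")]
acc = pairwise_accuracy(pairs, score=len)
```

A system whose per-aspect accuracy stays well below the 84.7%-96.8% human range on such pairs would, by the paper's criterion, be considered unreliable for that aspect.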

Comments: Accepted to CVPR 2026

Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.29186 [cs.CV]

(or arXiv:2603.29186v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.29186

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Ryosuke Matsuda [v1] Tue, 31 Mar 2026 02:51:30 UTC (3,593 KB)

