
GEditBench v2: A Human-Aligned Benchmark for General Image Editing

arXiv · March 31, 2026

arXiv:2603.28547v1 Announce Type: new

Authors: Zhangqi Jiang, Zheng Sun, Xianfang Zeng, Yufeng Yang, Xuanyang Zhang, Yongliang Wu, Wei Cheng, Gang Yu, Xu Yang, Bihan Wen


Abstract: Recent advances in image editing have enabled models to handle complex instructions with impressive realism. However, existing evaluation frameworks lag behind: current benchmarks suffer from narrow task coverage, while standard metrics fail to adequately capture visual consistency, i.e., the preservation of identity, structure, and semantic coherence between edited and original images. To address these limitations, we introduce GEditBench v2, a comprehensive benchmark with 1,200 real-world user queries spanning 23 tasks, including a dedicated open-set category for unconstrained, out-of-distribution editing instructions beyond predefined tasks. Furthermore, we propose PVC-Judge, an open-source pairwise assessment model for visual consistency, trained via two novel region-decoupled preference data synthesis pipelines. In addition, we construct VCReward-Bench from expert-annotated preference pairs to assess how well PVC-Judge aligns with human judgments of visual consistency. Experiments show that PVC-Judge achieves state-of-the-art evaluation performance among open-source models and even surpasses GPT-5.1 on average. Finally, by benchmarking 16 frontier editing models, we show that GEditBench v2 enables more human-aligned evaluation, reveals critical limitations of current models, and provides a reliable foundation for advancing precise image editing.
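As a rough illustration of what measuring a pairwise judge against human preference pairs can look like, the sketch below computes a plain agreement rate. It is not the paper's evaluation code: the record layout, the function name pairwise_agreement, and the toy data are all assumptions made for this example, and the abstract does not state which agreement metric VCReward-Bench reports.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical record: (sample_id, human_choice), where human_choice is "A" or "B"
# and names which of two edited images the annotator judged more consistent.
Pair = Tuple[str, str]


def pairwise_agreement(pairs: Iterable[Pair], judge: Callable[[str], str]) -> float:
    """Return the fraction of pairs where the judge picks the same side as the human."""
    total = 0
    matches = 0
    for sample_id, human_choice in pairs:
        total += 1
        if judge(sample_id) == human_choice:
            matches += 1
    if total == 0:
        raise ValueError("no preference pairs given")
    return matches / total


# Toy data, invented purely for illustration (not taken from the paper).
toy_pairs = [("s1", "A"), ("s2", "B"), ("s3", "A"), ("s4", "B")]
toy_judge_outputs = {"s1": "A", "s2": "B", "s3": "B", "s4": "B"}

agreement = pairwise_agreement(toy_pairs, lambda sid: toy_judge_outputs[sid])
print(agreement)  # 3 of 4 toy pairs agree, so this prints 0.75
```

In practice the judge callable would wrap a model call that compares two edited images against the source image; here it is a dictionary lookup so the example stays self-contained.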

Comments: 30 pages, 24 figures

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.28547 [cs.CV]

(or arXiv:2603.28547v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.28547

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Zhangqi Jiang
[v1] Mon, 30 Mar 2026 15:08:32 UTC (26,166 KB)
