
CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions

arXiv, March 30, 2026

Authors: Chonghuinan Wang, Zihan Chen, Yuxiang Wei, Tianyi Jiang, Xiaohe Wu, Fan Li, Wangmeng Zuo, Hongxun Yao


Abstract: Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To address this gap, we propose CREval, a fully automated question-answer (QA)-based evaluation pipeline that overcomes the incompleteness and poor interpretability of opaque Multimodal Large Language Models (MLLMs) scoring. Simultaneously, we introduce CREval-Bench, a comprehensive benchmark specifically designed for creative image manipulation under complex instructions. CREval-Bench covers three categories and nine creative dimensions, comprising over 800 editing samples and 13K evaluation queries. Leveraging this pipeline and benchmark, we systematically evaluate a diverse set of state-of-the-art open and closed-source models. The results reveal that while closed-source models generally outperform open-source ones on complex and creative tasks, all models still struggle to complete such edits effectively. In addition, user studies demonstrate strong consistency between CREval's automated metrics and human judgments. Therefore, CREval provides a reliable foundation for evaluating image editing models on complex and creative image manipulation tasks, and highlights key challenges and opportunities for future research.
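The abstract describes QA-based scoring only at a high level, so the following is a minimal sketch of the general idea and not the authors' implementation. It assumes each evaluation query is a yes/no check tied to one creative dimension and that a dimension's score is the fraction of checks passed. The names `QAResult`, `score_by_dimension`, `failed_questions`, the dimension labels, and the sample questions are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAResult:
    dimension: str  # hypothetical label for a creative dimension
    question: str   # yes/no check derived from the editing instruction
    passed: bool    # True if the judge's answer matched the expected answer


def score_by_dimension(results):
    """Return the fraction of passed checks for each dimension."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r.dimension] += 1
        passed[r.dimension] += int(r.passed)
    return {dim: passed[dim] / total[dim] for dim in total}


def failed_questions(results):
    """List the questions that failed, so a low score can be traced to causes."""
    return [r.question for r in results if not r.passed]


# Illustrative data only; these are not queries from CREval-Bench.
results = [
    QAResult("style", "Is the scene rendered as a watercolor?", True),
    QAResult("style", "Is the original color palette preserved?", False),
    QAResult("layout", "Is the subject still centered?", True),
]

scores = score_by_dimension(results)
failures = failed_questions(results)
```

Keeping the individual questions alongside the aggregate is what makes this kind of score inspectable: a single opaque rating cannot say which part of a complex instruction the edit missed, while a list of failed checks can.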

Comments: Accepted by CVPR 2026

Subjects: Computer Vision and Pattern Recognition (cs.CV)

Cite as: arXiv:2603.26174 [cs.CV]

(or arXiv:2603.26174v1 [cs.CV] for this version)

https://doi.org/10.48550/arXiv.2603.26174

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Chonghuinan Wang

[v1] Fri, 27 Mar 2026 08:42:09 UTC (37,557 KB)

Original source

arXiv

https://arxiv.org/abs/2603.26174
