MA-SAPO: Multi-Agent Reasoning for Score-Aware Prompt Optimization

arXiv cs.MA | by Wonduk Seo, Juhyeon Lee, Junseo Koh, Wonseok Choi, Hyunjin An, Jian Park, Seunghyun lee, Haihua Chen, Yi Bu | April 1, 2026

Abstract: Prompt optimization has become a practical way to improve the performance of Large Language Models (LLMs) without retraining. However, most existing frameworks treat evaluation as a black box, relying solely on outcome scores without explaining why prompts succeed or fail. Moreover, they involve repetitive trial-and-error refinements that remain implicit, offering limited interpretability or actionable guidance for systematic improvement. In this paper, we propose MA-SAPO: a new Multi-Agent Reasoning for Score-Aware Prompt Optimization framework that links evaluation outcomes directly to targeted refinements. Specifically, in the Training Phase, multiple agents interpret evaluation scores, diagnose weaknesses, and generate concrete revision directives, which are stored as reusable reasoning assets. In the Test Phase, an analyzer agent retrieves relevant exemplars and assets for a new prompt, and a refiner agent applies evidence-based edits to improve the prompt and its response. By grounding optimization in structured reasoning, MA-SAPO ensures edits are interpretable, auditable, and controllable. Experiments on the HelpSteer1/2 benchmarks show that our framework consistently outperforms single-pass prompting, retrieval-augmented generation, and prior multi-agent methods across multiple evaluation metrics.
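The two phases described in the abstract can be pictured with a small Python sketch. This is not the authors' implementation: the agents here are plain deterministic functions standing in for LLM calls, and every name (`ReasoningAsset`, `build_asset`, `retrieve`, `refine`), the score threshold, the metric names, and the token-overlap retrieval are assumptions made for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ReasoningAsset:
    """Hypothetical record of one training example and the reasoning about it."""
    prompt: str
    scores: dict[str, float]
    diagnosis: str
    directives: list[str]


def build_asset(prompt: str, scores: dict[str, float], threshold: float = 3.0) -> ReasoningAsset:
    """Training-phase stand-in: interpret scores, diagnose weaknesses, emit directives.

    The threshold and the directive wording are assumptions; in the paper these
    steps are performed by agents, not by a fixed rule.
    """
    weak = sorted(metric for metric, score in scores.items() if score < threshold)
    diagnosis = "weak on: " + ", ".join(weak) if weak else "no weak metrics"
    directives = [f"Revise the prompt to improve {metric}." for metric in weak]
    return ReasoningAsset(prompt, scores, diagnosis, directives)


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


def retrieve(new_prompt: str, store: list[ReasoningAsset], k: int = 1) -> list[ReasoningAsset]:
    """Analyzer stand-in: rank stored assets by Jaccard token overlap with the new prompt."""
    query = _tokens(new_prompt)

    def similarity(asset: ReasoningAsset) -> float:
        other = _tokens(asset.prompt)
        union = query | other
        return len(query & other) / len(union) if union else 0.0

    return sorted(store, key=similarity, reverse=True)[:k]


def refine(new_prompt: str, assets: list[ReasoningAsset]) -> tuple[str, list[str]]:
    """Refiner stand-in: append retrieved directives and return them as an audit trail."""
    trail: list[str] = []
    for asset in assets:
        for directive in asset.directives:
            if directive not in trail:
                trail.append(directive)
    if not trail:
        return new_prompt, trail
    refined = new_prompt + "\n" + "\n".join(f"- {d}" for d in trail)
    return refined, trail


# Training phase: turn scored prompts into reusable assets (scores are made up).
store = [
    build_asset("Explain how binary search works", {"helpfulness": 2.0, "coherence": 4.0}),
    build_asset("Write a poem about autumn", {"helpfulness": 4.0, "coherence": 1.0}),
]

# Test phase: retrieve the closest asset, then apply its directives to a new prompt.
hits = retrieve("Explain how merge sort works", store, k=1)
refined, trail = refine("Explain how merge sort works", hits)
print(refined)
print(trail)
```

Returning the list of applied directives alongside the refined prompt is one simple way to make each edit traceable to the stored evidence that motivated it, which is the property the abstract calls auditable.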

Comments: Preprint

Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)

Cite as: arXiv:2510.16635 [cs.MA]

(or arXiv:2510.16635v2 [cs.MA] for this version)

https://doi.org/10.48550/arXiv.2510.16635

arXiv-issued DOI via DataCite

Submission history

From: Wonduk Seo
[v1] Sat, 18 Oct 2025 20:21:09 UTC (1,304 KB)
[v2] Mon, 30 Mar 2026 21:53:02 UTC (3,435 KB)
