MA-SAPO: Multi-Agent Reasoning for Score-Aware Prompt Optimization
arXiv:2510.16635v2 Announce Type: replace
Abstract: Prompt optimization has become a practical way to improve the performance of Large Language Models (LLMs) without retraining. However, most existing frameworks treat evaluation as a black box, relying solely on outcome scores without explaining why prompts succeed or fail. Moreover, they involve repetitive trial-and-error refinements that remain implicit, offering limited interpretability and little actionable guidance for systematic improvement. In this paper, we propose MA-SAPO: a new Multi-Agent Reasoning for Score-Aware Prompt Optimization framework that links evaluation outcomes directly to targeted refinements. Specifically, in the Training Phase, multiple agents interpret evaluation scores, diagnose weaknesses, and generate concrete revision directives, which are stored as reusable reasoning assets. In the Test Phase, an analyzer agent retrieves relevant exemplars and assets for a new prompt, and a refiner agent applies evidence-based edits to improve the prompt and its response. By grounding optimization in structured reasoning, MA-SAPO keeps edits interpretable, auditable, and controllable. Experiments on the HelpSteer1/2 benchmarks show that our framework consistently outperforms single-pass prompting, retrieval-augmented generation, and prior multi-agent methods across multiple evaluation metrics.
Comments: Preprint
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Cite as: arXiv:2510.16635 [cs.MA]
(or arXiv:2510.16635v2 [cs.MA] for this version)
https://doi.org/10.48550/arXiv.2510.16635
arXiv-issued DOI via DataCite
Submission history
From: Wonduk Seo
[v1] Sat, 18 Oct 2025 20:21:09 UTC (1,304 KB)
[v2] Mon, 30 Mar 2026 21:53:02 UTC (3,435 KB)