MA-SAPO: Multi-Agent Reasoning for Score-Aware Prompt Optimization
arXiv:2510.16635v2 Announce Type: replace
Abstract: Prompt optimization has become a practical way to improve the performance of Large Language Models (LLMs) without retraining. However, most existing frameworks treat evaluation as a black box, relying solely on outcome scores without explaining why prompts succeed or fail. Moreover, they involve repetitive trial-and-error refinements that remain implicit, offering limited interpretability and little actionable guidance for systematic improvement. In this paper, we propose MA-SAPO: a new Multi-Agent Reasoning for Score-Aware Prompt Optimization framework that links evaluation outcomes directly to targeted refinements. Specifically, in the Training Phase, multiple agents interpret evaluation scores, diagnose weaknesses, and generate concrete revision directives, which are stored as reusable reasoning assets. In the Test Phase, an analyzer agent retrieves relevant exemplars and assets for a new prompt, and a refiner agent applies evidence-based edits to improve the prompt and its response. By grounding optimization in structured reasoning, MA-SAPO keeps edits interpretable, auditable, and controllable. Experiments on the HelpSteer1/2 benchmarks show that our framework consistently outperforms single-pass prompting, retrieval-augmented generation, and prior multi-agent methods across multiple evaluation metrics.
Comments: Preprint
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Cite as: arXiv:2510.16635 [cs.MA]
(or arXiv:2510.16635v2 [cs.MA] for this version)
https://doi.org/10.48550/arXiv.2510.16635
arXiv-issued DOI via DataCite
Submission history
From: Wonduk Seo
[v1] Sat, 18 Oct 2025 20:21:09 UTC (1,304 KB)
[v2] Mon, 30 Mar 2026 21:53:02 UTC (3,435 KB)