Zero-Shot Coordination in Ad Hoc Teams with Generalized Policy Improvement and Difference Rewards
Abstract: Real-world multi-agent systems may require ad hoc teaming, where an agent must coordinate with other previously unseen teammates to solve a task in a zero-shot manner. Prior work often either selects a pretrained policy based on an inferred model of the new teammates or pretrains a single policy that is robust to potential teammates. Instead, we propose to leverage all pretrained policies in a zero-shot transfer setting. We formalize this problem as an ad hoc multi-agent Markov decision process and present a solution that uses two key ideas, generalized policy improvement and difference rewards, for efficient and effective knowledge transfer between different teams. We empirically demonstrate that our algorithm, Generalized Policy improvement for Ad hoc Teaming (GPAT), successfully enables zero-shot transfer to new teams in three simulated environments: cooperative foraging, predator-prey, and Overcooked. We also demonstrate our algorithm in a real-world multi-robot setting.
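The two ideas named in the abstract can be illustrated with their standard textbook forms. The sketch below is not the GPAT algorithm itself, whose details are not given on this page; it only assumes the usual definitions: generalized policy improvement picks the action that maximizes the best action value across a set of pretrained policies, and a difference reward is the team reward minus a counterfactual team reward in which one agent's action is replaced by a default. The function names, the toy reward, and the Q-values are all hypothetical.

```python
import numpy as np


def gpi_action(q_values):
    """Generalized policy improvement for a single state (textbook form).

    q_values has shape (num_policies, num_actions): one row of action
    values per pretrained policy. The chosen action maximizes the
    best value any policy assigns to it.
    """
    q_values = np.asarray(q_values, dtype=float)
    best_per_action = q_values.max(axis=0)  # max over policies
    return int(np.argmax(best_per_action))  # argmax over actions


def difference_reward(team_reward, joint_action, agent, default_action):
    """Difference reward for one agent (textbook form).

    Team reward of the actual joint action minus the team reward when
    this agent's action is swapped for a default action.
    """
    counterfactual = list(joint_action)
    counterfactual[agent] = default_action
    return team_reward(joint_action) - team_reward(counterfactual)


def toy_team_reward(joint_action):
    # Hypothetical reward: one point per agent that picks action 1.
    return float(sum(1 for a in joint_action if a == 1))


# Hypothetical action values for two pretrained policies, three actions.
toy_q = [[0.2, 0.9, 0.1],
         [0.5, 0.3, 0.95]]

chosen = gpi_action(toy_q)
contribution = difference_reward(toy_team_reward, [1, 0, 1],
                                 agent=0, default_action=0)
```

In this toy case the per-action maxima are 0.5, 0.9, and 0.95, so the third action is chosen, and agent 0's difference reward is the one point the team would lose if that agent fell back to the default action.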
Comments: 10 pages, 8 figures. To appear in proceedings of 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Cite as: arXiv:2510.16187 [cs.MA]
(or arXiv:2510.16187v2 [cs.MA] for this version)
https://doi.org/10.48550/arXiv.2510.16187
arXiv-issued DOI via DataCite
Related DOI: https://doi.org/10.65109/TNEX7143
Submission history
From: Rupal Nigam
[v1] Fri, 17 Oct 2025 19:55:25 UTC (1,805 KB)
[v2] Tue, 31 Mar 2026 17:21:11 UTC (5,734 KB)