Detecting Multi-Agent Collusion Through Multi-Agent Interpretability
arXiv:2604.01151v1 Announce Type: cross Abstract: As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level. Our probes achieve 1.00 AUROC in-distribution and 0.60--0.86 AUROC when transferred zero-shot to structu
View PDF HTML (experimental)
Abstract:As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level. Our probes achieve 1.00 AUROC in-distribution and 0.60--0.86 AUROC when transferred zero-shot to structurally different multi-agent scenarios and a steganographic blackjack card-counting task. We find that no single probing technique dominates across all collusion types, suggesting that different forms of collusion manifest differently in activation space. We also find preliminary evidence that this signal is localised at the token level, with the colluding agent's activations spiking specifically when processing the encoded parts of their partner's message. This work takes a step toward multi-agent interpretability: extending white-box inspection from single models to multi-agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text-level monitoring for detecting multi-agent collusion, particularly for organisations with access to model activations. Code and data are available at this https URL.
Subjects:
Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Cite as: arXiv:2604.01151 [cs.AI]
(or arXiv:2604.01151v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2604.01151
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: Aaron Rose [view email] [v1] Wed, 1 Apr 2026 17:08:05 UTC (93 KB)
Sign in to highlight and annotate this article

Conversation starters
Daily AI Digest
Get the top 5 AI stories delivered to your inbox every morning.
More about
modelbenchmarkannounce
Production RAG: From Anti-Patterns to Platform Engineering
RAG is a distributed system . It becomes clear when moving beyond demos into production. It consists of independent services such as ingestion, retrieval, inference, orchestration, and observability. Each component introduces its own latency, scaling characteristics, and failure modes, making coordination, observability, and fault tolerance essential. RAG flowchart In regulated environments such as banking, these systems must also satisfy strict governance, auditability, and change-control requirements aligned with standards like SOX and PCI DSS. This article builds on existing frameworks like 12 Factor Agents (Dex Horthy)¹ and Google’s 16 Factor App² by exploring key anti-patterns and introducing the pillars required to take a typical RAG pipeline to production. I’ve included code snippet

Word2Vec Explained: The Moment Words Became Relations
How models first learned meaning from context — and why that changed everything In the first post, we built the base layer: Text → Tokens → Numbers → (lots of math) → Tokens → Text In the second post, we stayed with the deeper question: Once words become numbers, how does meaning not disappear? We saw that the answer is not “because numbers are magical.” The answer is this: the numbers are learned in a space that preserves relationships. That was the real story of embeddings. Now we are ready for the next step. Because once you accept that words can become numbers without losing meaning, the next question becomes unavoidable: How are those numbers actually learned? This is where Word2Vec enters the story. And Word2Vec matters for more than historical reasons. It was not just a clever neura
Knowledge Map
Connected Articles — Knowledge Graph
This article is connected to other articles through shared AI topics and tags.
More in Models

TurboQuant Explained: Extreme AI Compression for Faster, Cheaper LLM Inference and Vector Search
If you’ve been following the “long-context” wave in AI, you’ve probably heard the same story: bigger context windows feel magical… until… Continue reading on Towards AI »

Word2Vec Explained: The Moment Words Became Relations
How models first learned meaning from context — and why that changed everything In the first post, we built the base layer: Text → Tokens → Numbers → (lots of math) → Tokens → Text In the second post, we stayed with the deeper question: Once words become numbers, how does meaning not disappear? We saw that the answer is not “because numbers are magical.” The answer is this: the numbers are learned in a space that preserves relationships. That was the real story of embeddings. Now we are ready for the next step. Because once you accept that words can become numbers without losing meaning, the next question becomes unavoidable: How are those numbers actually learned? This is where Word2Vec enters the story. And Word2Vec matters for more than historical reasons. It was not just a clever neura

Chinese AI rivals clash over Anthropic’s OpenClaw exit amid global token crunch
Chinese tech companies are engaged in a public war of words as they compete to capitalise on US start-up Anthropic’s decision to pull its industry-leading Claude models from open-source AI agent tool OpenClaw. The development comes as AI agents have triggered a huge increase in demand for AI tokens – the core metric of AI usage – raising questions about the long-term ability of industry players to meet this demand amid a growing global crunch in computational power. On Sunday, Anthropic...



Discussion
Sign in to join the discussion
No comments yet — be the first to share your thoughts!