Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

arXiv cs.MA [Submitted on 1 Apr 2026]

Abstract: As LLM agents are increasingly deployed in multi-agent systems, they introduce risks of covert coordination that may evade standard forms of human oversight. While linear probes on model activations have shown promise for detecting deception in single-agent settings, collusion is inherently a multi-agent phenomenon, and the use of internal representations for detecting collusion between agents remains unexplored. We introduce NARCBench, a benchmark for evaluating collusion detection under environment distribution shift, and propose five probing techniques that aggregate per-agent deception scores to classify scenarios at the group level. Our probes achieve 1.00 AUROC in-distribution and 0.60–0.86 AUROC when transferred zero-shot to structurally different multi-agent scenarios and a steganographic blackjack card-counting task. We find that no single probing technique dominates across all collusion types, suggesting that different forms of collusion manifest differently in activation space. We also find preliminary evidence that this signal is localised at the token level, with the colluding agent's activations spiking specifically when processing the encoded parts of their partner's message. This work takes a step toward multi-agent interpretability: extending white-box inspection from single models to multi-agent contexts, where detection requires aggregating signals across agents. These results suggest that model internals provide a complementary signal to text-level monitoring for detecting multi-agent collusion, particularly for organisations with access to model activations. Code and data are available at this https URL.
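As a rough illustration of the group-level idea described in the abstract, the sketch below scores each agent with a linear probe, aggregates the per-agent scores into a single scenario score, and measures how well that score separates colluding from non-colluding scenarios with AUROC. The aggregation rules shown here (`max`, `mean`, `top2_mean`), the function names, and the probe weights are illustrative assumptions; the abstract does not name the five techniques the authors propose, and this is not their implementation.

```python
import numpy as np


def agent_scores(activations, w, b=0.0):
    """Apply a linear probe to each agent's activation vector.

    activations: (n_agents, d) array-like, one row per agent (hypothetical input).
    w, b: probe weights and bias (assumed to be already trained).
    Returns an array of n_agents deception scores.
    """
    acts = np.asarray(activations, dtype=float)
    return acts @ np.asarray(w, dtype=float) + b


def aggregate(scores, how="max"):
    """Collapse per-agent scores into one scenario-level score.

    These three rules are placeholders chosen for illustration only.
    """
    scores = np.asarray(scores, dtype=float)
    if how == "max":
        return float(np.max(scores))
    if how == "mean":
        return float(np.mean(scores))
    if how == "top2_mean":
        # Mean of the two highest scores: collusion needs at least two agents.
        return float(np.mean(np.sort(scores)[-2:]))
    raise ValueError(f"unknown aggregation rule: {how}")


def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random
    negative, counting ties as one half."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative label")
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def scenario_score(activations, w, how="max"):
    """Score one multi-agent scenario end to end."""
    return aggregate(agent_scores(activations, w), how)
```

Comparing `auroc` across the aggregation rules on the same set of scenarios is one simple way to check whether a single rule works best everywhere; the abstract reports that, for the authors' own probes, none does.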

Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Cite as: arXiv:2604.01151 [cs.AI] (or arXiv:2604.01151v1 [cs.AI] for this version)

DOI: https://doi.org/10.48550/arXiv.2604.01151 (arXiv-issued DOI via DataCite, pending registration)

Submission history

From: Aaron Rose
[v1] Wed, 1 Apr 2026 17:08:05 UTC (93 KB)

Original source: arXiv cs.MA, https://arxiv.org/abs/2604.01151
