
Design Principles for the Construction of a Benchmark Evaluating Security Operation Capabilities of Multi-agent AI Systems

arXiv cs.CR [Submitted on 30 Mar 2026] April 1, 2026

Abstract: As Large Language Models (LLMs) and multi-agent AI systems demonstrate increasing potential in cybersecurity operations, organizations, policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such AI systems to achieve more autonomous SOCs (security operation centers) and reduce manual effort. In particular, the AI and cybersecurity communities have recently developed several benchmarks for evaluating the red team capabilities of multi-agent AI systems. However, because the operations in SOCs are dominated by blue team operations, the capabilities of AI systems and agents to achieve more autonomous SOCs cannot be evaluated without a benchmark focused on blue team operations. To the best of our knowledge, no systematic benchmark for evaluating coordinated multi-task blue team AI has been proposed in the literature; existing blue team benchmarks each focus on a particular task. The goal of this work is to develop a set of design principles for the construction of a benchmark, denoted SOC-bench, to evaluate the blue team capabilities of AI. Following these design principles, we have developed a conceptual design of SOC-bench, which consists of a family of five blue team tasks in the context of large-scale ransomware attack incident response.

Comments: 29 pages, 1 figure

Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)

ACM classes: K.6.5; I.2.11

Cite as: arXiv:2603.28998 [cs.CR]

(or arXiv:2603.28998v1 [cs.CR] for this version)

https://doi.org/10.48550/arXiv.2603.28998

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yicheng Cai
[v1] Mon, 30 Mar 2026 21:01:00 UTC (135 KB)

Original source

arXiv cs.CR

https://arxiv.org/abs/2603.28998
