
Improving Efficiency of GPU Kernel Optimization Agents using a Domain-Specific Language and Speed-of-Light Guidance

arXiv cs.LG, by Siva Kumar Sastry Hari, Vignesh Balaji, Sana Damani, Qijing Huang, Christos Kozyrakis. April 1, 2026.




Abstract: Optimizing GPU kernels with LLM agents is an iterative process over a large design space. Every candidate must be generated, compiled, validated, and profiled, so fewer trials will save both runtime and cost. We make two key observations. First, the abstraction level that agents operate at is important. If it is too low, the LLM wastes reasoning on low-impact details. If it is too high, it may miss important optimization choices. Second, agents cannot easily tell when they reach the point of diminishing returns, wasting resources as they continue searching. These observations motivate two design principles to improve efficiency: (1) a compact domain-specific language (DSL) that can be learned in context and lets the model reason at a higher level while preserving important optimization levers, and (2) Speed-of-Light (SOL) guidance that uses first-principles performance bounds to steer and budget search. We implement these principles in $\mu$CUTLASS, a DSL with a compiler for CUTLASS-backed GPU kernels that covers kernel configuration, epilogue fusion, and multi-stage pipelines. We use SOL guidance to estimate headroom and guide optimization trials, deprioritize problems that are near SOL, and flag kernels that game the benchmark. On 59 KernelBench problems with the same iteration budgets, switching from generating low-level code to DSL code using GPT-5-mini turns a 0.40x geomean regression into a 1.27x speedup over PyTorch. Adding SOL-guided steering raises this to 1.56x. Across model tiers, $\mu$CUTLASS + SOL-guidance lets weaker models outperform stronger baseline agents at lower token cost. SOL-guided budgeting saves 19-43% of tokens while retaining at least 95% of geomean speedup, with the best policy reaching a 1.68x efficiency gain. Lastly, SOL analysis helps detect benchmark-gaming cases, where kernels may appear fast while failing to perform the intended computation.
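The abstract does not spell out how the SOL bound is computed, so the following is only a minimal sketch of the general idea, assuming a simple roofline-style bound (the larger of compute time and memory-traffic time). The `Gpu` figures, the `near_sol` threshold, and the function names `sol_time`, `headroom`, and `triage` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    peak_flops: float      # peak arithmetic throughput in FLOP/s (hypothetical figure)
    peak_bandwidth: float  # peak DRAM bandwidth in bytes/s (hypothetical figure)


def sol_time(flops: float, bytes_moved: float, gpu: Gpu) -> float:
    """Roofline-style lower bound on runtime in seconds (an assumed SOL model)."""
    return max(flops / gpu.peak_flops, bytes_moved / gpu.peak_bandwidth)


def headroom(measured_s: float, sol_s: float) -> float:
    """Ratio of measured runtime to the bound; 1.0 means the kernel sits at SOL."""
    return measured_s / sol_s


def triage(measured_s: float, sol_s: float, near_sol: float = 1.25) -> str:
    """Classify a kernel for search budgeting; near_sol is a hypothetical cutoff."""
    h = headroom(measured_s, sol_s)
    if h < 1.0:
        # Faster than the first-principles bound: the kernel may be skipping work.
        return "suspect"
    if h <= near_sol:
        # Little left to gain, so spend the trial budget elsewhere.
        return "deprioritize"
    return "optimize"


# Example: a 4096 x 4096 x 4096 half-precision matrix multiply.
n = 4096
gemm_flops = 2 * n**3          # one multiply and one add per inner-product term
gemm_bytes = 3 * n * n * 2     # read A and B, write C, 2 bytes per element
gpu = Gpu(peak_flops=1e14, peak_bandwidth=1e12)
sol = sol_time(gemm_flops, gemm_bytes, gpu)

for measured in (5e-3, 1.5e-3, 1e-4):
    print(f"{measured:.1e} s -> {triage(measured, sol)}")
```

Under these assumed numbers the matrix multiply is compute-bound, so the bound comes from the FLOP term. A measured time below the bound is treated as a signal worth inspecting rather than as proof of benchmark gaming, since a loose hardware figure could produce the same result.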

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Cite as: arXiv:2603.29010 [cs.LG] (or arXiv:2603.29010v1 [cs.LG] for this version)

https://doi.org/10.48550/arXiv.2603.29010

arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Siva Kumar Sastry Hari. [v1] Mon, 30 Mar 2026 21:16:39 UTC (6,047 KB)

Original source

arXiv cs.LG

https://arxiv.org/abs/2603.29010
