Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
Authors: He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang, Sheng Yuan, Qinxiu Cheng, Xinchen Xie, Yicheng Chen, Yining Li, Jiaxing Xie, Huanan Dong, Yaguang Wu, Xiangjun Huang, Jian Yang, Hui Wang, Bowen Zhou, Bowen Li, Qipeng Guo, Kai Chen
Abstract: We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and MACA on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with the NVIDIA Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting the potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
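The two mechanisms named in the abstract, an archive-guided evolutionary loop and the retention of correctness-preserving, high-gain revisions, can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in and not the authors' implementation: the names `Feedback`, `Candidate`, `evolve`, and `high_gain_steps`, the `min_gain` threshold, and the toy evaluator and proposer are all assumptions. In particular, "diversity" is approximated here only by deduplicating on source text, and the LLM proposal step and the real evaluation services are replaced by toy functions.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    """Structured execution feedback (fields assumed from the abstract)."""
    compiled: bool
    correct: bool
    speedup: float  # baseline time / candidate time; 0.0 if unusable


@dataclass
class Candidate:
    source: str
    feedback: Feedback
    parent: Optional["Candidate"] = None


def evolve(seed: str,
           evaluate: Callable[[str], Feedback],
           propose: Callable[[Candidate, List[Candidate]], str],
           steps: int = 10,
           archive_size: int = 4,
           rng: Optional[random.Random] = None) -> List[Candidate]:
    """Grow a population by repeatedly revising a parent drawn from the archive."""
    rng = rng or random.Random(0)
    population = [Candidate(seed, evaluate(seed))]
    for _ in range(steps):
        # Archive: fastest correct candidates, deduplicated by source text
        # (a crude stand-in for a real diversity criterion).
        valid = [c for c in population if c.feedback.correct]
        archive, seen = [], set()
        for c in sorted(valid, key=lambda c: c.feedback.speedup, reverse=True):
            if c.source not in seen:
                seen.add(c.source)
                archive.append(c)
            if len(archive) == archive_size:
                break
        parent = rng.choice(archive) if archive else population[0]
        child_source = propose(parent, archive)
        population.append(Candidate(child_source, evaluate(child_source), parent))
    return population


def high_gain_steps(population: List[Candidate],
                    min_gain: float = 1.1) -> List[Tuple[Candidate, Candidate]]:
    """Keep (parent, child) pairs where both are correct and the child is faster."""
    kept = []
    for child in population:
        parent = child.parent
        if parent is None:
            continue
        if not (parent.feedback.correct and child.feedback.correct):
            continue
        if child.feedback.speedup >= min_gain * parent.feedback.speedup:
            kept.append((parent, child))
    return kept


# Toy stand-ins: the "kernel" is just a block size encoded as a string.
def toy_evaluate(source: str) -> Feedback:
    block = int(source)
    correct = block <= 256  # pretend larger blocks fail the correctness check
    return Feedback(True, correct, block / 32 if correct else 0.0)


def toy_propose(parent: Candidate, archive: List[Candidate]) -> str:
    return str(int(parent.source) * 2)  # a real system would query an LLM here


population = evolve("32", toy_evaluate, toy_propose, steps=10)
training_pairs = high_gain_steps(population)
```

In this sketch each retained pair is a single revision step rather than a whole trajectory, which mirrors the abstract's framing of the model as a local improver rather than a one-shot generator.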
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2603.28342 [cs.CL]
(or arXiv:2603.28342v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2603.28342
arXiv-issued DOI via DataCite (pending registration)
Submission history
From: He Du [v1] Mon, 30 Mar 2026 12:12:49 UTC (994 KB)