
NVIDIA Extreme Co-Design Delivers New MLPerf Inference Records

NVIDIA Developer Blog | by Ashraf Eassa | April 1, 2026

Co-designed hardware, software, and models are key to delivering the highest AI factory throughput and lowest token cost. Measuring this goes far beyond peak chip specifications. Rigorous AI inference performance benchmarks are critical to understanding real-world token output, which drives AI factory revenue.

MLPerf Inference v6.0 is the latest in a series of industry benchmarks that measure performance across a wide range of model architectures and use cases. In this latest round, systems powered by NVIDIA Blackwell Ultra GPUs delivered the highest throughput across the widest range of models and scenarios. This brings the cumulative NVIDIA MLPerf training and inference wins since 2018 to 291, 9x the total of all other submitters combined.

This round, the NVIDIA partner ecosystem participated broadly, with 14 partners submitting—the largest number on any platform. ASUS, Cisco, CoreWeave, Dell Technologies, GigaComputing, Google Cloud, HPE, Lenovo, Nebius, Netweb Technology, Quanta Cloud Technology (QCT), Red Hat, Supermicro, and Lambda all delivered excellent performance on the NVIDIA platform.

Figure 1. NVIDIA delivers 9x more cumulative MLPerf training and inference wins

This post takes a closer look at the latest benchmark updates, the industry-leading performance achieved on the NVIDIA platform, and the full-stack engineering that makes it possible.

New benchmarks, new performance records

The MLPerf Inference benchmark suite is routinely updated to ensure that it reflects models, modalities, use cases, and deployment scenarios that matter to the community. Only the NVIDIA platform submitted results on all newly added models and scenarios this round, and delivered the highest performance across all of them.

This round of MLPerf Inference added several new tests, including:

  • DeepSeek-R1 Interactive: Following the addition of the DeepSeek-R1 reasoning LLM, based on a sparse mixture-of-experts (MoE) architecture, in MLPerf Inference v5.1, MLCommons added a new Interactive scenario with a 5x higher minimum token rate and a 1.3x shorter time to first token compared to the Server scenario, representing higher-interactivity deployments.

  • Qwen3-VL-235B-A22B: Vision-language model with 235B total parameters, the first multimodal model in the MLPerf Inference suite. Two scenarios are tested: Offline and Server.

  • GPT-OSS-120B: 120B-parameter MoE reasoning LLM developed by OpenAI. Three scenarios are tested: Offline, Server, and Interactive.

  • WAN-2.2-T2V-A14B: 14B-parameter text-to-video generative AI model. Two scenarios are tested: Single Stream, which measures the latency to process a single video-generation request, and Offline, which measures the number of samples processed per second in a batch-processing scenario.

  • DLRMv3: A generative recommendation benchmark that replaces the DLRM-DCNv2 test. It uses a transformer-based architecture that increases model size and compute intensity compared to the prior benchmark. Two scenarios are tested: Offline and Server.

| Benchmark | DeepSeek-R1 | GPT-OSS-120B | Qwen3-VL | Wan 2.2 | DLRMv3 |
|---|---|---|---|---|---|
| Offline | 2,494,310 tokens/sec | 1,046,150 tokens/sec | 79 samples/sec | 0.059 samples/sec | 104,637 samples/sec |
| Server | 1,555,110 tokens/sec | 1,096,770 tokens/sec | 68 queries/sec | 21 secs** (Single Stream) | 99,997 queries/sec |
| Interactive | 250,634 tokens/sec | 677,199 tokens/sec | *** | *** | *** |

Table 1. NVIDIA platform throughput on newly added workloads and scenarios in MLPerf Inference v6.0

* Not a new scenario in MLPerf Inference v6.0
** Wan 2.2 features a Single Stream scenario, which measures end-to-end request latency, instead of a Server scenario. Lower is better.
*** Not tested in MLPerf Inference v6.0

MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 6.0-0039, 6.0-0073, 6.0-0075, 6.0-0076, 6.0-0078, 6.0-0081, 6.0-0094. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

Figure 2. NVIDIA achieves a 2.7x performance gain and 2.5M tokens/sec on DeepSeek-R1

NVIDIA TensorRT-LLM software updates unlock up to 2.7x performance gains on the same Blackwell Ultra GPUs

NVIDIA continually optimizes the performance of its software stack to increase delivered token throughput from existing platforms. This delivers reductions in token production cost and enables AI factory operators to serve more users to generate more revenue with a given infrastructure footprint.

The additional performance also provides headroom to run future AI models and serve existing models in demanding scenarios, such as higher token rates and longer contexts. This continual improvement makes it possible for NVIDIA GPUs introduced years ago to remain productive, at high utilization rates, in the cloud.

This round, NVIDIA GB300 NVL72—launched last year—delivered up to 2.7x higher token throughput compared to its debut submissions just six months ago on the server scenario of the DeepSeek-R1 benchmark¹. This means 2.7x more tokens from the same GB300 NVL72-based infrastructure and power footprint, reducing the cost to manufacture each token by more than 60%. This speedup, achieved by NVIDIA partner Nebius, showcases a core advantage of the NVIDIA platform: an open, expansive ecosystem where customers and partners can uniquely optimize and innovate on top of our software stack.

¹ MLPerf Inference v5.1 and v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 5.1-0072, 6.0-0081. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
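The cost claim follows directly from the speedup: for a fixed infrastructure and power budget, cost per token scales inversely with throughput. A quick sketch of the arithmetic:

```python
# Cost per token scales inversely with throughput for a fixed
# infrastructure and power budget, so a throughput speedup of S
# cuts cost per token by a factor of 1 - 1/S.

def cost_per_token_reduction(speedup: float) -> float:
    """Fractional reduction in cost per token for a given throughput speedup."""
    return 1.0 - 1.0 / speedup

# The 2.7x DeepSeek-R1 server-scenario speedup on GB300 NVL72:
print(f"{cost_per_token_reduction(2.7):.0%}")  # → 63%, i.e. "more than 60%"
```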

Powering the DeepSeek R1 performance improvements in the server and offline scenarios were several software enhancements, including:

  • Faster kernels—this included a combination of higher-performance kernels and the use of fewer kernels because of kernel fusions.

  • Optimized Attention Data Parallel—Better balancing of context requests between different ranks, enabling significant speedups in end-to-end performance.
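Kernel fusion is conceptually simple: instead of launching one kernel per elementwise operation, each making a full pass over the activations in memory, several operations are combined into a single pass. A pure-Python sketch of the idea (illustrative only; TensorRT-LLM's real fused kernels are compiled CUDA, and the bias-GELU example here is a generic stand-in, not a kernel named in this post):

```python
import math

# Conceptual sketch of kernel fusion. Each unfused step below makes a
# full pass over the data, analogous to a separate GPU kernel launch
# that reads and writes the whole tensor.

def bias_gelu_unfused(xs, bias):
    ys = [x + bias for x in xs]                              # pass 1: bias add
    ts = [0.7978845608 * (y + 0.044715 * y**3) for y in ys]  # pass 2: inner poly
    return [0.5 * y * (1.0 + math.tanh(t)) for y, t in zip(ys, ts)]  # pass 3

def bias_gelu_fused(xs, bias):
    # One traversal evaluates the whole expression, mirroring how a
    # fused kernel avoids intermediate memory round-trips.
    out = []
    for x in xs:
        y = x + bias
        out.append(0.5 * y * (1.0 + math.tanh(0.7978845608 * (y + 0.044715 * y**3))))
    return out

xs = [-1.0, 0.0, 0.5, 2.0]
assert bias_gelu_unfused(xs, 0.1) == bias_gelu_fused(xs, 0.1)  # same math, fewer passes
```

On a GPU, the fused version saves both kernel-launch overhead and, more importantly, the memory bandwidth spent writing and re-reading intermediates.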

The latest features of the open source NVIDIA TensorRT-LLM inference serving software and the NVIDIA Dynamo open source distributed inference serving framework were used to support the newly added and more challenging DeepSeek-R1 Interactive scenario. This includes:

  • Disaggregated serving: This capability in Dynamo separates the prefill and decode phases of inference and optimizes each phase's configuration individually, enabling optimal overall throughput.

  • Wide Expert Parallel (WideEP): For higher-interactivity scenarios, execution time for MoE models is bound by expert weight load time. By splitting, or sharding, the experts across multiple GPUs across NVL72 nodes, this bottleneck is reduced, improving end-to-end performance.

  • Multi-Token Prediction (MTP): At higher interactivity levels, batch sizes are smaller, and performance is dominated by how quickly weights can be loaded into memory, leaving compute underutilized. By applying compute that would otherwise go unused to predict and verify additional tokens in parallel (up to three in this implementation), throughput at high interactivity is increased.

  • KV-aware routing: This capability of Dynamo routes inference requests by evaluating their compute costs across different workers.
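A toy sketch of the KV-aware routing idea (hypothetical scoring, not Dynamo's actual API or heuristic): favor the worker whose KV cache already holds the longest shared prefix of the incoming request, since cached prefix tokens need no prefill compute, while penalizing workers with more queued work.

```python
# Hypothetical sketch of KV-aware routing; names and scoring are
# illustrative, not Dynamo's real implementation.

def route(request_tokens, workers):
    """Pick a worker index. Each worker is a dict with a 'cached_prefix'
    token list (its KV cache contents) and a 'load' (queued work)."""
    def score(w):
        # Length of the shared prefix between request and cached tokens:
        # these tokens can be served from KV cache, skipping prefill.
        overlap = 0
        for a, b in zip(request_tokens, w["cached_prefix"]):
            if a != b:
                break
            overlap += 1
        # Prefer high cache reuse, penalize queued work.
        return overlap - w["load"]
    return max(range(len(workers)), key=lambda i: score(workers[i]))

workers = [
    {"cached_prefix": [1, 2, 3, 4], "load": 0},  # most of the prefix cached
    {"cached_prefix": [1, 9],       "load": 0},  # little overlap
]
print(route([1, 2, 3, 4, 5], workers))  # → 0 (best prefix reuse)
```

A real router would also weigh estimated decode cost and memory pressure; the point is that routing on compute cost, not just round-robin, keeps expensive prefill work off workers that would have to redo it.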

NVIDIA was the first and only platform to submit DeepSeek-R1 results on MLPerf Inference when the benchmark debuted last year. This round, NVIDIA not only increased performance on returning scenarios for DeepSeek-R1 but was once again the only platform to submit on the newly added Interactive scenario.

Even on Llama 3.1 405B—a very large, dense LLM launched almost two years ago—GB300 NVL72 performance increased by 1.5x in the server scenario.

| Benchmark | GB300 NVL72 v5.1 | GB300 NVL72 v6.0 | Speedup |
|---|---|---|---|
| DeepSeek-R1 (Server) | 2,907 tokens/sec/GPU | 8,064 tokens/sec/GPU | 2.77x |
| DeepSeek-R1 (Offline) | 5,842 tokens/sec/GPU | 9,821 tokens/sec/GPU | 1.68x |
| Llama 3.1 405B (Server) | 170 tokens/sec/GPU | 259 tokens/sec/GPU | 1.52x |
| Llama 3.1 405B (Offline) | 224 tokens/sec/GPU | 271 tokens/sec/GPU | 1.21x |

Table 2. Performance improvements, normalized on a per-GPU basis, on DeepSeek-R1 and Llama 3.1 405B server and offline scenarios in v6.0 compared to v5.1

MLPerf Inference v5.1 and v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 5.1-0072, 6.0-0017, 6.0-0078, 6.0-0082. Per chip performance is derived by dividing total throughput by the number of reported chips. Per-chip performance is not a primary metric of MLPerf Inference v5.1 or v6.0. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
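The speedup column of Table 2 can be reproduced from its per-GPU figures (each of which, per the footnote, is total throughput divided by reported chip count):

```python
# Reproducing the Speedup column of Table 2 from the per-GPU throughputs
# (tokens/sec/GPU in v5.1 vs. v6.0).
table2 = {
    "DeepSeek-R1 (Server)":     (2907, 8064),
    "DeepSeek-R1 (Offline)":    (5842, 9821),
    "Llama 3.1 405B (Server)":  (170, 259),
    "Llama 3.1 405B (Offline)": (224, 271),
}
speedups = {name: round(v60 / v51, 2) for name, (v51, v60) in table2.items()}
print(speedups)
# → {'DeepSeek-R1 (Server)': 2.77, 'DeepSeek-R1 (Offline)': 1.68,
#    'Llama 3.1 405B (Server)': 1.52, 'Llama 3.1 405B (Offline)': 1.21}
```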

Additionally, NVIDIA submissions on the newly added multimodal, video generation, and recommendation benchmarks were powered by open source software frameworks optimized for the NVIDIA platform. The Qwen3-VL vision-language submission used the vLLM open source framework, showing how the community is rapidly building advanced multimodal optimizations to accelerate image-heavy inference workloads on the latest GPUs like NVIDIA Blackwell Ultra. The WAN-2.2 text-to-video submission used the TensorRT-LLM VisualGen, which accelerates diffusion-based video generation pipelines on NVIDIA GPUs.

For DLRMv3, the submission was built on two open-source projects: the NVIDIA recsys-example for high-performance transformer-based recommendation inference, and NV Embedding Cache for GPU-accelerated embedding table lookups. Both were critical to achieving record throughput on this more demanding generative recommendation benchmark.

Through extensive and ongoing engineering, NVIDIA continually increases the performance of existing models on existing hardware, as evidenced by these results. At the same time, NVIDIA collaborates closely with model builders and open source inference frameworks to ensure that the latest models run on the NVIDIA platform on the day of launch.

Scale-out inference with NVIDIA Quantum-X800 InfiniBand platform enables millions of tokens per second

NVIDIA also set new throughput records at scale on the DeepSeek-R1 model in the offline and server scenarios by submitting results using four GB300 NVL72 systems interconnected with NVIDIA Quantum-X800 InfiniBand scale-out networking.

DeepSeek-R1 | 4x GB300 NVL72

| Scenario | Tokens/Second |
|---|---|
| Offline | 2,494,310 |
| Server | 1,555,110 |

Table 3. DeepSeek-R1 throughput on four GB300 NVL72 systems scaled up with NVLink and scaled out with NVIDIA Quantum-X800 InfiniBand

MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 6.0-0076. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

With 288 Blackwell Ultra GPUs—the largest scale ever submitted to any MLPerf Inference benchmark—these submissions set new system-level throughput records, processing millions of tokens per second.
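Normalizing the record-scale numbers per GPU is a useful sanity check (illustrative arithmetic only; per-chip throughput is not a primary MLPerf metric):

```python
# Per-GPU view of the 4x GB300 NVL72 scale-out records in Table 3.
gpus = 4 * 72                # four NVL72 systems, 72 Blackwell Ultra GPUs each
offline_total = 2_494_310    # DeepSeek-R1 Offline, tokens/sec
server_total = 1_555_110     # DeepSeek-R1 Server, tokens/sec

print(gpus)                          # → 288
print(round(offline_total / gpus))   # → 8661 tokens/sec/GPU
print(round(server_total / gpus))    # → 5400 tokens/sec/GPU
```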

Looking ahead to MLPerf Endpoints

Delivering inference throughput takes extreme co-design across chips, system architecture, data center design, and software. The latest MLPerf Inference v6.0 results show that the NVIDIA platform delivers unmatched inference throughput across the broadest range of workloads—from massive LLMs to advanced vision-language models to generative recommender systems and more—on industry-standard benchmarks.

AI inference workloads also continue to evolve rapidly, as model sizes grow and context lengths rise. As agentic AI becomes more prevalent, premium use cases that require ultra-fast token rates are emerging.

NVIDIA has been working, as part of the MLCommons consortium, to lead the definition of the MLPerf Endpoints benchmark. MLPerf Endpoints will give the community a rigorous, auditable picture of how deployed services perform under real API traffic—capturing key performance metrics that chip-level benchmarks alone cannot reveal—while providing the rigor and result integrity that defines MLPerf benchmarks.

To explore the latest performance on the NVIDIA platform across training, inference, and high-performance computing, please see our deep learning product performance page.

Acknowledgements

NVIDIA MLPerf Inference v6.0 results reflect the work of many talented engineers across the company. We’d like to acknowledge the contributions of the following individuals (sorted by last name):

Tomar Bar-on, Nitin Sai Bommi, Viraat Chandra, Alice Cheng, Jerry Chen, Xiaoming Chen, Jesus Corbal San Adrian, Ashutosh Dhar, Kefeng Duan, Wookje Han, Kyle Huang, Kris Hung, Rashid Kaleem, Khubaib Khubaib, Zihao Kong, Tin-Yin Lai, Tao Li, Forrest Lin, Wanqian Li, Alex Liu, Jintao Peng, Yuxian Qiu, Junyi Qiu, Xiaowei Shi, Olivia Stoner, Jacob Subag, Tong Tong, Harshil Vagadia, Shobhit Verma, June Yang, Tailing Yuan, Ben Zhang… and many others across NVIDIA whose efforts made these results possible.
