Nvidia sets new MLPerf records with 288 GPUs while AMD and Intel focus on different battles
The latest round of the industry's top inference benchmark introduces multimodal and video models for the first time. Nvidia, AMD, and Intel each highlight different metrics, making direct comparisons difficult. The article Nvidia sets new MLPerf records with 288 GPUs while AMD and Intel focus on different battles appeared first on The Decoder.
Benchmark organization MLCommons published the results of MLPerf Inference v6.0 on April 1, 2026. All three major chipmakers submitted results and claimed top spots. But the results are only partially comparable: Nvidia, AMD, and Intel use different system configurations, models, and scenarios, and each company frames its numbers to put its own strengths front and center.
Nvidia, for example, showcases its records primarily on DeepSeek-R1 and the new GPT-OSS-120B, sometimes using 288-GPU configurations. AMD compares itself to Nvidia's B200 and B300 in single-node setups with eight GPUs but didn't submit results for DeepSeek-R1 or the multimodal Qwen3-VL. Intel targets an entirely different market segment, competing with workstation GPUs. Anyone trying to make sense of these numbers needs to keep these differences in mind.
Notably absent are submissions from Google for its latest Ironwood-generation TPU chips or inference specialists like Cerebras.
Five new benchmarks significantly expand the test suite
MLPerf Inference v6.0 adds several new tests: an interactive scenario for DeepSeek-R1 with a five-times-higher minimum token rate, the vision-language model Qwen3-VL-235B as the suite's first multimodal model, OpenAI's GPT-OSS-120B, the text-to-video model WAN-2.2-T2V, and the transformer-based recommendation benchmark DLRMv3. Only Nvidia submitted results for all new models and scenarios.
Software optimizations alone double Nvidia's throughput on the same hardware
According to Nvidia, the GB300-NVL72 system with Blackwell Ultra GPUs achieved the highest throughput across all new workloads. The company highlights a 2.7x performance jump on DeepSeek-R1 in the server scenario compared to its first submission six months ago, achieved on the same hardware through software optimizations alone. Nvidia partner Nebius delivered this result. According to Nvidia, the improvement cuts token production costs by more than 60 percent.
These gains came from a series of software-level tweaks. Basic compute operations were sped up and fused together so GPUs spend less time on overhead. The open-source framework Nvidia Dynamo separates the two phases of text generation (processing the input and generating new tokens) and optimizes each independently.
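The split Dynamo makes can be shown with a toy sketch. This is an illustration of the general disaggregation idea, not Nvidia Dynamo's actual API: prefill is compute-bound and processes the whole prompt in one batched pass, while decode is memory-bound and generates one token per step against the cached attention state, so scheduling them separately lets each phase be optimized on its own terms.

```python
# Toy sketch of disaggregated inference (illustrative, not Dynamo's API).
# Prefill builds the KV cache from the full prompt in one pass; decode
# then extends it one token at a time.

def prefill(prompt_tokens):
    """Compute-bound phase: process all prompt tokens at once, return a KV cache."""
    kv_cache = [("kv", t) for t in prompt_tokens]  # stand-in for attention state
    return kv_cache

def decode(kv_cache, max_new_tokens):
    """Memory-bound phase: generate one token per step, reusing the KV cache."""
    output = []
    for step in range(max_new_tokens):
        token = f"tok{step}"            # stand-in for a sampled token
        kv_cache.append(("kv", token))  # cache grows with each generated token
        output.append(token)
    return output

cache = prefill(["The", "quick", "brown"])
print(decode(cache, 3))  # ['tok0', 'tok1', 'tok2']
```

In a real serving stack the two functions would run on separate GPU pools with the KV cache transferred between them; the sketch only captures the division of labor.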
For models like DeepSeek-R1 that only activate a subset of their parameters per request, Wide Expert Parallel distributes expert weights across more GPUs so no single card becomes a bottleneck. When batch sizes are small in interactive scenarios and compute power sits idle, Multi-Token Prediction generates multiple tokens at once instead of just one. Even on the older Llama 3.1 405B, server performance improved by 1.5x, according to Nvidia.
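The load-spreading logic behind expert parallelism can be sketched in a few lines. This is a hypothetical top-k router with made-up sizes (8 experts, top-2, 4 devices), not Nvidia's Wide Expert Parallel implementation: each token activates only its k highest-scoring experts, and mapping experts across more devices spreads that per-token work so no single card carries a disproportionate share.

```python
# Hedged sketch of top-k expert routing in a mixture-of-experts layer.
# All sizes here are illustrative assumptions.
import random

NUM_EXPERTS, TOP_K, NUM_DEVICES = 8, 2, 4
# Round-robin placement of experts onto devices.
expert_to_device = {e: e % NUM_DEVICES for e in range(NUM_EXPERTS)}

def route(token_scores):
    """Pick the top-k experts for one token from its router scores."""
    return sorted(range(NUM_EXPERTS), key=lambda e: -token_scores[e])[:TOP_K]

random.seed(0)
tokens = [[random.random() for _ in range(NUM_EXPERTS)] for _ in range(16)]
load = {d: 0 for d in range(NUM_DEVICES)}
for scores in tokens:
    for expert in route(scores):
        load[expert_to_device[expert]] += 1

print(load)  # per-device expert workload; more devices means a lower peak load
```

Each token triggers exactly TOP_K expert calls, so total work is fixed; what changes with more devices is how evenly that work lands.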
In the largest configuration ever submitted to MLPerf Inference, Nvidia connected four GB300-NVL72 systems with a total of 288 GPUs over Quantum-X800 InfiniBand. The result: roughly 2.49 million tokens per second on DeepSeek-R1 in the offline scenario. Fourteen partners submitted results on the Nvidia platform, the most of any platform in this round. Nvidia puts its cumulative MLPerf wins since 2018 at 291 - nine times more than all other submitters combined.
AMD closes the single-node gap and crosses one million tokens per second
According to AMD's blog post, the Instinct MI355X on CDNA 4 architecture with 3 nm manufacturing and up to 288 GB HBM3E crossed the one-million-tokens-per-second mark in MLPerf for the first time - though with multi-node scaling using up to 94 GPUs on Llama 2 70B and GPT-OSS-120B. Compared to the previous-generation MI325X, AMD says the MI355X delivers a 3.1x throughput jump on the Llama 2 70B server benchmark.
The most direct comparison comes in single-node setups with eight GPUs each. AMD says the MI355X matched Nvidia's B200 on Llama 2 70B in the offline scenario, hit 97 percent in the server scenario, and reached 119 percent of B200 performance in the interactive scenario. Against the newer B300, those numbers came in at 92, 93, and 104 percent, respectively. On GPT-OSS-120B, AMD says the MI355X beat the B200 by 11 and 15 percent in offline and server mode, but trailed the B300 at 91 and 82 percent.
Two important caveats apply here. First, AMD didn't submit results for the significantly larger DeepSeek-R1 with its MoE architecture, precisely where Nvidia posts its strongest numbers. Second, AMD's submission for the text-to-video model Wan-2.2 landed in the Open division rather than the Closed division, which formally limits direct comparability. AMD also cites post-deadline results that reportedly reached 108 percent of B200 performance, but notes these numbers weren't verified by MLCommons.
Multi-node scaling across 11 nodes achieved 93 to 98 percent efficiency, according to AMD. Also noteworthy is the first-ever heterogeneous MLPerf submission: Dell and MangoBoost combined MI300X, MI325X, and MI355X GPUs across sites in the US and Korea, hitting roughly 142,000 tokens per second on Llama 2 70B in server mode. Nine partners submitted results on AMD hardware, with scores within four percent of AMD's own measurements.
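AMD's 93 to 98 percent figure follows the standard definition of scaling efficiency: measured multi-node throughput divided by the ideal of the node count times single-node throughput. The numbers below are hypothetical placeholders to show the arithmetic, not AMD's published results.

```python
# Back-of-the-envelope scaling efficiency:
# efficiency = multi_node_throughput / (num_nodes * single_node_throughput)

def scaling_efficiency(multi_node_tps, num_nodes, single_node_tps):
    """Fraction of ideal linear scaling actually achieved."""
    return multi_node_tps / (num_nodes * single_node_tps)

# Hypothetical example: 11 nodes, 100,000 tokens/s per node in isolation,
# 1,050,000 tokens/s measured across the cluster.
eff = scaling_efficiency(1_050_000, 11, 100_000)
print(f"{eff:.1%}")  # 95.5%
```

Values near 100 percent mean the interconnect and scheduling overhead of going multi-node costs almost nothing, which is what makes AMD's reported range notable.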
Intel skips the data center fight, targets workstations and edge instead
Intel takes a fundamentally different approach. Rather than competing with Nvidia and AMD in the data center, Intel showcases its Arc Pro B70 and B65 GPUs alongside Xeon 6 processors as an inference platform for workstations and edge systems. A system with four Arc Pro B70 cards provides 128 GB of VRAM and can run 120-billion-parameter models with high parallelism, according to Intel. The Arc Pro B70 delivers up to 1.8x the inference performance of the Arc Pro B60.
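Why 128 GB of pooled VRAM can hold a 120-billion-parameter model comes down to bytes per parameter. The quantization levels below are illustrative assumptions, not Intel's published configuration, and the sketch ignores KV cache and activation memory, which add real overhead on top of the weights.

```python
# Rough weight-memory estimate: params * bytes_per_param.
# Illustrative arithmetic only; ignores KV cache and activations.

def weight_footprint_gb(params_billion, bytes_per_param):
    """Approximate weight storage in decimal gigabytes."""
    return params_billion * 1e9 * bytes_per_param / 1e9

print(weight_footprint_gb(120, 0.5))  # 60.0 GB at 4-bit quantization
print(weight_footprint_gb(120, 1.0))  # 120.0 GB at 8-bit, near the 128 GB limit
print(weight_footprint_gb(120, 2.0))  # 240.0 GB at FP16, which would not fit
```

The arithmetic suggests such a setup would rely on quantized weights, with headroom for the KV cache shrinking as precision increases.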
Software optimizations on the same B60 hardware reportedly delivered up to a 1.18x performance improvement over MLPerf v5.1. Intel emphasizes that it's the only server processor maker submitting standalone CPU results for MLPerf Inference. More than half of all submissions in MLPerf v6.0 use Xeon as the host CPU.
Why these results don't produce a simple ranking
The results show that while MLPerf Inference remains the industry's most important AI inference benchmark, it doesn't produce a straightforward leaderboard. Nvidia demonstrates the broadest coverage of new benchmarks and the highest absolute throughput at scale, as it has for years. In single-node setups, however, AMD claims comparable or higher scores than Nvidia's B200 in several scenarios while covering fewer benchmarks. Intel serves a different market entirely.
On top of that, each chipmaker naturally highlights the scenarios and configurations where its own products perform best. AMD's percentage comparisons against Nvidia's B200 and B300 represent the most transparent head-to-head data available, but they only apply to the models and scenarios AMD actually submitted. Nvidia's scaling results with 288 GPUs have no AMD counterpart. And Nvidia's 2.7x software improvement and AMD's 3.1x generational leap measure fundamentally different things: pure software optimization on the same hardware versus a new chip architecture.
Nvidia pushes for a new benchmark that measures real-world API performance
A step toward better comparability could come with the upcoming MLPerf Endpoints benchmark. Nvidia says in its blog post that it is driving the definition of this benchmark within the MLCommons consortium. The reasoning: current tests measure the throughput of individual chips and systems under standardized conditions but don't capture how an inference service actually performs under real API traffic. With the rise of agentic AI systems that demand especially fast token rates, Nvidia argues, the need for measurement methods that go beyond pure chip benchmarks is growing. This naturally plays to Nvidia's strengths: the company recently unveiled Vera Rubin, a system designed specifically for these workloads.
According to Nvidia, MLPerf Endpoints would give the community a verifiable picture of how deployed services actually perform under realistic load. The goal is to capture metrics that hardware benchmarks alone can't reveal, such as latency variability, throughput under concurrent requests, and overall infrastructure efficiency.
AMD, meanwhile, points to its planned MI400 series on CDNA 5 architecture and the Helios rack-scale solution for 2026. The competition for the most efficient AI inference is set to intensify further.