5 min read

Microsoft has achieved an AI inference speed record of 1.1 million tokens per second on its Azure ND GB300 v6 virtual machines. Running Meta’s Llama 2 70B model with NVIDIA’s latest GB300 NVL72 rack, Azure demonstrated performance once thought impossible.
This milestone highlights how cloud-scale optimization and hardware co-innovation with NVIDIA are redefining what’s achievable for enterprise AI workloads in production environments.

At the heart of this breakthrough is NVIDIA’s Blackwell Ultra GPU architecture. A single GB300 NVL72 rack houses 72 Blackwell Ultra GPUs and 36 NVIDIA Grace CPUs, exposed as 18 VMs of four GPUs and two Grace CPUs each.
Each of the 72 GPUs sustained roughly 15,200 tokens per second (±5%) in the benchmark.
Microsoft fine-tuned its Azure VMs for inference workloads, combining advanced memory bandwidth, high-efficiency NVLink connectivity, and the TensorRT-LLM inference library to deliver record-breaking throughput.
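As a back-of-the-envelope sanity check, the per-GPU and per-rack figures quoted above line up (all numbers are taken from this article):

```python
# Rough check that the per-GPU rate times the GPU count reproduces
# the headline aggregate figure. Numbers are from the article.
gpus_per_rack = 72
tokens_per_gpu_per_s = 15_200  # approximate per-GPU rate (±5%)

aggregate = gpus_per_rack * tokens_per_gpu_per_s
print(f"{aggregate:,} tokens/s")  # ≈ 1,094,400, i.e. roughly 1.1 million
```

Within the stated ±5% per-GPU variance, that lands squarely on the 1.1 million tokens/sec record.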

Azure’s ND GB300 v6 setup crushed its previous benchmark of 865,000 tokens per second, posting a 27% performance gain over the ND GB200 v6 system. The test used the FP4 precision low-bit quantization format, which maintains accuracy while significantly boosting speed.
Independent benchmarking firm Signal65 validated the results, calling them “definitive proof that large-scale AI is now a reliable and efficient utility.”
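The quoted 27% improvement also checks out against the two record figures given above:

```python
# Verify the quoted ~27% improvement over the previous ND GB200 v6
# record, using the throughput figures stated in the article.
prev_record = 865_000    # tokens/s on ND GB200 v6
new_record = 1_100_000   # tokens/s on ND GB300 v6

gain = (new_record - prev_record) / prev_record
print(f"{gain:.1%}")  # ≈ 27.2%
```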

Crossing one million tokens per second is a significant infrastructure milestone. In practice, it means a single rack can serve thousands of users interacting with a large language model simultaneously and in real time, making AI tools more scalable and accessible than ever.

Each ND GB300 v6 VM hosts four NVIDIA GB300 GPUs, each with 279 GB GPU memory and a 1,400 W per-GPU power limit. These VMs run on racks engineered for extreme workloads, moving data between components through NVLink C2C channels four times faster than prior designs.
With 7.37 TB/s of HBM (high-bandwidth memory) throughput at 92% efficiency, the system minimizes bottlenecks even under heavy AI inference loads.
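Working backwards from those two figures, and assuming the 92% efficiency is measured against theoretical peak memory bandwidth, gives a rough sense of the hardware ceiling:

```python
# The article quotes 7.37 TB/s of achieved HBM throughput at 92%
# efficiency. Assuming efficiency means achieved/theoretical-peak,
# we can back out the approximate peak bandwidth.
achieved_tb_s = 7.37
efficiency = 0.92

theoretical_peak = achieved_tb_s / efficiency
print(f"{theoretical_peak:.2f} TB/s theoretical peak")  # ≈ 8.01 TB/s
```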

Hardware alone can’t reach 1.1 million tokens/sec. Azure used NVIDIA TensorRT-LLM, a purpose-built inference engine that fuses graph optimizations, dynamic batching, and quantization techniques.
Combined with Microsoft’s cloud orchestration tools, it created a runtime that continuously balances performance across 72 GPUs, ensuring maximum utilization and stability during high-throughput AI tasks.
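Dynamic batching, one of the TensorRT-LLM techniques named above, can be sketched in miniature. This is an illustrative simplification, not TensorRT-LLM’s actual API: the point is that requests join and leave the running batch at token granularity instead of waiting behind a full-batch barrier, which keeps GPUs busy.

```python
# Toy sketch of dynamic (continuous) batching. Requests are admitted
# as soon as a batch slot frees up, and finished requests leave
# immediately, so short and long generations can share the GPU.
from collections import deque

def continuous_batching(requests, max_batch=4):
    """requests: list of (request_id, tokens_to_generate)."""
    waiting = deque(requests)
    active = {}        # request_id -> tokens still to generate
    schedule = []      # per-step record of which requests ran together
    while waiting or active:
        # Admit new requests whenever a slot is free (no batch barrier).
        while waiting and len(active) < max_batch:
            rid, n = waiting.popleft()
            active[rid] = n
        schedule.append(sorted(active))
        # Each active request generates one token this step.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # finished requests exit mid-batch
    return schedule

steps = continuous_batching([("a", 2), ("b", 5), ("c", 1), ("d", 3)],
                            max_batch=2)
# Step 3 runs ["b", "c"]: "c" slipped in the moment "a" finished,
# rather than waiting for "b" to drain.
```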

Using FP4 precision was key to achieving this breakthrough. It reduces the number of bits used per operation without compromising accuracy, slashing computation time and power draw.
This efficiency enables faster responses from large models, such as Llama 2 70B, while maintaining high-quality output. For production AI, FP4 could become the new gold standard for inference optimization.
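The core idea behind FP4 can be shown with a toy example: map full-precision weights onto a tiny set of representable values plus a scale factor. Production FP4 (e.g. NVIDIA’s hardware formats) uses block-wise scaling and a hardware-defined value grid; the grid and per-tensor scale below are an illustrative sketch, not the exact production scheme.

```python
# Toy illustration of low-bit quantization in the spirit of FP4.
# A 4-bit float has only 16 codes; this symmetric E2M1-style grid
# is illustrative, not the exact hardware format.
FP4_GRID = [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0,
            0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize(weights):
    # Per-tensor scale so the largest weight maps to the grid maximum.
    scale = max(abs(w) for w in weights) / 6.0
    q = [min(FP4_GRID, key=lambda g: abs(w / scale - g)) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [g * scale for g in q]

w = [0.12, -0.7, 0.33, 1.4]
q, s = quantize(w)
approx = dequantize(q, s)  # close to w, but each weight now fits in 4 bits
```

Storing 4 bits per weight instead of 16 cuts memory traffic roughly fourfold, which is why this kind of format pays off so heavily on memory-bandwidth-bound inference.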

Performance benchmarks can be controversial, but Microsoft made this one transparent. Every log, configuration, and throughput figure was shared with Signal65, whose engineers verified the 1.1 million tokens/sec claim.
The validation process ensured the record wasn’t just an internal milestone; it is now recognized as a trusted industry benchmark for cloud AI performance.

The ND GB300 v6 isn’t just faster; it’s smarter. The system integrates real-time telemetry and load balancing, enabling Azure to reroute workloads and prevent slowdowns automatically.
This design maintains smooth inference even during spikes in demand, a critical advantage for enterprises running global generative AI applications that rely on 24/7 responsiveness.

Speed isn’t only about bragging rights; it’s also about efficiency. A 27% gain in throughput means customers can deploy fewer racks to achieve the same performance, reducing operational costs and energy usage.
That makes AI inference more affordable and sustainable, expanding access for organizations that once couldn’t justify large-scale model deployment.
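The rack-count claim is easy to make concrete. The 10 million tokens/sec demand figure below is a hypothetical example, not a number from the article; only the two per-rack rates are:

```python
import math

# Capacity planning at the two per-rack rates quoted in the article.
# The fleet-wide demand target is a hypothetical illustration.
target = 10_000_000      # tokens/s of total demand (hypothetical)
gb200_rack = 865_000     # tokens/s per ND GB200 v6 rack
gb300_rack = 1_100_000   # tokens/s per ND GB300 v6 rack

racks_old = math.ceil(target / gb200_rack)  # 12 racks
racks_new = math.ceil(target / gb300_rack)  # 10 racks
```

At this scale the throughput gain translates into two fewer racks for the same load, and the savings compound as demand grows.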

This record underscores the deep collaboration between Microsoft and NVIDIA. Their partnership blends NVIDIA’s world-class hardware innovation with Azure’s scalable cloud infrastructure.
Satya Nadella called the result “an industry record made possible by co-innovation.” It’s proof that when one of the world’s largest clouds meets the world’s fastest GPUs, the limits of AI continue to move forward.

This milestone illustrates how the next generation of AI will be built not on isolated systems, but on global cloud architectures capable of handling colossal workloads efficiently.
As Microsoft continues to refine its Azure AI stack, these breakthroughs will shape the foundation for the AI-driven economy, powering innovation from startups to supercomputers.

Microsoft’s 1.1 million tokens/sec achievement isn’t just about numbers, but about unlocking a new era of possibility. It demonstrates that cloud infrastructure can now keep pace with human thought.
For developers, researchers, and enterprises alike, Azure’s latest record sends a clear message: the future of AI speed, scale, and reliability has already arrived.
This slideshow was made with AI assistance and human editing.