
Microsoft’s Azure cloud just smashed an AI inference record at 1.1M tokens/sec


Azure sets a new benchmark for AI performance

Microsoft has achieved an AI inference speed record of 1.1 million tokens per second on its Azure ND GB300 v6 virtual machines. Running Meta’s Llama 2 70B model with NVIDIA’s latest GB300 NVL72 rack, Azure demonstrated performance once thought impossible.

This milestone highlights how cloud-scale optimization and hardware co-innovation with NVIDIA are redefining what’s achievable for enterprise AI workloads in production environments.


A powerful blend of hardware and software innovation

At the heart of this breakthrough is NVIDIA’s Blackwell Ultra GPU architecture. A single GB300 NVL72 rack pairs 72 of these GPUs with 36 NVIDIA Grace CPUs, partitioned into 18 virtual machines of four GPUs and two Grace CPUs each.

In the benchmark run, each of the 72 GPUs sustained roughly 15,200 tokens per second (±5%).

Microsoft fine-tuned its Azure VMs for inference workloads, combining advanced memory bandwidth, high-efficiency NVLink connectivity, and the TensorRT-LLM inference library to deliver record-breaking throughput.


How Microsoft shattered its own record

Azure’s ND GB300 v6 setup crushed its previous benchmark of 865,000 tokens per second, posting a 27% performance gain over the ND GB200 v6 system. The test used FP4, a low-bit quantization format that maintains accuracy while significantly boosting speed.
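
As a quick sanity check, those headline numbers hang together: the short Python sketch below (illustrative only, using the rounded figures quoted in this article) reproduces both the aggregate throughput and the roughly 27% gain.

```python
# Back-of-the-envelope check of the published figures (illustrative only).
gpus = 72
per_gpu_tps = 15_200            # approximate tokens/sec per GPU quoted earlier
previous_record_tps = 865_000   # prior ND GB200 v6 result

aggregate_tps = gpus * per_gpu_tps
gain = aggregate_tps / previous_record_tps - 1

print(f"Aggregate throughput: {aggregate_tps:,} tokens/sec")  # ~1,094,400
print(f"Gain over previous record: {gain:.0%}")                # ~27%
```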

Independent benchmarking firm Signal65 validated the results, calling them “definitive proof that large-scale AI is now a reliable and efficient utility.”


What the one-million-token milestone really means

Crossing one million tokens per second is a significant infrastructure milestone. In principle, it could enable real-time inference for multiple simultaneous users, making large-model deployment more feasible at scale.

In practice, throughput at that level is enough to let thousands of users interact simultaneously with large language models, making AI tools more scalable and accessible than ever.
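
To put that in perspective, here is a rough, hedged estimate of concurrency. Assuming each chat session streams at a comfortable reading pace of about 30 tokens per second (an assumption for illustration, not a figure from the benchmark), the rack-level throughput works out to tens of thousands of simultaneous sessions:

```python
# Rough concurrency estimate; the per-user rate is an assumption, not a benchmark figure.
aggregate_tps = 1_100_000   # tokens/sec across the whole rack
per_user_tps = 30           # assumed streaming rate for a comfortable chat session

concurrent_sessions = aggregate_tps // per_user_tps
print(f"~{concurrent_sessions:,} simultaneous streaming sessions")  # ~36,000
```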


The engine behind the record

Each ND GB300 v6 VM hosts four NVIDIA GB300 GPUs, each with 279 GB of GPU memory and a 1,400 W per-GPU power limit. These VMs run on racks engineered for extreme workloads, moving data between components through NVLink C2C channels that are four times faster than prior designs.

With 7.37 TB/s of HBM (high-bandwidth memory) throughput at roughly 92% efficiency, the system minimizes memory bottlenecks even under heavy AI inference loads.
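
Why does memory bandwidth matter so much? Generating each output token requires streaming the model weights out of HBM, which makes decoding largely memory-bound. The sketch below is a simplified roofline estimate under stated assumptions (FP4 weights, one model replica per 4-GPU VM, an assumed batch size, KV-cache traffic ignored); it is meant to show the shape of the calculation, not to reproduce the benchmark configuration.

```python
# Simplified roofline sketch: decode throughput bounded by HBM bandwidth.
# Every value below is an illustrative assumption; KV-cache traffic and overheads are ignored.
params = 70e9                  # Llama 2 70B parameters
bytes_per_param = 0.5          # FP4 ~ 4 bits per weight
weight_bytes = params * bytes_per_param            # ~35 GB per model replica

gpus_per_replica = 4           # assume one replica spans a 4-GPU VM
hbm_bw_per_gpu = 7.37e12       # bytes/sec, the figure quoted above
efficiency = 0.92
replica_bw = gpus_per_replica * hbm_bw_per_gpu * efficiency

# Each decode step streams the weights once regardless of batch size,
# so batching amortizes the memory traffic across many sequences.
steps_per_sec = replica_bw / weight_bytes          # ~775 decode steps/sec
batch_size = 64                                    # assumed number of in-flight sequences
tokens_per_sec_per_gpu = steps_per_sec * batch_size / gpus_per_replica

print(f"~{tokens_per_sec_per_gpu:,.0f} tokens/sec per GPU")  # ~12,400 -- same order as the record
```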


The software layer makes the difference

Hardware alone can’t reach 1.1 million tokens/sec. Azure used NVIDIA TensorRT-LLM, a purpose-built inference engine that fuses graph optimizations, dynamic batching, and quantization techniques.

Combined with Microsoft’s cloud orchestration tools, it created a runtime that continuously balances performance across 72 GPUs, ensuring maximum utilization and stability during high-throughput AI tasks.
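
Dynamic, or continuous, batching is one of the scheduling ideas behind that utilization: rather than waiting for a fixed batch to fill, the server admits and retires requests at every decode step. The toy Python loop below sketches the concept in the abstract; it is not TensorRT-LLM code, and every name in it is a placeholder.

```python
from collections import deque

# Toy sketch of continuous (in-flight) batching -- not TensorRT-LLM code;
# the names here are placeholders used only to illustrate the scheduling idea.
MAX_BATCH = 64
waiting = deque()   # requests that have arrived but not yet started decoding
active = []         # requests currently in the batch

def decode_step(batch):
    """Stand-in for one fused forward pass that emits one token per active request."""
    for req in batch:
        req["generated"] += 1

def scheduler_step():
    # Admit new requests the moment a slot opens, instead of waiting for a full batch.
    while waiting and len(active) < MAX_BATCH:
        active.append(waiting.popleft())
    if active:
        decode_step(active)
    # Retire finished requests immediately so their slots are reused next step.
    active[:] = [r for r in active if r["generated"] < r["max_tokens"]]

# Example: enqueue a request and run a few steps.
waiting.append({"generated": 0, "max_tokens": 128})
for _ in range(3):
    scheduler_step()
```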


Why the FP4 precision mode matters

Using FP4 precision was key to achieving this breakthrough. It cuts the number of bits used to represent weights and activations, slashing computation time and power draw without compromising accuracy.

This efficiency enables faster responses from large models, such as Llama 2 70B, while maintaining high-quality output. For production AI, FP4 could become the new gold standard for inference optimization.
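
To make the idea concrete, the NumPy sketch below quantizes weights to a 4-bit integer grid with a per-block scale and then reconstructs them. Real FP4 formats such as NVFP4 use a 4-bit floating-point encoding with shared scale factors, so treat this purely as an illustration of how much signal a handful of bits can preserve.

```python
import numpy as np

# Illustrative 4-bit block quantization -- not the exact FP4/NVFP4 encoding.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=4096).astype(np.float32)

def quantize_blocks(w, block=32):
    """Quantize each block to signed 4-bit integers in [-7, 7] with a per-block scale."""
    w = w.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_blocks(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

q, scales = quantize_blocks(weights)
restored = dequantize_blocks(q, scales)
rel_error = np.linalg.norm(weights - restored) / np.linalg.norm(weights)
print(f"Relative reconstruction error: {rel_error:.1%}")  # roughly 10% in this toy setup
```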


Independent validation adds credibility

Performance benchmarks can be controversial, but Microsoft made this one transparent. Every log, configuration, and throughput figure was shared with Signal65, whose engineers verified the 1.1 million tokens/sec claim.

The validation process ensured the record wasn’t just an internal milestone; it is now recognized as a trusted industry benchmark for cloud AI performance.


Azure’s architecture keeps evolving

The ND GB300 v6 isn’t just faster; it’s smarter about how it runs. The system integrates real-time telemetry and load balancing, enabling Azure to reroute workloads and prevent slowdowns automatically.

This design maintains smooth inference even during spikes in demand, a critical advantage for enterprises running global generative AI applications that rely on 24/7 responsiveness.


How this changes enterprise AI economics

Speed isn’t only about bragging rights; it’s also about efficiency. A 27% gain in throughput means customers can deploy fewer racks to achieve the same performance, reducing operational costs and energy usage.

That makes AI inference more affordable and sustainable, expanding access for organizations that once couldn’t justify large-scale model deployment.
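
As a rough, hedged illustration of that economics argument: if a deployment has to hit a fixed total throughput, a faster rack translates directly into fewer racks. The target figure below is an arbitrary example, not a customer number.

```python
import math

# Illustrative capacity planning -- the target throughput is an arbitrary example.
target_tps = 10_000_000    # hypothetical fleet-wide requirement, tokens/sec
old_rack_tps = 865_000     # ND GB200 v6 per-rack result
new_rack_tps = 1_100_000   # ND GB300 v6 per-rack result

old_racks = math.ceil(target_tps / old_rack_tps)   # 12 racks
new_racks = math.ceil(target_tps / new_rack_tps)   # 10 racks
print(f"{old_racks} racks before vs. {new_racks} racks after the upgrade")
```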


Microsoft and NVIDIA push boundaries together

This record underscores the deep collaboration between Microsoft and NVIDIA. Their partnership blends NVIDIA’s world-class hardware innovation with Azure’s scalable cloud infrastructure.

Satya Nadella called the result “an industry record made possible by co-innovation.” It’s proof that when one of the world’s largest clouds meets the world’s fastest GPUs, the limits of AI keep moving forward.


The future of generative AI at scale

This milestone illustrates how the next generation of AI will be built not on isolated systems, but on global cloud architectures capable of handling colossal workloads efficiently.

As Microsoft continues to refine its Azure AI stack, these breakthroughs will shape the foundation for the AI-driven economy, powering innovation from startups to supercomputers.

Learn how a recent security flaw reminded users that even the strongest clouds need vigilance. A Microsoft Entra flaw let hackers access any account; patch now.


A record that redefines what’s possible

Microsoft’s 1.1 million tokens/sec achievement isn’t just about the numbers; it’s about unlocking a new era of possibility. It shows that cloud infrastructure can now deliver large-model responses at speeds that feel instantaneous to the people using them.

For developers, researchers, and enterprises alike, Azure’s latest record sends a clear message: the future of AI speed, scale, and reliability has already arrived.

See how Microsoft’s next big move could bring Elon Musk’s Grok AI to Azure in Microsoft Azure Will Soon Host Elon Musk’s Grok AI Model.

What do you think of Microsoft Azure publishing per-second workload figures like this record? Please share your thoughts and drop a comment.
