NVIDIA Rubin CPX Accelerates Inference Performance and Efficiency for 1M+ Token Context Workloads

jwitsoe · September 9, 2025, 3:00pm

Originally published at: NVIDIA Rubin CPX Accelerates Inference Performance and Efficiency for 1M+ Token Context Workloads | NVIDIA Technical Blog

Inference has emerged as the new frontier of complexity in AI. Modern models are evolving into agentic systems capable of multi-step reasoning, persistent memory, and long-horizon context—enabling them to tackle complex tasks across domains such as software development, video generation, and deep research. These workloads place unprecedented demands on infrastructure, introducing new challenges in compute,…

Related topics

Topic	Replies	Views	Activity
Optimize AI Inference Performance with NVIDIA Full-Stack Solutions Technical Blog	1	44	January 24, 2025
Low Latency Inference Chapter 2: Blackwell is Coming. NVIDIA GH200 NVL32 with NVLink Switch Gives Signs of Big Leap in Time to First Token Performance Technical Blog	1	46	September 27, 2024
Introducing NVIDIA Dynamo, A Low-Latency Distributed Inference Framework for Scaling Reasoning AI Models Technical Blog	3	151	May 20, 2025
NVIDIA H200 Tensor Core GPUs and NVIDIA TensorRT-LLM Set MLPerf LLM Inference Records Technical Blog	1	279	March 27, 2024
NVIDIA Blackwell Platform Sets New LLM Inference Records in MLPerf Inference v4.1 Technical Blog	2	52	August 28, 2024
NVIDIA TensorRT-LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200 Technical Blog llama	2	48	November 27, 2024
NVIDIA Dynamo Accelerates llm-d Community Initiatives for Advancing Large-Scale Distributed Inference Technical Blog	1	40	May 21, 2025
Spotlight: Perplexity AI Serves 400 Million Search Queries a Month Using NVIDIA Inference Stack Technical Blog	1	32	December 5, 2024
Leading MLPerf Inference v3.1 Results with NVIDIA GH200 Grace Hopper Superchip Debut Technical Blog	1	460	October 3, 2023
NVIDIA Blackwell Ultra for the Era of AI Reasoning Technical Blog	1	40	March 19, 2025