NVIDIA Accelerates OpenAI gpt-oss Models from Cloud to Edge

We accelerated OpenAI’s new open-weight models – gpt-oss-20b and gpt-oss-120b – for leading inference performance on the NVIDIA Blackwell architecture, delivering up to 1.5 million tokens per second on an NVIDIA GB200 NVL72 system. ⏱️🏁
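For scale, that headline number works out to roughly 21 thousand tokens per second per GPU. A quick back-of-envelope sketch, assuming throughput is attributed evenly across the 72 Blackwell GPUs in an NVL72 rack:

```python
# Back-of-envelope per-GPU throughput (assumption: even split across GPUs).
TOTAL_TOKENS_PER_S = 1_500_000  # headline figure from the post
NUM_GPUS = 72                   # GPUs in one GB200 NVL72 NVLink domain

per_gpu = TOTAL_TOKENS_PER_S / NUM_GPUS
print(f"~{per_gpu:,.0f} tokens/s per GPU")  # ~20,833 tokens/s per GPU
```

Real deployments split work unevenly across prefill, decode, and expert-parallel ranks, so this is only a rough sanity check on the aggregate claim.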

The models were trained on NVIDIA H100 Tensor Core GPUs: gpt-oss-120b required more than 2.1 million GPU-hours, and gpt-oss-20b about 10x fewer. NVIDIA worked with leading open-source frameworks – Hugging Face Transformers, Ollama, and vLLM – in addition to NVIDIA TensorRT-LLM, contributing optimized kernels and model enhancements.

We integrated gpt-oss across the NVIDIA software platform to meet developers’ needs, and worked with OpenAI and the community to maximize performance, adding features such as:

✅ TensorRT-LLM Gen for attention prefill, attention decode, and low-latency MoE on Blackwell.
✅ CUTLASS MoE kernels on Blackwell.
✅ XQA kernel for specialized attention on Hopper.
✅ Optimized attention and MoE routing kernels, available through FlashInfer, a kernel-serving library for LLMs.
✅ OpenAI Triton kernel support for MoE, used in both TensorRT-LLM and vLLM.

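Whichever of the frameworks above serves the model, most of them expose an OpenAI-compatible endpoint, so querying gpt-oss is a standard chat-completions call. A minimal sketch of building such a request – the model name, endpoint URL, and `build_chat_request` helper are illustrative assumptions, not part of any framework's API:

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> str:
    """Build a JSON body for a /v1/chat/completions call (hypothetical helper)."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

# Model identifier assumed here for illustration.
body = build_chat_request("openai/gpt-oss-20b",
                          "Explain MoE routing in one sentence.")
print(body)

# Once a server is running locally, POST this body to its chat endpoint, e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions",
#                 data=body, headers={"Content-Type": "application/json"})
```

Because the request shape follows the OpenAI chat-completions convention, the same client code works across serving backends that implement it.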
Search the NVIDIA Technical Blog for details on how to get started.