NVIDIA Accelerates OpenAI gpt-oss Models from Cloud to Edge

We accelerated OpenAI’s new open-weight models – gpt-oss-20b and gpt-oss-120b – for leading inference performance on the NVIDIA Blackwell architecture, delivering up to 1.5 million tokens per second on an NVIDIA GB200 NVL72 system. ⏱️🏁
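For scale, that headline number works out to roughly 21 thousand tokens per second per GPU. A quick back-of-envelope sketch, assuming throughput is attributed evenly across the 72 Blackwell GPUs in an NVL72 rack:

```python
# Back-of-envelope per-GPU throughput (assumption: even split across GPUs).
TOTAL_TOKENS_PER_S = 1_500_000  # headline figure from the post
NUM_GPUS = 72                   # GPUs in one GB200 NVL72 NVLink domain

per_gpu = TOTAL_TOKENS_PER_S / NUM_GPUS
print(f"~{per_gpu:,.0f} tokens/s per GPU")  # ~20,833 tokens/s per GPU
```

Real deployments split work unevenly across prefill, decode, and expert-parallel ranks, so this is only a rough sanity check on the aggregate claim.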

The models were trained on NVIDIA H100 Tensor Core GPUs: gpt-oss-120b required more than 2.1 million GPU-hours, and gpt-oss-20b about 10x fewer. NVIDIA worked with leading open-source frameworks – Hugging Face Transformers, Ollama, and vLLM – in addition to NVIDIA TensorRT-LLM, contributing optimized kernels and model enhancements.

We integrated gpt-oss across the NVIDIA software platform to meet developers’ needs, and worked with OpenAI and the community to maximize performance, adding features such as:

✅ TensorRT-LLM Gen for attention prefill, attention decode, and low-latency MoE on Blackwell.
✅ CUTLASS MoE kernels on Blackwell.
✅ XQA kernel for specialized attention on Hopper.
✅ Optimized attention and MoE routing kernels, available through FlashInfer, a kernel-serving library for LLMs.
✅ OpenAI Triton kernel support for MoE, used in both TensorRT-LLM and vLLM.

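Whichever of the frameworks above serves the model, most of them expose an OpenAI-compatible endpoint, so querying gpt-oss is a standard chat-completions call. A minimal sketch of building such a request – the model name, endpoint URL, and `build_chat_request` helper are illustrative assumptions, not part of any framework's API:

```python
import json

def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> str:
    """Build a JSON body for a /v1/chat/completions call (hypothetical helper)."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

# Model identifier assumed here for illustration.
body = build_chat_request("openai/gpt-oss-20b",
                          "Explain MoE routing in one sentence.")
print(body)

# Once a server is running locally, POST this body to its chat endpoint, e.g.:
#   requests.post("http://localhost:8000/v1/chat/completions",
#                 data=body, headers={"Content-Type": "application/json"})
```

Because the request shape follows the OpenAI chat-completions convention, the same client code works across serving backends that implement it.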
Search the NVIDIA Technical Blog for details on how to get started.