Software Migration Guide for NVIDIA Blackwell RTX GPUs: A Guide to CUDA 12.8, PyTorch, TensorRT, and Llama.cpp

Applications must update to the latest AI frameworks to ensure compatibility with NVIDIA Blackwell RTX GPUs. This guide describes the updates required in the core software libraries to ensure compatibility and optimal performance on these GPUs.

CUDA 12.8

Running any NVIDIA CUDA workload on NVIDIA Blackwell requires a compatible driver (R570 or higher).

  • If your application bundles PTX, your code will just-in-time (JIT) compile for Blackwell. To ensure the best performance, we recommend rebuilding your application with CUDA Toolkit 12.8 or newer.
  • Otherwise (no bundled PTX), your application may fail or behave unpredictably on Blackwell, and you should update to CUDA Toolkit 12.8.

If you fall into the latter case, you can either update your application to bundle PTX for a recent virtual architecture, or recompile it with CUDA Toolkit 12.8.

CUDA 12.8 is the first CUDA version that natively supports Blackwell (compute capabilities 10.0 and 12.0). Applications built with CUDA Toolkit 12.8 will run on any driver R525 or higher thanks to CUDA's minor version compatibility guarantees: applications built with any toolkit in the 12.x major release will run against any driver released in that window, unless the application uses specific APIs tied to newer drivers (which is not common). Refer to the CUDA Compatibility document for the latest information on CUDA compatibility, and to the NVIDIA CUDA Compiler Driver NVCC documentation for details on how CUDA compilation works.
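
As a quick sanity check when bringing up Blackwell support, the CUDA runtime API can report which CUDA version the installed driver supports, which runtime the application was built against, and each GPU's compute capability. The following is a minimal sketch using standard CUDA runtime calls; the file name and printed messages are placeholders of our own.

// check_blackwell.cu: report driver/runtime CUDA versions and compute capabilities.
// Build with CUDA Toolkit 12.8, for example: nvcc check_blackwell.cu -o check_blackwell
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    // highest CUDA version the installed driver supports
    cudaRuntimeGetVersion(&runtimeVersion);  // CUDA runtime the application was built against
    std::printf("Driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
                driverVersion / 1000, (driverVersion % 100) / 10,
                runtimeVersion / 1000, (runtimeVersion % 100) / 10);

    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        std::printf("No CUDA-capable device found.\n");
        return 1;
    }
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, i);
        // Blackwell RTX GPUs report compute capability 12.0 (sm_120).
        std::printf("GPU %d: %s, compute capability %d.%d\n", i, prop.name, prop.major, prop.minor);
    }
    return 0;
}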

Building Forward-Compatible CUDA Applications

To build an application that will JIT forward to future NVIDIA GPUs, we recommend building as follows:

  • Ship PTX so that code written for Blackwell GPUs can JIT compile to future architectures. Kernels are compiled by the GPU driver when they are first loaded; this adds a small amount of latency the first time the application runs on a new GPU, but it ensures the application continues to run without an update. The latency depends entirely on the number and complexity of the kernels used.
  • At a minimum, we recommend shipping low-versioned PTX for infrequently used but still-supported GPUs, SASS (architecture-specific assembly) for the GPUs widely used among your user base, and an additional PTX (virtual architecture) version targeting the latest GPU architecture to support future GPUs with the best performance possible. For example, suppose the main user base is on the NVIDIA Ampere GPU architecture (86), NVIDIA Ada Lovelace (89), and Blackwell (120), but older GPUs should keep working and GPUs released after Blackwell should be able to use the features of Blackwell. We can build as shown in the example in the “Recompiling your application with CUDA Toolkit 12.8” section.

Recompiling your application with CUDA Toolkit 12.8

nvcc -gencode arch=compute_52,code=compute_52 \
     -gencode arch=compute_86,code=sm_86 \
     -gencode arch=compute_89,code=sm_89 \
     -gencode arch=compute_120,code=sm_120 \
     -gencode arch=compute_120,code=compute_120 \
     main.cu -o main

Additional Information on CUDA Toolkit and Math Libraries

Our math libraries have the same cubin vs. PTX trade-offs. Be aware that kernels leveraging Tensor Cores are highly specialized to a specific architecture and should ideally never run in forward-compatibility mode.

  • NVIDIA cuDNN
    • Since builds of cuDNN version 9 are based on CUDA 12 or higher, they are hardware-forward compatible. For more information, refer to the NVIDIA cuDNN documentation.
    • Compiling cuDNN PTX on-demand adds significant latency and does not guarantee full performance on future GPUs, so it is strongly recommended that you upgrade.
  • NVIDIA cuBLAS and NVIDIA cuFFT
    • Both libraries include PTX code and are forward-compatible with any new GPU architecture. However, it is strongly recommended that the library be upgraded for full performance, as newer architectures require different optimizations, especially concerning new Tensor Core instructions.
    • For cuBLAS, starting in CUDA 12.8, the new narrow precisions will have limited forward compatibility.
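
Because forward-compatibility behavior differs between library releases, it can be useful to log the math library versions your application actually loads at runtime. Below is a minimal sketch, assuming the application already links against cuBLAS and cuDNN; only the standard version-query calls are used.

// Report the cuBLAS and cuDNN versions loaded at runtime.
#include <cstdio>
#include <cublas_v2.h>
#include <cudnn.h>

int main() {
    cublasHandle_t handle = nullptr;
    if (cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS) {
        int cublasVersion = 0;
        cublasGetVersion(handle, &cublasVersion);  // integer-encoded cuBLAS release
        std::printf("cuBLAS version: %d\n", cublasVersion);
        cublasDestroy(handle);
    }
    // cudnnGetVersion() returns the cuDNN release as a single integer (9.x builds are CUDA 12-based).
    std::printf("cuDNN version: %zu\n", cudnnGetVersion());
    return 0;
}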

TensorRT

TensorRT 10.8 supports NVIDIA Blackwell GPUs and adds support for FP4. If you have not yet upgraded from TensorRT 8.x to 10.x, make sure you are aware of the potentially breaking API changes. The TensorRT API Migration Guide comprehensively lists deprecated APIs and other changes.

Deploying Engines

TensorRT engines behave similarly to CUDA kernels: a normal TensorRT engine contains only cubin code, while a hardware-compatible engine can be thought of as the PTX-like, forward-compatible equivalent. An additional challenge with engines built using 10.x is that an engine can depend on the builder device's specific SM count, which restricts it to running only on devices with at least as many SMs as the builder device. For example:

  • Build on xx60 works on xx60, xx70, xx80, xx90
  • Build on xx80 works on xx80, xx90
  • Build on xx90 works on xx90

To mitigate this issue, refer to the Hardware Compatibility section.
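
Because engine portability in 10.x can depend on the builder GPU's SM count, it can also help to record the SM count of the device an engine was built on and compare it with the end-user device before attempting to load a pre-built engine. A minimal sketch using the CUDA runtime:

// Query the streaming multiprocessor (SM) count of the current device.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;
    cudaGetDevice(&device);
    int smCount = 0;
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, device);
    // Without hardware compatibility, an engine is expected to run only on GPUs of the
    // same architecture with at least as many SMs as the builder device.
    std::printf("Device %d has %d SMs\n", device, smCount);
    return 0;
}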

Software Forward Compatibility

By default, TensorRT engines are compatible only with the version of TensorRT used to build them. However, enabling Version Compatibility during the build allows the engine to run with future TensorRT versions, at a potential reduction in throughput because the lean runtime constrains the available operator implementations. For more information, refer to the Runtime Options section.
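
Version compatibility is opted into at build time through the builder configuration. The sketch below shows the relevant TensorRT 10.x C++ calls, assuming the network definition has already been populated (for example, from an ONNX model); error handling is omitted.

// Sketch: enable version compatibility when building a TensorRT engine.
#include <NvInfer.h>
#include <memory>

std::unique_ptr<nvinfer1::IHostMemory> buildVersionCompatibleEngine(
        nvinfer1::IBuilder& builder, nvinfer1::INetworkDefinition& network) {
    std::unique_ptr<nvinfer1::IBuilderConfig> config(builder.createBuilderConfig());
    // Allow the serialized engine to be loaded by the lean runtime of newer
    // TensorRT versions, at a potential cost in throughput.
    config->setFlag(nvinfer1::BuilderFlag::kVERSION_COMPATIBLE);
    return std::unique_ptr<nvinfer1::IHostMemory>(
        builder.buildSerializedNetwork(network, *config));
}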

Build On Device

TensorRT uses auto-tuning to determine the fastest possible execution path on a given GPU. Since these optimizations vary by GPU SKU, building engines directly on the end-user device ensures the best performance and compatibility.
Building on a device requires nvinfer_builder_resources.dll to be present.
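
For reference, the sketch below shows what an on-device build can look like with the TensorRT 10.x C++ API: parse the bundled ONNX model, build, and cache the serialized engine. The function name, paths, logger, and error handling are simplified placeholders of our own.

// Sketch: build and cache a TensorRT engine on the end-user device from an ONNX model.
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <cstdio>
#include <fstream>
#include <memory>

class ConsoleLogger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::printf("[TRT] %s\n", msg);
    }
};

bool buildEngineOnDevice(const char* onnxPath, const char* enginePath) {
    ConsoleLogger logger;
    std::unique_ptr<nvinfer1::IBuilder> builder(nvinfer1::createInferBuilder(logger));
    std::unique_ptr<nvinfer1::INetworkDefinition> network(builder->createNetworkV2(0));
    std::unique_ptr<nvonnxparser::IParser> parser(nvonnxparser::createParser(*network, logger));
    if (!parser->parseFromFile(onnxPath, static_cast<int>(nvinfer1::ILogger::Severity::kWARNING)))
        return false;

    std::unique_ptr<nvinfer1::IBuilderConfig> config(builder->createBuilderConfig());
    std::unique_ptr<nvinfer1::IHostMemory> engine(builder->buildSerializedNetwork(*network, *config));
    if (!engine) return false;

    // Cache the serialized engine so the (potentially long) auto-tuning step runs only once per device.
    std::ofstream out(enginePath, std::ios::binary);
    out.write(static_cast<const char*>(engine->data()), engine->size());
    return out.good();
}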

Pre-Building Engines

Alternatively, you can pre-build engines and include them in your application to avoid build times on end-user devices. NVIDIA offers the TensorRT-Cloud service, which provides access to various RTX GPUs for building engines.

Trade-offs in Pre-Building

While building engines per SKU is ideal for optimal performance, it is often impractical. Instead, you can:

  1. Build within an architecture: Engines built with TensorRT 10.8 are compatible with GPUs of the same architecture that have at least as many SMs as the builder GPU. For example, an engine built on an RTX 4060 can run on an RTX 4070 through RTX 4090, provided the model fits in the available VRAM. This approach ensures good performance within the same GPU class but may introduce performance variations when running on different classes.
  2. Empirically determine performance: Performance differences across GPUs should be tested, as these vary by model. Generally, GPUs with similar characteristics (for example, RTX 4060/4070 or RTX 4080 Ti/4090) exhibit smaller performance gaps.

Strategies for Building Engines

The total number of engines is calculated using the formula below; the right trade-off is then decided per use case based on the considerations that follow.

#models * #computeCapability * #enginesPerComputeCapability

#models is the number of ONNX files or INetworkDefinitions your application contains, #computeCapability is the number of GPU architectures you target, and #enginesPerComputeCapability is the number of engines you build per architecture.

N Engines per Architecture

To mitigate performance variations, we might want to increase the #enginesPerComputeCapability:

  • For instance, build one engine on an RTX xx60 to cover the low-mid end (for example, RTX xx60–xx70 Ti) and another on an RTX xx80 to cover the mid-high end (for example, RTX xx80–xx90).

Use weight stripping to avoid duplicating model weights across engines (details in the next section).

Hardware Compatibility - Single Engine per Model

TensorRT’s Hardware Compatibility Mode enables building portable engines for Ampere and newer GPUs (#computeCapability=1). In this mode, TensorRT excludes architecture-specific tactics that require more shared memory than certain devices can provide. The mode also lifts the requirement that the target device have at least as many SMs as the build device. However, hardware-compatible engines may see reduced throughput or increased latency compared to non-compatible engines; the degree of impact depends on the network architecture and input sizes. Refer to the Hardware Compatibility section in the Developer Guide for an in-depth discussion.
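
Hardware Compatibility Mode is also selected on the builder configuration. A minimal sketch, assuming a builder config has already been created:

// Sketch: make an engine portable across Ampere and newer GPUs.
#include <NvInfer.h>

void enableHardwareCompatibility(nvinfer1::IBuilderConfig& config) {
    // Restricts tactic selection so the resulting engine can run on any Ampere-or-newer GPU,
    // potentially at reduced throughput or increased latency.
    config.setHardwareCompatibilityLevel(nvinfer1::HardwareCompatibilityLevel::kAMPERE_PLUS);
}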

Coming in TensorRT 10.9, hardware compatibility will be extended to support compatibility within a single architecture for a better performance trade-off.

Building Engines with Plan Stripping

Using weight stripping removes constant weights from an engine, reducing its binary size by up to 99%. This process does not affect performance. To deploy with weight stripping:

  1. Include the original ONNX model in the application.
  2. Refit the stripped engine on the end-user device using the ONNX model, as sketched below.
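
The following is a minimal sketch of both sides of this workflow with the TensorRT 10.x C++ and ONNX parser APIs: building the weight-stripped engine offline, and refitting it from the bundled ONNX file on the end-user device. The function names are our own; the flag and refitter classes are assumed to be the TensorRT 10.x ones, and error handling is reduced to boolean returns.

// Sketch: weight-stripped build (offline) and ONNX-based refit (on the end-user device).
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <memory>

// Offline: build an engine without weights; ship the resulting plan plus the ONNX file.
std::unique_ptr<nvinfer1::IHostMemory> buildStrippedEngine(
        nvinfer1::IBuilder& builder, nvinfer1::INetworkDefinition& network) {
    std::unique_ptr<nvinfer1::IBuilderConfig> config(builder.createBuilderConfig());
    config->setFlag(nvinfer1::BuilderFlag::kSTRIP_PLAN);
    return std::unique_ptr<nvinfer1::IHostMemory>(builder.buildSerializedNetwork(network, *config));
}

// On the end-user device: restore the weights of a deserialized stripped engine
// from the shipped ONNX model.
bool refitFromOnnx(nvinfer1::ICudaEngine& engine, nvinfer1::ILogger& logger, const char* onnxPath) {
    std::unique_ptr<nvinfer1::IRefitter> refitter(nvinfer1::createInferRefitter(engine, logger));
    std::unique_ptr<nvonnxparser::IParserRefitter> parserRefitter(
        nvonnxparser::createParserRefitter(*refitter, logger));
    return parserRefitter->refitFromFile(onnxPath) && refitter->refitCudaEngine();
}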

TensorRT supports Fine-Grained Refit Builds for advanced use cases, which allow partial updates of specific weights without regenerating the entire engine. For more information, refer to the Refitting an Engine section.

This approach ensures reduced deployment size while maintaining flexibility and performance.

Outlook

Moving forward, TensorRT will introduce JIT capabilities to simplify the process of deploying TensorRT whilst keeping performance intact and making engines more portable. Stay tuned for updates here.

ONNX Runtime

CUDA Execution Provider

The CUDA Execution Provider does not include PTX; we recommend compiling the library from source against CUDA 12.8 and updating all math libraries (cuDNN, cuBLAS, and so on) to the versions released for CUDA 12.8.

DML Execution Provider

The DML Execution Provider only requires a Blackwell-compatible driver (driver R570 or higher) and will run out of the box at full performance.

TensorRT Execution Provider

Use a binary-compatible version of TensorRT 10.x, and make sure you are familiar with the deployment constraints described in the TensorRT section above. If compiling from source, we recommend compiling directly against TensorRT 10.8.

llama.cpp

Llama.cpp is compatible with the latest Blackwell GPUs. For maximum performance, we recommend the upgrades below, depending on the backend you run llama.cpp with.

CUDA Backend

Building with CUDA 12.8 for compute capability 120 and an upgraded cuBLAS avoids PTX JIT compilation for end users and provides Blackwell-optimized cuBLAS routines. Compilation for compute capability 120 can be achieved by including 120 in the CMake option CMAKE_CUDA_ARCHITECTURES, for example, CMAKE_CUDA_ARCHITECTURES=52-virtual;75-real;86-real;89-real;120.

Vulkan Backend

For best performance, use an up-to-date llama.cpp build that includes the December 2024 optimizations for VK_NV_cooperative_matrix2 (in particular, Pull Request #10206 in ggerganov/llama.cpp: “vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and FlashAttention2” by jeffbolznv), which enables the use of Tensor Cores in the Vulkan backend on RTX GPUs.

PyTorch

PyPI

To use PyTorch natively on Windows with Blackwell, a PyTorch build with CUDA 12.8 is required. PyTorch will provide these builds soon. For a list of the latest available releases, refer to the PyTorch documentation.

To use PyTorch for Linux x86_64 on NVIDIA Blackwell RTX GPUs, use the latest nightly builds or the command below.

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128

WSL 2

For the best experience, we recommend using PyTorch in a Linux environment, either natively or through WSL 2 on Windows. To get started with WSL 2 on Windows, refer to Install WSL 2 and Using NVIDIA GPUs with WSL2.

Docker

For Day 0 support, we offer a pre-packaged container with PyTorch and CUDA 12.8 to enable Blackwell GPUs. The container can be found on NGC with the 25.01 tag.
