Embedding PTX with nvfortran to support future devices

I’m using nvfortran 23.11 on Linux and use OpenACC for GPU support.

We currently build executables for a set of specific compute capabilities:
(-gpu=cc50,cc60,cc61,cc70,cc75,cc80,cc86,cc89,cc90,keep)

Therefore our executables do not support future GPUs with higher compute capabilities, such as cc100 and cc120 (Blackwell).

I have read that one can support unknown future GPUs by embedding PTX in the executable.

I don’t know how to do this with nvfortran. Can somebody give me the command-line switches I have to set?

nvcc supports “-gencode” to do so. Does this feature also exist for nvfortran?
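For reference, this is roughly how it is done with nvcc (illustrative flags and file names, not our actual build line): the last -gencode entry embeds PTX for compute_90 so that newer, unknown GPUs can JIT-compile it at runtime.

```shell
# Embed SASS binaries for cc80/cc90 plus PTX for compute_90;
# the code=compute_90 entry is what keeps forward compatibility.
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_90,code=sm_90 \
     -gencode arch=compute_90,code=compute_90 \
     kernel.cu -o kernel
```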

Regards
Benedikt

Hi Benedikt and welcome!

As of mid-2022, we embed PTX automatically. Specifically:

  • Without “-gpu=ccNN”, the added PTX is for the current compute capability
  • With “-gpu=ccNN,ccNN,etc.”, the PTX is generated from the highest CC specified (cc90 in your case)
  • With “-gpu=ccall”, the PTX is the highest CC from the supported list at compile time.
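As a sketch, the three cases above look like this on the command line (the source file name is just a placeholder):

```shell
# Case 1: no -gpu=ccNN; PTX targets the compute capability of the build machine's GPU
nvfortran -acc test.f90

# Case 2: explicit CC list; PTX is generated for the highest CC listed (cc90 here)
nvfortran -acc -gpu=cc70,cc80,cc90 test.f90

# Case 3: ccall; PTX is generated for the highest CC the compiler supports
nvfortran -acc -gpu=ccall test.f90
```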

Note that while PTX support is there, I would recommend updating to our latest release, 25.9, since it has native binary support for Blackwell.

Also, there have been a few bugs in the PTX generation (there’s an open one scheduled to be fixed in 25.11). Not that I’d expect you to encounter one, but moving to a newer compiler should decrease the likelihood.

Hope this helps,
Mat

We are currently working on migrating to version 25.11 but ran into another problem. Maybe we’ll open another ticket for this one.

The point is: our customer still gets the error message

Rebuild this file with -gpu=cc120 to use NVIDIA Tesla GPU 0

when he starts our executable, compiled with nvfortran 23.11. (He exported CUDA_FORCE_PTX_JIT=1).
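For context, he set the variable before launching, roughly like this (the executable name is a placeholder):

```shell
# CUDA_FORCE_PTX_JIT=1 makes the CUDA runtime ignore embedded binary (SASS)
# code and JIT-compile from the embedded PTX instead.
export CUDA_FORCE_PTX_JIT=1
./our_solver
```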

Could you confirm:

  • The executable should run on Blackwell GPUs. (I.e., we understood the whole concept of PTX embedding correctly.)
  • This feature (PTX is embedded and runs on newer GPUs) is supported not only by CUDA Fortran but also by OpenACC. (We don’t use CUDA Fortran.)

Regards

Benedikt

In theory, yes, and that is what engineering has told me. However, when I just tried it myself, I saw the same issue as you. It even occurs with the big hammer “-gpu=nordc”, which uses only the embedded PTX. I confirmed that the PTX is there; I’m just not sure why the JIT is failing at runtime.

They did give me the caveat that there were issues with earlier releases, so this might be one of those. I see it working as expected with 25.3 and later, so updating to a more recent compiler version should help.

Let me know what issues you’re having with the latest version (presumably you meant 25.9, not 25.11), and we can work through them.

> This feature (PTX is embedded and runs on newer GPUs) is not only supported by CUDA-Fortran, but also by OpenACC. (We don’t use CUDA-Fortran)

Yes, it would be the same for all the offload models we support, OpenACC, OpenMP, CUDA Fortran and standard language parallelism (STDPAR).

-Mat

Hi Mat,

I’m a colleague of Benedikt and working on the same project. First, let me thank you for your detailed answer!

As Benedikt mentioned, we would like to have a single executable that runs on as many GPU generations as possible. We and our customers use very different generations, running either on native Linux or on WSL on Windows machines. And one of our customers switched to a Blackwell GPU, which is why this thread started.

Meanwhile, I’ve been trying to use nvfortran from the current HPC Toolkit 25.9 and ran into several version problems:

  1. When using nvfortran 25.9 with the included CUDA Toolkit, the executable is built against CUDA 13. My dev PC is a Windows machine with WSL. The newest vendor-specific GPU driver for my Ampere card comes with CUDA 12.8, so I cannot run the executable.
  2. I then tested with CUDA Toolkit 12.1 by setting the variable NVHPC_CUDA_HOME=/usr/local/cuda-12.1. This compiles and runs fine with -gpu=ccall on my machine. I also tested an older compute-capability setting, -gpu=cc50. This runs fine on my Ampere card (cc86), so I conclude that the PTX is working, too.
  3. To see if I can include binary code for cc120 (Blackwell), I tested with CUDA Toolkit 12.8 by setting NVHPC_CUDA_HOME=/usr/local/cuda-12.8 and could not compile with -gpu=ccall or -gpu=cc120. The error message is:
    nvdd… looked for llvm-as at /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/compilers/bin/tools/nvvm-next/llvm-as
    NVFORTRAN-F-0155-Compiler failed to translate accelerator region
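For reference, the working setup from point 2 looks roughly like this (the source and output names are placeholders):

```shell
# Point the HPC SDK compilers at a specific CUDA toolkit installation
export NVHPC_CUDA_HOME=/usr/local/cuda-12.1
# Build with binary code for all supported CCs plus embedded PTX
nvfortran -acc -gpu=ccall solver.f90 -o solver
```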

For now, I think we will go with solution 2, maybe stepping up to CUDA Toolkit 12.6. But maybe I’m missing something.
Do you have any other advice? Or maybe a way around the error message when using CUDA Toolkit 12.8 to include binary cc120 code?

Regards,
Michael

Hi Michael and welcome!

  1. Due to size, there are two different download packages. The smaller one contains only the latest CUDA (13.0 in the 25.9 case), and the second contains both the latest and the previous major version of CUDA (13.0 and 12.9 for 25.9). I suspect you downloaded the first. Now, I’m not 100% sure that CUDA 12.9 will work with the WSL 12.8 driver (likely, but not something I’ve tested), but you can try downloading the “cuda_multi” package:

wget https://developer.download.nvidia.com/hpc-sdk/25.9/nvhpc_2025_259_Linux_x86_64_cuda_multi.tar.gz

  2. Using NVHPC_CUDA_HOME should be fine as well. It’s primarily there for use with newer CUDA versions, but as long as you don’t go back too far, previous CUDA versions work, too.

  3. I’m not sure about this one. The error is coming from the device assembler, so it’s possible that the CUDA 12.8 device code generator (libnvvm) has some incompatibility with the latest assembler. If you can try the CUDA 12.9 that ships with 25.9, that would be appreciated. If the error persists, I’ll see if I can reproduce it and determine whether there is a workaround.

-Mat

Hi Mat,

Thanks for your suggestions, and sorry for my late response. I just got back to checking which version of the toolkit we are using.

I installed the HPC toolkit via apt in Ubuntu with
sudo apt install nvhpc-25-9-cuda-multi
which I believe is the same “bigger” version you suggested. I found both CUDA 12.9 and 13.0 folders in the installation.

However, compiling with -gpu=ccall,cuda12.9 gives the same error as when using the separate CUDA 12.8 toolkit with NVHPC_CUDA_HOME=/usr/local/cuda-12.8. The llvm assembler cannot be found:

nvdd-Error-Required tool llvm-as was not found
nvdd… looked for llvm-as at /opt/nvidia/hpc_sdk/Linux_x86_64/25.9/compilers/bin/tools/nvvm-next/llvm-as
NVFORTRAN-F-0155-Compiler failed to translate accelerator region

The folder name “nvvm-next” puzzles me. Could it be that something is missing in the HPC Toolkit installation?

Regards, Michael

It turns out that this was a packaging issue in our initial 25.9.0 release. I knew they needed to repost a new 25.9.1 but didn’t know what the error was. This is it.

While I’m waiting for more details from engineering, you might try doing an “apt-get update” to see if that updates your install to 25.9.1; if not, you may need to redownload and reinstall the package.

-Mat

Perfect! I updated to 25.9-1 via apt. The switch -gpu=ccall,cuda12.9 is working now. As far as I can see, it includes binary code from cc50 up to cc121. I tested the executable on two machines with older CUDA runtimes (12.7 and 12.8), and it is running fine. Thanks again, Mat!
