How should I correctly use sm_XX and compute_XX?

Hi all,

I am trying to get some CUDA code that is called from a Python package to work, but it fails with the following error:

RuntimeError: CUDA error: no kernel image is available for execution on the device

I see that no nvcc_args were passed while building the Python package, but shouldn't it still work even then?

CUDA Toolkit is: 10.2.89
GPU: V100 (the datasheet says this is the Volta architecture with compute capability 7.0)

Now I looked at the Volta compatibility guide:

https://docs.nvidia.com/cuda/archive/10.2/volta-compatibility-guide/index.html

And I am sure I am doing something wrong in terms of settings. The doc gives this example for CUDA 9:
/usr/local/cuda/bin/nvcc \
  -gencode=arch=compute_50,code=sm_50 \
  -gencode=arch=compute_52,code=sm_52 \
  -gencode=arch=compute_60,code=sm_60 \
  -gencode=arch=compute_61,code=sm_61 \
  -gencode=arch=compute_70,code=sm_70 \
  -gencode=arch=compute_70,code=compute_70 \
  -O2 -o mykernel.o -c mykernel.cu

Some questions:

  1. When should I use code=sm_XX and when code=compute_XX, or should both be used?
  2. What should the arguments of -gencode be when I want to target a single GPU architecture without further settings?
  3. When should the CUDA_FORCE_PTX_JIT variable be set?

I know there are some technical details about cubin versions and PTX versions, but I could not make sense of them. It would be really helpful if someone could give a simple set of guidelines for each of the use cases :)

Thanks a lot…

When you read the section on code generation (“Building for Maximum Compatibility”) in the Best Practices Guide, what exactly was unclear? You may want to consult the nvcc manual in addition to the Best Practices Guide.

sm_XX pertains to machine code (SASS, in CUDA parlance) for a particular GPU hardware architecture. compute_XX pertains to virtual architectures represented by the intermediate PTX format. So in your example, the compiler is instructed to produce a fat binary containing SASS for CC 5.0, CC 5.2, CC 6.0, CC 6.1, and CC 7.0, as well as PTX for CC 7.0. This is a best practice: include SASS for all architectures that the application needs to support, as well as PTX for the latest architecture (CC 7.0 for the CUDA version referenced), which can be JIT compiled when a new (as yet unknown) GPU architecture rolls around.

If you intend to run with a CC 7.0 (Volta) GPU, the compilation options in your example should work just fine for that. If lengthy compilation times bother you, you can just pare down the list of architectures for which code is generated.
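Since the code is built as part of a Python package, here is a minimal sketch of how such a flag list could be assembled for one or several architectures. The helper name gencode_flags is hypothetical, and how the resulting flags reach nvcc depends on the package's own build script, which I have not seen.

```python
def gencode_flags(sass_archs, ptx_arch=None):
    """Build nvcc -gencode flags: SASS for each arch, optional PTX for one arch.

    Architectures are given as integers such as 70 for CC 7.0.
    This is a hypothetical helper, not part of any CUDA tool.
    """
    flags = []
    for cc in sorted(set(sass_archs)):
        # code=sm_XX embeds machine code (SASS) for that real architecture
        flags.append(f"-gencode=arch=compute_{cc},code=sm_{cc}")
    if ptx_arch is not None:
        # code=compute_XX embeds PTX, which the driver can JIT compile on newer GPUs
        flags.append(f"-gencode=arch=compute_{ptx_arch},code=compute_{ptx_arch}")
    return flags


# Single-architecture build for a V100 (CC 7.0): SASS plus PTX for forward compatibility
v100_flags = gencode_flags([70], ptx_arch=70)
print(" ".join(v100_flags))
```

Dropping ptx_arch gives a SASS-only build, which is smaller but will not load on a future architecture with a different major compute capability.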

If you use the -gencode options shown, the “no kernel image is available” error should not occur when running on a V100 (CC 7.0). When the CUDA runtime loads kernels onto the GPU, it first looks for matching SASS in the fat binary. If it cannot find that, it looks for PTX that it can JIT compile. If JIT compilation is inhibited or no suitable PTX is found, it fails with this error. Given that the example specifies that both SASS and PTX code suitable for CC 7.0 are generated, loading the kernel(s) should not fail when the current device is a V100 GPU.
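The lookup order described above can be modeled roughly as follows. This is a simplified sketch of the selection rules, not the driver's actual implementation: select_kernel_image is a hypothetical name, compute capabilities are written as (major, minor) tuples, and the force_ptx_jit parameter stands in for setting CUDA_FORCE_PTX_JIT=1, which makes the driver ignore embedded SASS and JIT compile the PTX instead (mainly useful for checking that PTX really is embedded). Other ways of inhibiting JIT are not modeled.

```python
def select_kernel_image(device_cc, sass_ccs, ptx_ccs, force_ptx_jit=False):
    """Simplified model of which fat-binary entry is used on a given device.

    device_cc: (major, minor) of the GPU, e.g. (7, 0) for a V100.
    sass_ccs / ptx_ccs: lists of (major, minor) present in the fat binary.
    """
    if not force_ptx_jit:
        # SASS runs on the same major architecture with an equal or higher minor revision
        usable = [cc for cc in sass_ccs
                  if cc[0] == device_cc[0] and cc[1] <= device_cc[1]]
        if usable:
            return ("sass", max(usable))
    # PTX can be JIT compiled for any device at or above its virtual architecture
    usable = [cc for cc in ptx_ccs if cc <= device_cc]
    if usable:
        return ("ptx-jit", max(usable))
    raise RuntimeError("no kernel image is available for execution on the device")


# The -gencode list from the question, evaluated for a V100
sass = [(5, 0), (5, 2), (6, 0), (6, 1), (7, 0)]
ptx = [(7, 0)]
print(select_kernel_image((7, 0), sass, ptx))
```

Under this model, a binary that contains only, say, CC 6.0 SASS and no PTX raises the error on a V100, which is one plausible explanation for the failure when the package was built without any architecture flags.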

So something doesn’t add up here. You can use the cuobjdump utility to check for which GPU architectures SASS and/or PTX are present in the fat binary.

Thanks for those helpful hints and for pointing me to the right documentation. I solved it by targeting a single architecture and leaving it at that.

cuobjdump visibility_kernel.o

Fatbin ptx code:

arch = sm_70
code version = [6,5]
producer =
host = linux
compile_size = 64bit
compressed

Fatbin elf code:

arch = sm_70
code version = [1,7]
producer =
host = linux
compile_size = 64bit
