How can I make a PTX fat binary from individual PTX files?

I’m working with a large legacy code base that makes liberal use of #if __CUDA_ARCH__ >= ... in its .cu files. Currently, the build process creates one .ptx file from each .cu file for a single specific architecture. However, I need to create a fat binary for each .cu file that contains code for multiple architectures. When I put multiple -gencode ... options on the nvcc command line, things break, because once nvcc is compiling for multiple architectures it apparently no longer defines __CUDA_ARCH__. I’ve also tried compiling one .ptx file per architecture and then combining them all with fatbinary --image=profile=compute_90,file=some_file_90.ptx .... However, this seems to result in SASS code rather than PTX, which is not what I want (ideally, I would like SASS+PTX in the fat binary, but PTX alone is OK too). How can I do what I want here?

I am not sure how you reached that conclusion. Here is a little program, fatbinary.cu, that I used to explore that claim; far from being able to affirm it, I can refute it:

[NOTE: Programming off the cuff, I named my test program fatbinary.cu here, but of course that is a really poor choice because the resulting executable could easily cause conflicts with CUDA’s fatbinary utility that is part of the toolchain. It would be better to name it fatbinary_test.cu or somesuch.]

#include <cstdio>
#include <cstdlib>

#define xstr(a) str(a)
#define str(a) #a

__global__ void kernel (void)
{
    printf ("%s\n", xstr(__CUDA_ARCH__));
}

int main (void)
{
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

I built it for the compute capabilities of the two GPUs in my system, then ran it on each of them in turn:

C:\Users\Norbert\My Programs>nvcc -gencode arch=compute_61,code=sm_61 -gencode arch=compute_75,code=sm_75 -o fatbinary.exe fatbinary.cu
fatbinary.cu
tmpxft_00000dac_00000000-10_fatbinary.compute_75.cudafe1.cpp
   Creating library fatbinary.lib and object fatbinary.exp

C:\Users\Norbert\My Programs>set CUDA_VISIBLE_DEVICES=1

C:\Users\Norbert\My Programs>fatbinary
610

C:\Users\Norbert\My Programs>set CUDA_VISIBLE_DEVICES=0

C:\Users\Norbert\My Programs>fatbinary
750

Seems to work just fine.

When you use -gencode, the code= part controls what kind of code is being generated. If you specify a hardware architecture, like sm_75, SASS is emitted into the fat binary. If you specify a virtual architecture, like compute_75, PTX is emitted into the fat binary. I repeated the above experiment, but compiling to PTX only this time:

C:\Users\Norbert\My Programs>nvcc -gencode arch=compute_61,code=compute_61 -gencode arch=compute_75,code=compute_75 -o fatbinary.exe fatbinary.cu
fatbinary.cu
tmpxft_00003710_00000000-10_fatbinary.compute_75.cudafe1.cpp
   Creating library fatbinary.lib and object fatbinary.exp

C:\Users\Norbert\My Programs>set CUDA_VISIBLE_DEVICES=0

C:\Users\Norbert\My Programs>fatbinary
750

C:\Users\Norbert\My Programs>set CUDA_VISIBLE_DEVICES=1

C:\Users\Norbert\My Programs>fatbinary
610

One can use cuobjdump {--dump-sass | --dump-ptx} to inspect what got emitted into the fat binary.

That is the correct method, and I dispute the claim that it will result in __CUDA_ARCH__ being undefined. To make a “PTX fat binary”, you would specify compilation as -gencode arch=compute_XX,code=compute_XX -gencode arch=compute_YY,code=compute_YY … and so on.

If you want SASS + PTX, then you would do: -gencode arch=compute_XX,code=compute_XX -gencode arch=compute_XX,code=sm_XX ... for each architecture that you wanted.

Here is a simple example:

# cat test.cu
__device__ void testf(){

#if __CUDA_ARCH__ == 750
#warning "compiling for cc7.5"
#endif
#if __CUDA_ARCH__ == 800
#warning "compiling for cc8.0"
#endif
}

__global__ void k(){
        testf();
}
# nvcc -c -gencode arch=compute_75,code=compute_75 test.cu -o test.o
test.cu:4:2: warning: #warning "compiling for cc7.5" [-Wcpp]
    4 | #warning "compiling for cc7.5"
      |  ^~~~~~~
# cuobjdump test.o

Fatbin ptx code:
================
arch = sm_75
code version = [8,2]
host = linux
compile_size = 64bit
compressed
# nvcc -c -gencode arch=compute_75,code=compute_75 -gencode arch=compute_80,code=compute_80 test.cu -o test.o
test.cu:4:2: warning: #warning "compiling for cc7.5" [-Wcpp]
    4 | #warning "compiling for cc7.5"
      |  ^~~~~~~
test.cu:7:2: warning: #warning "compiling for cc8.0" [-Wcpp]
    7 | #warning "compiling for cc8.0"
      |  ^~~~~~~
# cuobjdump test.o                          
Fatbin ptx code:
================
arch = sm_75
code version = [8,2]
host = linux
compile_size = 64bit
compressed

Fatbin ptx code:
================
arch = sm_80
code version = [8,2]
host = linux
compile_size = 64bit
compressed
# nvcc -c -gencode arch=compute_75,code=compute_75 -gencode arch=compute_80,code=compute_80 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 test.cu -o test.o
test.cu:4:2: warning: #warning "compiling for cc7.5" [-Wcpp]
    4 | #warning "compiling for cc7.5"
      |  ^~~~~~~
test.cu:7:2: warning: #warning "compiling for cc8.0" [-Wcpp]
    7 | #warning "compiling for cc8.0"
      |  ^~~~~~~
# cuobjdump test.o

Fatbin elf code:
================
arch = sm_75
code version = [1,7]
host = linux
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_75
code version = [8,2]
host = linux
compile_size = 64bit
compressed

Fatbin elf code:
================
arch = sm_80
code version = [1,7]
host = linux
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_80
code version = [8,2]
host = linux
compile_size = 64bit
compressed
#

Thanks for the help, I think I understand more of this now. I believe these errors are coming from the host code compilation pass.

__CUDA_ARCH__ is undefined during host code compilation. If your host code depends on it, that is almost certainly questionable/improper coding. From here:

The host code (the non-GPU code) must not depend on it.
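
To illustrate the distinction, here is a minimal sketch (not from the thread above): in device code, __CUDA_ARCH__ is defined during each per-architecture device compilation pass and can be used in #if guards; in host code it is never defined, so the correct approach there is to query the compute capability of the actual device at run time, e.g. via cudaGetDeviceProperties.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(void)
{
// Device compilation pass: __CUDA_ARCH__ is defined, one value per -gencode target.
// Guarding with defined(__CUDA_ARCH__) keeps the host pass from evaluating it.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    printf("device code path for cc >= 8.0\n");
#else
    printf("device code path for cc < 8.0\n");
#endif
}

int main(void)
{
    // Host compilation pass: __CUDA_ARCH__ is NOT defined here.
    // Query the device's compute capability at run time instead.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("running on device with cc %d.%d\n", prop.major, prop.minor);

    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

With a multi-architecture build (e.g. -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80), the device function is compiled once per target with the corresponding __CUDA_ARCH__ value, while the host function is compiled exactly once with no __CUDA_ARCH__ at all, which is why host-side dependence on the macro is improper.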