How can I make a PTX fat binary from individual PTX files?

I’m working with a large legacy code base that makes liberal use of #if __CUDA_ARCH__ >= ... in its .cu files. Currently, the build process creates one .ptx file from each .cu file for a single specific architecture. However, I need to create a fat binary for each .cu file that contains code for multiple architectures. When I put multiple -gencode ... options on the nvcc command line, things break, because once nvcc is compiling for multiple architectures it apparently no longer defines __CUDA_ARCH__. I’ve also tried compiling one .ptx file per architecture and then combining them all with fatbinary --image=profile=compute_90,file=some_file_90.ptx .... However, this seems to result in SASS code rather than PTX, which is not what I want (ideally, I would like SASS+PTX in the fat binary, but PTX alone is OK too). How can I do what I want here?

I am not sure how you reached that conclusion. Here is a little program, fatbinary.cu, that I used to explore that claim; far from being able to affirm it, I can refute it:

[NOTE: Programming off the cuff, I named my test program fatbinary.cu here, but of course that is a really poor choice because the resulting executable could easily cause conflicts with CUDA’s fatbinary utility that is part of the toolchain. It would be better to name it fatbinary_test.cu or somesuch.]

#include <cstdio>
#include <cstdlib>

#define xstr(a) str(a)
#define str(a) #a

__global__ void kernel (void)
{
    printf ("%s\n", xstr(__CUDA_ARCH__));
}

int main (void)
{
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}

I built it for the compute capabilities of the two GPUs in my system, then ran it on each of them in turn:

C:\Users\Norbert\My Programs>nvcc -gencode arch=compute_61,code=sm_61 -gencode arch=compute_75,code=sm_75 -o fatbinary.exe fatbinary.cu
fatbinary.cu
tmpxft_00000dac_00000000-10_fatbinary.compute_75.cudafe1.cpp
   Creating library fatbinary.lib and object fatbinary.exp

C:\Users\Norbert\My Programs>set CUDA_VISIBLE_DEVICES=1

C:\Users\Norbert\My Programs>fatbinary
610

C:\Users\Norbert\My Programs>set CUDA_VISIBLE_DEVICES=0

C:\Users\Norbert\My Programs>fatbinary
750

Seems to work just fine.

When you use -gencode, the code= part controls what kind of code is being generated. If you specify a hardware architecture, like sm_75, SASS is emitted into the fat binary. If you specify a virtual architecture, like compute_75, PTX is emitted into the fat binary. I repeated the above experiment, but compiling to PTX only this time:

C:\Users\Norbert\My Programs>nvcc -gencode arch=compute_61,code=compute_61 -gencode arch=compute_75,code=compute_75 -o fatbinary.exe fatbinary.cu
fatbinary.cu
tmpxft_00003710_00000000-10_fatbinary.compute_75.cudafe1.cpp
   Creating library fatbinary.lib and object fatbinary.exp

C:\Users\Norbert\My Programs>set CUDA_VISIBLE_DEVICES=0

C:\Users\Norbert\My Programs>fatbinary
750

C:\Users\Norbert\My Programs>set CUDA_VISIBLE_DEVICES=1

C:\Users\Norbert\My Programs>fatbinary
610

One can use cuobjdump {--dump-sass | --dump-ptx} to inspect what got emitted into the fat binary.

That is the correct method, and I dispute the claim that it will result in __CUDA_ARCH__ being undefined. To make a “PTX fat binary”, you would specify compilation as -gencode arch=compute_XX,code=compute_XX -gencode arch=compute_YY,code=compute_YY … and so on.

If you want SASS + PTX, then you would do: -gencode arch=compute_XX,code=compute_XX -gencode arch=compute_XX,code=sm_XX ... for each architecture that you wanted.

Here is a simple example:

# cat test.cu
__device__ void testf(){

#if __CUDA_ARCH__ == 750
#warning "compiling for cc7.5"
#endif
#if __CUDA_ARCH__ == 800
#warning "compiling for cc8.0"
#endif
}

__global__ void k(){
        testf();
}
# nvcc -c -gencode arch=compute_75,code=compute_75 test.cu -o test.o
test.cu:4:2: warning: #warning "compiling for cc7.5" [-Wcpp]
    4 | #warning "compiling for cc7.5"
      |  ^~~~~~~
# cuobjdump test.o

Fatbin ptx code:
================
arch = sm_75
code version = [8,2]
host = linux
compile_size = 64bit
compressed
# nvcc -c -gencode arch=compute_75,code=compute_75 -gencode arch=compute_80,code=compute_80 test.cu -o test.o
test.cu:4:2: warning: #warning "compiling for cc7.5" [-Wcpp]
    4 | #warning "compiling for cc7.5"
      |  ^~~~~~~
test.cu:7:2: warning: #warning "compiling for cc8.0" [-Wcpp]
    7 | #warning "compiling for cc8.0"
      |  ^~~~~~~
# cuobjdump test.o                          
Fatbin ptx code:
================
arch = sm_75
code version = [8,2]
host = linux
compile_size = 64bit
compressed

Fatbin ptx code:
================
arch = sm_80
code version = [8,2]
host = linux
compile_size = 64bit
compressed
# nvcc -c -gencode arch=compute_75,code=compute_75 -gencode arch=compute_80,code=compute_80 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 test.cu -o test.o
test.cu:4:2: warning: #warning "compiling for cc7.5" [-Wcpp]
    4 | #warning "compiling for cc7.5"
      |  ^~~~~~~
test.cu:7:2: warning: #warning "compiling for cc8.0" [-Wcpp]
    7 | #warning "compiling for cc8.0"
      |  ^~~~~~~
# cuobjdump test.o

Fatbin elf code:
================
arch = sm_75
code version = [1,7]
host = linux
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_75
code version = [8,2]
host = linux
compile_size = 64bit
compressed

Fatbin elf code:
================
arch = sm_80
code version = [1,7]
host = linux
compile_size = 64bit

Fatbin ptx code:
================
arch = sm_80
code version = [8,2]
host = linux
compile_size = 64bit
compressed
#

Thanks for the help, I think I understand more of this now. I believe these errors are coming from the host code compilation pass.

__CUDA_ARCH__ is undefined during host code compilation. If your host code depends on it, that is almost certainly questionable/improper coding. From here:

The host code (the non-GPU code) must not depend on it.
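
To illustrate the distinction, here is a minimal sketch (not from the thread above): in device code, __CUDA_ARCH__ is defined during each per-architecture device compilation pass and can be used in #if guards; in host code it is never defined, so the correct approach there is to query the compute capability of the actual device at run time, e.g. via cudaGetDeviceProperties.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(void)
{
// Device compilation pass: __CUDA_ARCH__ is defined, one value per -gencode target.
// Guarding with defined(__CUDA_ARCH__) keeps the host pass from evaluating it.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    printf("device code path for cc >= 8.0\n");
#else
    printf("device code path for cc < 8.0\n");
#endif
}

int main(void)
{
    // Host compilation pass: __CUDA_ARCH__ is NOT defined here.
    // Query the device's compute capability at run time instead.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("running on device with cc %d.%d\n", prop.major, prop.minor);

    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

With a multi-architecture build (e.g. -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80), the device function is compiled once per target with the corresponding __CUDA_ARCH__ value, while the host function is compiled exactly once with no __CUDA_ARCH__ at all, which is why host-side dependence on the macro is improper.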