Using device link-time optimization results in much larger fatbinaries

Let’s say I compile and link an application with device link-time optimization (DLTO, available since CUDA 11.2) using the following options:

Compile
-gencode=arch=compute_52,code=[compute_52,lto_52,lto_61]

Link
-gencode=arch=compute_52,code=[sm_52,sm_61] -dlto

As expected, this creates a fatbinary containing PTX for compute_52, LTO intermediates for sm_52 and sm_61, and link-time optimized SASS for sm_52 and sm_61.

However, according to cuobjdump -all (output: cuobjdump.txt (4.7 KB)), the fatbinary also contains ELF code (SASS? LTO SASS?) for all GPU architectures supported by the current CUDA toolkit (e.g. sm_35 through sm_86 in the case of CUDA 11), as well as PTX for sm_86. This greatly increases the size of the resulting fatbinary.

It is unclear to me how and why these extra fatbin sections are generated. What purpose do they serve? Is there a compiler/linker flag to disable their generation?

My guess (I can’t be sure without seeing your whole test case) is that the other ELF code comes from libcudadevrt, which can be linked in implicitly. I assume that plain cuobjdump (without -all) shows only the ELF you requested. With -all, it shows all relocatable ELF from every object that is linked in, including objects from libraries. The libcudadevrt library contains ELF for all architectures, but only the ELF for the requested architecture is used when linking.
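To see at a glance which architectures show up in a dump, the cuobjdump listing can be tallied per section kind and architecture. The following is only a rough sketch: the function name tally_fatbin_sections is made up for illustration, and it assumes the listing uses "Fatbin <kind> code:" header lines followed by an "arch = sm_XX" line, which may differ between toolkit versions.

```shell
#!/bin/sh
# Hypothetical helper: count fatbin sections per (kind, arch) in cuobjdump
# text output read from stdin.
# ASSUMPTION: each section starts with "Fatbin <kind> code:" and is followed
# by a line of the form "arch = sm_XX".
tally_fatbin_sections() {
    awk '
        { sub(/\r$/, "") }                               # tolerate CRLF output on Windows
        /^Fatbin [a-z]+ code:/ { kind = $2; next }       # remember the section kind (elf, ptx, ...)
        /^arch = / && kind != "" { count[kind " " $3]++; kind = "" }
        END { for (k in count) print k, count[k] }       # e.g. "elf sm_52 1"
    ' | LC_ALL=C sort
}

# Example usage (app.exe is a placeholder for the linked binary):
#   cuobjdump -all app.exe | tally_fatbin_sections
#   cuobjdump app.exe      | tally_fatbin_sections
```

Comparing the two outputs should show whether the extra architectures appear only with -all, which would be consistent with them coming from relocatable ELF in linked-in libraries.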

By the way, looking at some examples I see that LTO is including libcudadevrt in cases where it doesn’t need to. That will be fixed in the next release.

I noticed it in a more complex application, but I can reproduce it with the following minimal working example:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

__global__ void kernel(int *a)
{
    int i = threadIdx.x;
    a[i] = i;
}

int main()
{
    const int size = 5;
    int* dev_a;
    cudaMalloc((void**)&dev_a, size * sizeof(int));
    kernel<<<1, size>>>(dev_a);
    cudaDeviceSynchronize();

    return 0;
}

Non-DLTO

COMPILE: -gencode=arch=compute_52,code=compute_52 -rdc=true
LINK:    -gencode=arch=compute_52,code=sm_52

When compiled/linked without DLTO (see above), the resulting binary is 143 KB and doesn’t contain any non-sm_52 ELF code sections: cuobjdump.txt (312 Bytes).

DLTO

COMPILE: -gencode=arch=compute_52,code=compute_52 -rdc=true
LINK:    -gencode=arch=compute_52,code=sm_52 -dlto

When compiled/linked with DLTO (see above), the resulting binary is 951 KB and contains the non-sm_52 ELF code sections: cuobjdump_dlto.txt (2.1 KB).

All examples compiled using MSVC 15.9.36 with CUDA 11.3.

Looking at the command line output, both examples also seem to link against cudadevrt.lib even though it is not explicitly included in the linker command line options.
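One way to confirm the implicit link is to filter the build output for references to the device runtime library. This is a small sketch under assumptions: the function name find_cudadevrt is invented here, and it simply searches whatever log text is piped in, for example the sub-commands that nvcc prints with --dryrun.

```shell
#!/bin/sh
# Hypothetical helper: print the numbered lines of a build log (stdin) that
# mention the device runtime library, or a short note if none do.
find_cudadevrt() {
    grep -i -n 'cudadevrt' || echo "no cudadevrt reference found"
}

# Example usage (nvcc prints its dry-run sub-commands on stderr, hence 2>&1;
# the link options are elided here and must be filled in):
#   nvcc --dryrun <link options> 2>&1 | find_cudadevrt
```

Running this for both the DLTO and non-DLTO link steps should show whether cudadevrt is pulled in each time, even though it is not listed explicitly.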