Latest driver breaks fatbinaries using device link-time optimization

We have an application where we use device link-time optimization (DLTO). We generate a fatbinary containing PTX for the lowest arch (e.g. sm_52), and LTO and SASS for a number of explicit architectures (e.g. sm_52 and sm_61), using the following options:

Compile: -gencode=arch=compute_52,code=[compute_52,lto_52] -gencode=arch=compute_61,code=lto_61
Link: -dlto -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_61,code=sm_61

Previously, when running the application on a later GPU arch that wasn’t explicitly included in the fatbinary (e.g. sm_86), the driver (v516.59) would JIT compile/link the application and it would run fine. However, after upgrading to the v526.67 driver, the application fails with a “device kernel image is invalid” error.

Is this a bug in the latest NVIDIA GPU driver, or should we be using different compiler/linker options?

This can be reproduced using the above compiler/linker options with the following minimal example:

#include <iostream>
#include <cuda_runtime_api.h>

__device__ int d_result;
__global__ void kernel(const int n)
{
	d_result = n;
}

int main()
{
    const int n = rand();
    kernel<<<1, 1>>>(n);
    cudaError cuda_status = cudaGetLastError();
    if (cuda_status != cudaSuccess)
    {
        std::cout << "FAIL: " << cudaGetErrorString(cuda_status) << std::endl;
        return 1;
    }
    std::cout << "PASS";
    return 0;
}

This is not the point of the question asked, but I am wondering what the rationale for this is? Conventional wisdom is that in order to future-proof a fat binary, one wants to include SASS/LTO for any GPU architectures specifically supported by the application and PTX for the latest architecture supported by the tool chain to cover any future GPU architectures.

Interesting! I’m not sure if the docs have changed since I last looked at them (years ago now ha), but they certainly seem to suggest the approach you outline. To avoid potentially derailing this thread, I have created a new topic here: Fatbinary best practices

The suggested approach would certainly fix the issue in this case. However, it still looks like a bug or otherwise weird behaviour in the new driver, given that the previous driver version was successfully JIT-compiling the compute_52 PTX, whereas the new driver is failing (either to compile the compute_52 PTX, or is erroneously trying to JIT the lto_61 NVVM IR).

I can see you have filed bug 3869117. I’m going to let that run its course. They are working on it.

1 Like

Yup, the NVIDIA driver team has now identified the issue and it will be fixed in the driver version that is released/supported by the upcoming CUDA Toolkit 12.0 release.