Latest driver breaks fatbinaries using device link-time optimization

phw89 · November 14, 2022, 4:39pm

We have an application that uses device link-time optimization (DLTO). We generate a fatbinary containing PTX for the lowest arch (e.g. compute_52), plus LTO IR and SASS for a number of explicit architectures (e.g. sm_52 and sm_61), using the following options:

Compile: -gencode=arch=compute_52,code=[compute_52,lto_52] -gencode=arch=compute_61,code=lto_61
Link: -dlto -gencode=arch=compute_52,code=sm_52 -gencode=arch=compute_61,code=sm_61

Previously, when running the application on a later GPU arch that wasn’t explicitly included in the fatbinary (e.g. sm_86), the driver (v516.59) would JIT compile/link the application and it would run fine. However, after upgrading to the v526.67 driver, the application fails with a “device kernel image is invalid” error.

Is this a bug in the latest NVIDIA GPU driver, or should we be using different compiler/linker options?

This can be reproduced using the above compiler/linker options with the following minimal example:

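#include <cstdlib>
#include <iostream>
#include <cuda_runtime_api.h>

// Device-side variable written by the kernel so the launch has an observable effect.
__device__ int d_result;

__global__ void kernel(const int n)
{
    d_result = n;
}

int main()
{
    // An arbitrary runtime value passed to the kernel.
    const int n = rand();
    kernel<<<1, 1>>>(n);

    // A launch failure (e.g. an invalid kernel image) is reported here.
    const cudaError_t cuda_status = cudaGetLastError();
    if (cuda_status != cudaSuccess)
    {
        std::cout << "FAIL: " << cudaGetErrorString(cuda_status) << std::endl;
        return 1;
    }
    std::cout << "PASS" << std::endl;
    return 0;
}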

njuffa · November 14, 2022, 8:41pm

This is not the point of the question asked, but I am wondering what the rationale is for including PTX only for the lowest architecture. Conventional wisdom is that, in order to future-proof a fat binary, one wants to include SASS/LTO for any GPU architectures specifically supported by the application, and PTX for the latest architecture supported by the tool chain to cover any future GPU architectures.

phw89 · November 15, 2022, 12:29pm

Interesting! I’m not sure if the docs have changed since I last looked at them (years ago now ha), but they certainly seem to suggest the approach you outline. To avoid potentially derailing this thread, I have created a new topic here: Fatbinary best practices

The suggested approach would certainly fix the issue in this case. However, it still looks like a bug or otherwise weird behaviour in the new driver: the previous driver version successfully JIT-compiled the compute_52 PTX, whereas the new driver fails, either because it cannot compile the compute_52 PTX or because it erroneously tries to JIT the lto_61 NVVM IR.
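
To make the expected behaviour concrete, here is a rough host-only C++ sketch of how I understand image selection to work. It is a simplified model under stated assumptions, not the driver's actual logic: `Image`, `describe_load` and `thread_fatbinary` are hypothetical names made up for illustration, and the rule that LTO IR entries are never loaded directly is an assumption of the sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Kinds of device code a fatbinary entry can hold.
enum class Kind { Sass, Ptx, LtoIr };

// One fatbinary entry; sm_52, compute_52 and lto_52 all have cc_major = 5, cc_minor = 2.
struct Image {
    Kind kind;
    int cc_major;
    int cc_minor;
};

// Hypothetical helper: models which entry a device of the given compute
// capability could use. Rules assumed by this sketch:
//   - SASS is usable only on a device with the same major version and an
//     equal or higher minor version.
//   - PTX can be JIT-compiled for any device of equal or higher compute capability.
//   - LTO IR is never loaded directly (assumption), so it is skipped.
std::string describe_load(const std::vector<Image>& images, int dev_major, int dev_minor) {
    assert(dev_major >= 0 && dev_minor >= 0 && dev_minor <= 9);

    // Prefer precompiled SASS; among candidates take the highest minor version.
    const Image* best_sass = nullptr;
    for (const Image& img : images) {
        if (img.kind != Kind::Sass) continue;
        if (img.cc_major != dev_major || img.cc_minor > dev_minor) continue;
        if (!best_sass || img.cc_minor > best_sass->cc_minor) best_sass = &img;
    }
    if (best_sass)
        return "SASS sm_" + std::to_string(best_sass->cc_major) + std::to_string(best_sass->cc_minor);

    // Otherwise fall back to the newest PTX that does not exceed the device.
    const int dev_cc = dev_major * 10 + dev_minor;
    const Image* best_ptx = nullptr;
    for (const Image& img : images) {
        if (img.kind != Kind::Ptx) continue;
        const int cc = img.cc_major * 10 + img.cc_minor;
        if (cc > dev_cc) continue;
        if (!best_ptx || cc > best_ptx->cc_major * 10 + best_ptx->cc_minor) best_ptx = &img;
    }
    if (best_ptx)
        return "JIT from PTX compute_" + std::to_string(best_ptx->cc_major) + std::to_string(best_ptx->cc_minor);

    return "no usable image";
}

// The fatbinary described in this thread: PTX for compute_52,
// plus LTO IR and SASS for 5.2 and 6.1.
std::vector<Image> thread_fatbinary() {
    return {
        {Kind::Ptx, 5, 2},
        {Kind::LtoIr, 5, 2}, {Kind::Sass, 5, 2},
        {Kind::LtoIr, 6, 1}, {Kind::Sass, 6, 1},
    };
}
```

Under this model, `describe_load(thread_fatbinary(), 8, 6)` falls through to the compute_52 PTX, which matches what the v516.59 driver appeared to do on sm_86, while a 6.1 device gets the sm_61 SASS directly. Adding a PTX entry for a newer architecture, as suggested above, would simply change which PTX the fallback picks.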

Robert_Crovella · November 17, 2022, 11:44pm

I can see you have filed bug 3869117. I’m going to let that run its course. They are working on it.

phw89 · November 22, 2022, 8:41am

Yup, the NVIDIA driver team has now identified the issue and it will be fixed in the driver version that is released/supported by the upcoming CUDA Toolkit 12.0 release.
