Troubleshooting uncommon error 98 "invalid device function"

I need your help in troubleshooting an issue, for which the internet speaks of unrelated problems. Unfortunately, I the issue only happens in a large codebase and small reproducers compile and run just fine.

So I’m getting CUDA error 98 “invalid device function” in CUDA 11.3.1 on Ubuntu 20.04 for a program configured with CMake 3.20.3 . The g++ is at version 9.3.0 . The CUDA cards on my laptop and on our server are both of compute capability 7.5 (Geforce RTX 2080 and NVIDIA Quadro RTX 8000 correspondingly). I pass command-line option -DCMAKE_CUDA_ARCHITECTURES="75" to cmake, and I see the compiler switch --generate-code=arch=compute_75,code=[compute_75,sm_75] in the nvcc command line for compiling .cu files. The software consists of an executable with .cpp files only, and multiple libraries with .cu files. The following options are used commonly in cmake:

set_target_properties(${target_name} PROPERTIES
  CUDA_SEPARABLE_COMPILATION ON
  CUDA_STANDARD 17
  POSITION_INDEPENDENT_CODE ON
  # CUDA_RESOLVE_DEVICE_SYMBOLS ON
)

I.e. I tried to force a device link for some libraries and the executable, but this didn’t help. Though currently our software does not really need to link __device__ functions across targets. For some targets, I have to turn on the CUDA_RESOLVE_DEVICE_SYMBOLS option because otherwise undefined references to CUDA stubs appear during linking.

I have roughly the following code in a library:

__global__ void TestKernel(const int a, const int b) {
  printf("%d", a+b);
}

void TestLaunch() {
  TestKernel<<<2, 2>>>(3, 4);
  cudaDeviceSynchronize();
  cudaGetLastError(); // The real code checks for the error
}

In the executable I have roughly the following code in the very beginning of main function:

cudaSetDevice(0);
TestLaunch();

When running this code on both machines I get the following output from cuda-memcheck (and another similar message about cudaGetLastError call):

========= Program hit cudaErrorInvalidDeviceFunction (error 98) due to "invalid device function" on CUDA API call to cudaLaunchKernel.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x355c53]
=========     Host Frame:/scratch/src/katana-enterprise/build/libdistgalois/libkatana_dist_galois.so [0x31b62f]
=========     Host Frame:/scratch/src/katana-enterprise/build/libdistgalois/libkatana_dist_galois.so [0x29a4d1]
=========     Host Frame:/scratch/src/katana-enterprise/build/libdistgalois/libkatana_dist_galois.so (_Z30__device_stub__Z10TestKerneliiii + 0x125) [0x2975b2]
=========     Host Frame:/scratch/src/katana-enterprise/build/libdistgalois/libkatana_dist_galois.so (_Z10TestKernelii + 0x21) [0x2975fb]
=========     Host Frame:/scratch/src/katana-enterprise/build/libdistgalois/libkatana_dist_galois.so (_Z10TestLaunchv + 0x104) [0x29677b]
=========     Host Frame:./pagerank-cli-dist [0x451bf]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf3) [0x270b3]
=========     Host Frame:./pagerank-cli-dist [0x4486e]
=========

cuobjdump -ptx libdistgalois/libkatana_dist_galois.so has the following lines in it:

Fatbin ptx code:
================
arch = sm_75
code version = [7,3]
producer = <unknown>
host = linux
compile_size = 64bit
compressed

.version 7.3
.target sm_75
.address_size 64

and then

.visible .entry _Z10TestKernelii(
.param .u32 _Z10TestKernelii_param_0,
.param .u32 _Z10TestKernelii_param_1
)
{
.local .align 8 .b8 __local_depot2[8];
.reg .b64 %SP;
.reg .b64 %SPL;
.reg .b32 %r<5>;
.reg .b64 %rd<5>;


mov.u64 %SPL, __local_depot2;
cvta.local.u64 %SP, %SPL;
ld.param.u32 %r1, [_Z10TestKernelii_param_0];
ld.param.u32 %r2, [_Z10TestKernelii_param_1];
add.u64 %rd1, %SP, 0;
add.u64 %rd2, %SPL, 0;
add.s32 %r3, %r2, %r1;
st.local.u32 [%rd2], %r3;
mov.u64 %rd3, $str;
cvta.global.u64 %rd4, %rd3;
{ 
	.reg .b32 temp_param_reg;
.param .b64 param0;
st.param.b64 [param0+0], %rd4;
.param .b64 param1;
st.param.b64 [param1+0], %rd1;
.param .b32 retval0;
call.uni (retval0), 
vprintf, 
(
param0, 
param1
);
ld.param.b32 %r4, [retval0+0];
} 
	ret;

}

And cuobjdump -sass libdistgalois/libkatana_dist_galois.so has the following lines in it:

Fatbin ptx code:
================
arch = sm_75
code version = [7,3]
producer = <unknown>
host = linux
compile_size = 64bit
compressed

Fatbin elf code:
================
arch = sm_75
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

And then

		Function : _Z10TestKernelii
	.headerflags    @"EF_CUDA_SM75 EF_CUDA_PTX_SM(EF_CUDA_SM75)"
        /*0000*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;   /* 0x00000a00ff017624 */
                                                                             /* 0x000fc800078e00ff */
        /*0010*/                   IMAD.MOV.U32 R0, RZ, RZ, c[0x0][0x164] ;  /* 0x00005900ff007624 */
                                                                             /* 0x000fe200078e00ff */
        /*0020*/                   IADD3 R1, R1, -0x8, RZ ;                  /* 0xfffffff801017810 */
                                                                             /* 0x000fe20007ffe0ff */
        /*0030*/                   UMOV UR4, 0x0 ;                           /* 0x0000000000047882 */
                                                                             /* 0x000fe40000000000 */
        /*0040*/                   UMOV UR5, 0x0 ;                           /* 0x0000000000057882 */
                                                                             /* 0x000fe20000000000 */
        /*0050*/                   IADD3 R0, R0, c[0x0][0x160], RZ ;         /* 0x0000580000007a10 */
                                                                             /* 0x000fe20007ffe0ff */
        /*0060*/                   IMAD.U32 R4, RZ, RZ, UR4 ;                /* 0x00000004ff047e24 */
                                                                             /* 0x000fe2000f8e00ff */
        /*0070*/                   IADD3 R6, P0, R1, c[0x0][0x20], RZ ;      /* 0x0000080001067a10 */
                                                                             /* 0x000fe20007f1e0ff */
        /*0080*/                   IMAD.U32 R5, RZ, RZ, UR5 ;                /* 0x00000005ff057e24 */
                                                                             /* 0x000fc8000f8e00ff */
        /*0090*/                   IMAD.X R7, RZ, RZ, c[0x0][0x24], P0 ;     /* 0x00000900ff077624 */
                                                                             /* 0x000fe200000e06ff */
        /*00a0*/                   STL [R1], R0 ;                            /* 0x0000000001007387 */
                                                                             /* 0x0001e80000100800 */
        /*00b0*/                   MOV R20, 0x0 ;                            /* 0x0000000000147802 */
                                                                             /* 0x000fe40000000f00 */
        /*00c0*/                   MOV R21, 0x0 ;                            /* 0x0000000000157802 */
                                                                             /* 0x000fcc0000000f00 */
        /*00d0*/                   CALL.ABS.NOINC 0x0 ;                      /* 0x0000000000007943 */
                                                                             /* 0x001fea0003c00000 */
        /*00e0*/                   EXIT ;                                    /* 0x000000000000794d */
                                                                             /* 0x000fea0003800000 */
        /*00f0*/                   BRA 0xf0;                                 /* 0xfffffff000007947 */
                                                                             /* 0x000fc0000383ffff */
		..........

Please, tell me what else I can provide to debug this issue. Unfortunately, the source code is proprietary.

EDIT1: I have to add that we have another library for which kernel launches work fine. It is just libdistgalois that is a problem.

EDIT2: I have just tried to place a kernel into the same executable module (i.e. not libdistgalois library), and it works fine. I created another CMake target - a clean executable that would call the kernel from libdistgalois - and the kernel launch fails with the same error.

SOLVED! Apparently, the issue goes away if I link cudart library to libdistgalois in CMake. It’s strange that CMake doesn’t link cudart by default. Now, can someone explain why such an issue happens if cudart is not linked to a library?