Send device pointer to static lib kernel: CUDA error 718?

I’m developing an executable that incorporates a static lib containing CUDA functions/kernels.

In my own application (not the static lib) I have a device function compiled in a .cu file. I use the following code to copy the device function pointer to the host (within the .cu file):

typedef void (*fn)(float *f);

__device__ void myFn(float *f){f[0] *= 10;}

__device__ fn fnPtr = myFn;

//

fn getFnPtr() {
fn hostFnPtr = nullptr;
auto cudaStatus = cudaMemcpyFromSymbol(&hostFnPtr, fnPtr, sizeof(fn), 0, cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) { return nullptr; } else{ return hostFnPtr; }
}

I call getFnPtr() in my own code, then pass the returned function pointer to the static lib using one of its host functions which takes the function pointer and later passes it to a kernel and calls the device function inside the kernel.

However, when I do this (pass the device pointer to the lib and have the lib execute it in a kernel), CUDA crashes with error 718.

I’m guessing this is due to some issue with my executable’s device function not accessible from the static lib’s CUDA context? Is there a way to make this work? I tried enabling rdc in both the static lib and my executable, but then I get error 700 when I repeat the same steps that previously generated error 718…

The nature of a device-link step is that you are only permitted to run it once on any particular section of your code. Thereafter, no further device-linking of that section is possible. So if you run it first on the static lib, and then later on the executable, the device-link step for the executable will not modify any of the device-linking done during the linking of the static lib. You can do multiple device link steps during the assembly of an executable, but each device-link step should govern a disjoint section of code (a section of code which only has calls or linkages to itself).

The solution is therefore to hold off on the device link step until the final phase. Here is an example, similar to what is in the nvcc manual:

$ cat t13.cu
#include <cstdio>
typedef void (*fn)(float *f);
__global__ void k1(fn a, float *b);
__global__ void k2();

__device__ void f1(float *f) {printf("f1: %f\n", (*f) * (*f)); }
__device__ void f2(float *f) {printf("f2: %f\n", (*f) + (*f)); }

__device__ fn f1ptr = f1;
__device__ fn f2ptr = f2;

int main(){

  float *b;
  cudaMallocManaged(&b, sizeof(float));
  *b = 0.3f;
  fn f1p, f2p;
  cudaMemcpyFromSymbol(&f1p, f1ptr, sizeof(fn));
  cudaMemcpyFromSymbol(&f2p, f2ptr, sizeof(fn));
  k1<<<1,1>>>(f1p, b);
  k1<<<1,1>>>(f2p, b);
  k2<<<1,1>>>();
  cudaDeviceSynchronize();
  }


$ cat t13a.cu
#include <cstdio>
typedef void (*fn)(float *f);
__global__ void k1(fn a, float *b){
  (*a)(b);
}
$ cat t13b.cu
#include <cstdio>
__global__ void k2(){
  printf("k2\n");
}
$ nvcc -arch=sm_75 -dc t13a.cu
$ nvcc -arch=sm_75 -dc t13b.cu
$ nvcc -arch=sm_75 -lib t13a.o t13b.o -o t13.a
$ nvcc -arch=sm_75 -rdc=true -o t13 t13.cu t13.a
$ compute-sanitizer ./t13
========= COMPUTE-SANITIZER
f1: 0.090000
f2: 0.600000
k2
========= ERROR SUMMARY: 0 errors
$
1 Like

Thank you for the explanation - I’m compiling everything in Windows using MSVS, rather than the command line. Is there a way to configure MSVS to perform the task you’re describing? Currently my static library spits out a .lib file after compilation, which my main project consumes. What should I adjust in my project settings so that device linking happens at the end?

I tried disabling device linking and keeping rdc on in my static lib, then enabling device link in my main executable - but then I get an “unresolved external symbol __cudaRegisterLinkedBinary_xxxxxxx” error when compiling my executable.