I’m looking for some advice. I have a regular host function that launches a simple CUDA kernel. When I build the source for an architecture newer than my card’s, the kernel fails to launch with error 209 (cudaErrorNoKernelImageForDevice). So far, so good (see the snippet below).
If I build that same file as a shared library, dlopen+dlsym the host function and execute it, the very same error code is printed after the unsuccessful call to the kernel, as expected:
Now, here’s the issue. If the main program calls cudaPeekAtLastError() right after the call to fn(), it returns 0 (cudaSuccess), as if the kernel launch had not failed. I’ve been scratching my head trying to understand what’s going on, but so far I haven’t found a reasonable explanation. Could someone shed some light here?
I’m attaching a Python script that reproduces the problem (assuming your card does not belong to the compute_80 family). Any piece of advice is very welcome.
Below is the CUDA file, which I compile into a shared library with nvcc -Xcompiler -fPIC -shared -gencode arch=compute_80,code=sm_80 cuda_error.cu -o libcuda_error.so. The architecture mismatch error is expected because my laptop has a GeForce MX250 (Pascal architecture, compute capability 6.1). For completeness, I’m running CUDA 11.4 on Linux.
And this is the main C file that loads the shared library and then sees no error when it peeks at the last one. I’m building it with nvcc cuda_error.c -o cuda_error -ldl:
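The listing itself is not reproduced here. As a rough sketch only, the dlopen+dlsym step that such a loader performs might look like the following. The helper name call_symbol and the void fn(void) signature are assumptions for illustration, not taken from the original file, and the CUDA calls are left out so that the sketch builds without the toolkit.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper (not from the original post): opens the shared library
 * at lib_path, looks up sym_name, and calls it as `void f(void)`, which is an
 * assumed signature. Returns 0 on success, -1 if dlopen fails, -2 if the
 * symbol cannot be resolved. */
int call_symbol(const char *lib_path, const char *sym_name)
{
    void *handle = dlopen(lib_path, RTLD_NOW | RTLD_LOCAL);
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return -1;
    }

    dlerror(); /* clear any stale error state before calling dlsym */

    /* POSIX idiom for storing a dlsym result in a function pointer. */
    void (*entry)(void) = NULL;
    *(void **)(&entry) = dlsym(handle, sym_name);

    const char *err = dlerror();
    if (err != NULL || entry == NULL) {
        fprintf(stderr, "dlsym: %s\n", err != NULL ? err : "symbol is NULL");
        dlclose(handle);
        return -2;
    }

    entry(); /* in the case described above, this would be fn() */

    /* The handle is deliberately left open so the library stays loaded
     * while the caller queries the error state afterwards. */
    return 0;
}
```

In a loader like the one described, main would call call_symbol("./libcuda_error.so", "fn") and then query cudaPeekAtLastError(); the program needs -ldl at link time, as in the build command above.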