Issue running CUDA kernel from a shared library

Hi, I am a fairly experienced C++ programmer who is new to CUDA. I am having an issue getting a kernel to execute when calling it from a class in a compiled shared library. I have tried for a while to get it to work, but have failed thus far. I have produced a minimal reproducible example below.

Directory structure is as follows:

cudatest
|-lib
|  |-inc
|  |   |- utilities.cuh
|  |   |- solver.cuh
|  |-src
|      |- solver.cu
|-test
     |- test.cu

I am building with CMake:

cudatest/CMakeLists.txt

FILE(REMOVE ${CMAKE_SOURCE_DIR}/CMakeCache.txt)

IF(${CMAKE_SOURCE_DIR} STREQUAL ${CMAKE_BINARY_DIR})
  MESSAGE(FATAL_ERROR "\n\nPlease build in a separate build directory. 
             Ensure that you remove the CMakeCache.txt and CMakeFiles/ from the top level directory")
ENDIF(${CMAKE_SOURCE_DIR} STREQUAL ${CMAKE_BINARY_DIR})


CMAKE_MINIMUM_REQUIRED(VERSION 3.22)

PROJECT(CUDATEST
    LANGUAGES CXX CUDA)
SET(CMAKE_CUDA_STANDARD 20)
SET(CMAKE_CUDA_STANDARD_REQUIRED ON)
SET(CMAKE_CXX_STANDARD 20)
SET(CMAKE_CXX_STANDARD_REQUIRED ON)

SET(EXECUTABLE_OUTPUT_PATH ${CUDATEST_BINARY_DIR}/bin)

add_subdirectory(lib)

INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR}/lib/inc)
ADD_EXECUTABLE(cudatest_ex test/test.cu)
TARGET_LINK_LIBRARIES(cudatest_ex cudatest_culib)
SET_TARGET_PROPERTIES(cudatest_ex PROPERTIES OUTPUT_NAME "cudatest.ex")
SET_TARGET_PROPERTIES(cudatest_culib PROPERTIES CUDA_ARCHITECTURES "75")

lib/CMakeLists.txt

SET(cudatest_culib_src src/solver.cu)
INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR}/inc)

ADD_LIBRARY(cudatest_culib SHARED ${cudatest_culib_src})
SET_TARGET_PROPERTIES(cudatest_culib PROPERTIES OUTPUT_NAME "cudatest")
SET_TARGET_PROPERTIES(cudatest_culib PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CUDATEST_BINARY_DIR}/lib)
SET_TARGET_PROPERTIES(cudatest_culib PROPERTIES CUDA_ARCHITECTURES "75")

Here is my source:

test/test.cu:

#include "solver.cuh"

int main(int argc, char *argv[])
{
    Solver slvr;

    bool use_external_lib = true;
    if(use_external_lib){
        slvr.solve();
    }
    else{
        Utilities::doNothing<<<1,1>>>();
        CUDA_CALL(cudaGetLastError());
        CUDA_CALL(cudaDeviceSynchronize());
    }
}

lib/src/solver.cu

#include "solver.cuh"

void Solver::solve(){

    Utilities::doNothing<<<1,1>>>();
    CUDA_CALL(cudaGetLastError());
    CUDA_CALL(cudaDeviceSynchronize());

}

lib/inc/solver.cuh

#ifndef solver_cuh
#define solver_cuh

#include "utilities.cuh"

class Solver{

public:

Solver(){};
~Solver(){};

void solve();


};


#endif

lib/inc/utilities.cuh

#ifndef utilities_cuh
#define utilities_cuh

#include <iostream>

#define CUDA_CALL(ans) { gAssert((ans), __FILE__, __LINE__); }
inline void gAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"Assert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}


namespace Utilities{

__global__ void doNothing(){
   int tidx = threadIdx.x + blockDim.x*blockIdx.x;
   printf("I, thread %d, am doing nothing!\n", tidx);
}

}


#endif

If, in test.cu, I change use_external_lib to false I get the appropriate output: “I, thread 0, am doing nothing!” If I set it to true I get absolutely nothing. nvprof shows that the kernel code doesn’t get executed in this case. Also, compute-sanitizer returns no errors.

I am sure that the issue is either a really dumb one, or has to do with my misunderstanding of how CUDA compiles/links/runs, but I am at a complete loss here as to how to get this to work. I appreciate any input.

ETA: Here is my nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:3B:00.0 Off |                  N/A |
| 33%   41C    P8             39W /  250W |     763MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:5E:00.0 Off |                  N/A |
| 30%   31C    P8              8W /  250W |       2MiB /  11264MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     22732      G   /usr/libexec/Xorg                             461MiB |
|    0   N/A  N/A     24732      G   /usr/bin/gnome-shell                          235MiB |
|    0   N/A  N/A     25695      G   /usr/bin/gnome-software                        30MiB |
+-----------------------------------------------------------------------------------------+

You have two compilation units. One is associated with your test.cu which is your main routine. The other is associated with solver.cu which is in the shared library.

Both of these compilation units include solver.cuh, which in turn includes utilities.cuh, which contains the definition of the kernel in the Utilities namespace.

So now you have two identical function definitions (definitions, not declarations) of Utilities::doNothing in your project. I generally think that is a bad idea, and it really has nothing to do with CUDA. I don’t think it’s a good idea in ordinary C++ either. Just my opinion.

I think conventional wisdom in project structure is to put function declarations in header files, and function definitions in source files, to avoid this. You’ve not done that (in my view, anyway), and so the inclusion of the header in multiple places has created this situation.

At link time, then, the linker must choose one of those definitions. It will likely make the same choice for both uses, and without relocatable device code with device linking, one of those uses will fail when you attempt to launch the kernel.

When people decide they must put a function definition in a header file and include it in various places, a common piece of advice is to mark the definition with the static keyword, to force the linker to choose a module-local entry point (i.e. to resolve the ambiguity) and/or to avoid a “multiple definition” link error (which is not seen here due to the separate library). That is one possible solution path here, according to my testing:

static __global__ void doNothing(){

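Applied to utilities.cuh in your example, the full modified definition would read:

static __global__ void doNothing(){
   int tidx = threadIdx.x + blockDim.x*blockIdx.x;
   printf("I, thread %d, am doing nothing!\n", tidx);
}

Each compilation unit that includes the header then gets its own private copy of the kernel; that is what static (internal linkage) means, so the linker no longer has to pick between two identical global symbols.
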
The other solution path would be to specify relocatable device code with device linking. I’m not a CMake expert, and CMake is not an NVIDIA product, so I don’t plan to give advice there, but this topic comes up from time to time, so you can find examples of what to do with a bit of searching. The problem in that case is that device linking in CUDA is not supported across a shared library interface, so sorting it out this way might be rather involved.

I would say the 3rd option is “don’t do that”. i.e. just don’t create:

  1. multiple definitions that are exactly the same
  2. a need for device-linking across a shared library interface.

That could require a variety of restructuring, depending on what you are trying to do. As a trivial example, you could create a namespace Utilities1 that has doNothing in it, and is only used (i.e. defined, and called) in the main compilation unit. Create another namespace Utilities2 that has doNothing in it (totally fine, since they are separate namespaces) and is only used (i.e. defined, and called) in the library compilation unit.
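
Concretely, one possible restructuring (a sketch only, keeping the names from your project) is to follow the declaration/definition split mentioned above: move the kernel definition out of utilities.cuh and into solver.cu, so the header carries only the class declaration and the CUDA_CALL machinery:

// lib/src/solver.cu -- the kernel definition now exists only in this compilation unit
#include "solver.cuh"

namespace Utilities{

__global__ void doNothing(){
   int tidx = threadIdx.x + blockDim.x*blockIdx.x;
   printf("I, thread %d, am doing nothing!\n", tidx);
}

}

void Solver::solve(){
    Utilities::doNothing<<<1,1>>>();
    CUDA_CALL(cudaGetLastError());
    CUDA_CALL(cudaDeviceSynchronize());
}

test.cu then launches the kernel only through Solver::solve(), so there is exactly one definition of the kernel and no device symbol needs to cross the shared library boundary.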

I haven’t used CMake to do my testing. But with the addition of the static keyword, I can compile your project and observe that it works correctly with trivial nvcc command lines:

nvcc -Xcompiler -fPIC -shared solver.cu -o libsolver.so
nvcc -o test test.cu -L. -lsolver

So if you decide the addition of static doesn’t “fix” the issue, there may be a problem with your CMake setup, which again, I won’t be able to help with. When people are having trouble with CMake I usually advise to first try and make sure you can sanely compile the project using ordinary nvcc command line invocations, and then look at CMake verbose output to spot deviations.