CMake: kernel fails to launch when binary and library are compiled separately

My question is whether compiling the binary and the library separately can cause the kernel launch to fail.

See the code:

// mykernel.cuh
#ifndef MYKERNEL_CUH   // avoid reserved double-underscore names in the guard
#define MYKERNEL_CUH

__global__
void printData();

#endif
// mykernel.cu
#include "mykernel.cuh"
#include <stdio.h>

__global__
void printData()
{
    if(threadIdx.x < 128)
    {
        printf("Datas: %d \n", threadIdx.x);
    }
}
// main.cu
#include "mykernel.cuh"
#include <stdio.h>

int main()
{
    printData<<<dim3(1,1,1),dim3(128,1,1)>>>();
    cudaDeviceSynchronize();
    return 0;
}

1 Compiled with CMake and make

Built with the following CMakeLists.txt:

cmake_minimum_required(VERSION 3.10 FATAL_ERROR)
project(mytest LANGUAGES CXX CUDA)

find_package(CUDA REQUIRED)

if(CMAKE_INSTALL_PREFIX_INITIALIZED_TO_DEFAULT)
  set(CMAKE_INSTALL_PREFIX "${CMAKE_SOURCE_DIR}/mytestinstall" CACHE PATH "This is default path" FORCE)
endif()

include_directories(${CUDA_INCLUDE_DIRS})

set(_TEST_cu_sources_   ./mykernel.cu) 

option(TESTSHARED "Separation of bin and lib" ON)

if(TESTSHARED)
  add_library(libtest SHARED ${_TEST_cu_sources_})
  set_target_properties(libtest PROPERTIES
        CUDA_SEPARABLE_COMPILATION ON
        LINKER_LANGUAGE "CUDA"
        )

  add_executable(testbin ./main.cu)
  target_link_libraries(testbin PUBLIC libtest)

  install(TARGETS libtest DESTINATION ${CMAKE_INSTALL_PREFIX})
else()
  add_executable(testbin ./main.cu ${_TEST_cu_sources_})
endif()

set_target_properties(testbin PROPERTIES
      CUDA_SEPARABLE_COMPILATION ON
      INSTALL_RPATH "$ORIGIN"
      )

install(TARGETS testbin DESTINATION ${CMAKE_INSTALL_PREFIX})

When TESTSHARED is ON, main.cu and the library libtest are compiled separately; when it is OFF, everything is built into a single executable.
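
Configured and built with the usual out-of-source steps, e.g.:

mkdir build && cd build
cmake -DTESTSHARED=ON ..
make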

Output:

TESTSHARED=ON: no output; the kernel failed to run.

TESTSHARED=OFF: output as expected; the kernel worked well.

OS: Ubuntu 20.04, CUDA 12.2, compute capability sm_75.

2 Compiled manually

Compile the translation units with nvcc using -c and -dlink.

I am unsure whether -dc should be used instead of -c here; I tried many times, and both seem to work.

Then link everything with g++, for both the shared library and the executable.

nvcc -c -Xcompiler -fPIC mykernel.cu
nvcc -dlink -o dlinka.o -rdc=true -Xcompiler -fPIC mykernel.o
g++ -shared -fPIC -o libmykernel.so mykernel.o dlinka.o
nvcc -c -Xcompiler -fPIC main.cu
nvcc -dlink -o dlinkb.o -rdc=true -Xcompiler -fPIC main.o
g++ -o test main.o dlinkb.o libmykernel.so -L/usr/local/cuda/lib64 -lcudart -Wl,-rpath,.

Output: Method #2 works well.
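
For reference, the equivalent sequence with -dc (true relocatable device code compilation, which is what I believe I also tried) would be:

nvcc -dc -Xcompiler -fPIC mykernel.cu
nvcc -dlink -o dlinka.o -Xcompiler -fPIC mykernel.o
g++ -shared -fPIC -o libmykernel.so mykernel.o dlinka.o
nvcc -dc -Xcompiler -fPIC main.cu
nvcc -dlink -o dlinkb.o -Xcompiler -fPIC main.o
g++ -o test main.o dlinkb.o libmykernel.so -L/usr/local/cuda/lib64 -lcudart -Wl,-rpath,.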

3 Encapsulated the contents of main

Move the body of main into a function named execute in mykernel.cu:

// added the execute declaration to mykernel.cuh
void execute();
// added the execute definition to mykernel.cu
void execute()
{
    printData<<<dim3(1,1,1),dim3(128,1,1)>>>();
    cudaDeviceSynchronize();
}

// now main function is
int main()
{
    execute();
    return 0;
}

Output: this method works fine.

From the outputs above, I suspect that my CMake setup does not give the .cu executable an appropriate -dlink step or something similar (only the library gets one), while the manual compilation handles it correctly.

The CUDA symbols defined directly in main.cu may not be exported and linked between the host and device sides, so the kernel fails to launch due to a wrong runtime configuration.

But this is just a guess.
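
One way to probe this guess (my own idea, using cuobjdump from the CUDA toolkit; note the target libtest produces liblibtest.so) is to list the device ELF images embedded in each artifact:

cuobjdump -lelf ./testbin         # device code embedded in the executable
cuobjdump -lelf ./liblibtest.so   # device code embedded in the shared library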

I would appreciate any corrections.

My suggestion would be to start by using proper CUDA error checking. Not sure what that is? Google “proper CUDA error checking”, take the first hit, and apply it to your code. Report back any errors that are then reported when you compile and run your code that way.

Thank you for your suggestion.

I copied the error check from the blog post I found via Google and updated my code:

// new main.cu

#include "mykernel.cuh"
#include <stdio.h>
#include <iostream>


#define CHECK_LAST_CUDA_ERROR() checkLast(__FILE__, __LINE__)
void checkLast(const char* const file, const int line)
{
    cudaError_t err{cudaGetLastError()};
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << std::endl;
    }
}


int main()
{
    printData<<<dim3(1,1,1),dim3(128,1,1)>>>();
    cudaDeviceSynchronize();
    CHECK_LAST_CUDA_ERROR();
    return 0;
}
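
For thoroughness, the status returned by cudaDeviceSynchronize itself can be checked as well; a small sketch of the relevant lines in main (my own addition, not taken from the blog):

    printData<<<dim3(1,1,1),dim3(128,1,1)>>>();
    CHECK_LAST_CUDA_ERROR();                        // catches launch/configuration errors
    cudaError_t syncErr = cudaDeviceSynchronize();  // surfaces errors from kernel execution
    if (syncErr != cudaSuccess)
    {
        std::cerr << cudaGetErrorString(syncErr) << std::endl;
    }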

No CUDA runtime error is reported.

Also, I ran the binary under compute-sanitizer, and it reported 0 errors.

Same with cuda-gdb: no runtime error was captured.
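
For reference, the invocations were along these lines:

compute-sanitizer ./testbin
cuda-gdb ./testbin    # then `run` at the (cuda-gdb) prompt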

What you have shown here doesn’t/shouldn’t require relocatable device code with device linking. There’s no particular problem in using it, of course, but if used properly it shouldn’t behave any differently. For example, I get the expected printout simply by doing the following with your initial code files as posted in your first post in this thread:

nvcc main.cu mykernel.cu -o test
./test
Datas: ...
(and additional printout)

In your explicit compilation commands, for proper use of relocatable device code with device linking, you must use the -dc option, not -c. But in your case, since the code doesn’t actually require rdc, it doesn’t matter that you are using -c there.

So the problem is in your cmake usage, which you already knew. I’m not an expert on cmake, but any time someone is having trouble with cmake, I would suggest using cmake verbose output (just google for that; it’s not unique or specific to CUDA) and comparing the verbose output to what you have already demonstrated as working with your explicit compilation commands.

Your explicit compilation commands are remarkably complex for such a simple test case, but that may reflect what you are actually trying to do. In any event, I would point out that rdc doesn’t work across a .so boundary: device code linking can only be done against a static library. However, as already stated, that shouldn’t be impacting the example here; I’ve already mentioned it doesn’t require device code linking.

Thank you for your detailed response.

The reason for introducing device linking (-dc) is that the full example (my actual workspace code) depends on some extern __constant__ memory.
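
For illustration, the pattern looks roughly like this (file and symbol names are made up):

// constants.cuh (hypothetical)
extern __constant__ float coeffs[16];   // declared once, shared by several .cu files

// constants.cu (hypothetical)
__constant__ float coeffs[16];          // defined exactly once

// user.cu (hypothetical): includes constants.cuh and reads coeffs in device code;
// this cross-translation-unit device reference is what requires -dc/-rdc=true
// plus a device link
__global__ void useCoeffs(float* out)
{
    out[threadIdx.x] = coeffs[threadIdx.x % 16];
}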

I am trying not to lose the generality of the code, but you are right: it should not affect the output of this tiny program.

Before this problem appeared, this CMake file worked well for my previous two CUDA projects (Python bindings to shared libraries, no executable involved).

But now I have to deal with some binary + shared-library code, so I am trying to dive into the details of this unexpected behavior.

Some documentation (official and community) mentions that device linking requires a static library, so I inspected the verbose output of the Makefile generated by this CMake file.

CMake/make used the same -fPIC, -c, and -dlink flags as in method 2. Although I can compile the code line by line in the shell following CMake’s verbose output, the binary produced by make fails to print anything, which is confusing.

Maybe the workaround is to put that code into a separate .cu source file and call it indirectly from main.cu, which method 3 verified to be effective.
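
Another option, following the static-library requirement mentioned above, would be a CMake variant along these lines (a sketch only, not yet verified on my side):

add_library(libtest STATIC ${_TEST_cu_sources_})
set_target_properties(libtest PROPERTIES
      CUDA_SEPARABLE_COMPILATION ON
      POSITION_INDEPENDENT_CODE ON)

add_executable(testbin ./main.cu)
set_target_properties(testbin PROPERTIES
      CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(testbin PUBLIC libtest)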

I will dig into CMake and CUDA in more detail later, after my current duty is done.

Thank you for your time!