Linking of CUDA kernels via CMake segfaults upon execution

Hello, I’ve been spending the day troubleshooting a way to add CUDA code to a TX2 host app, written in C++ and compiled with a CMakeLists.txt via catkin_make.

The compile process completes without error, but the program segfaults upon execution.
I was wondering if anyone on this forum could tell me whether my usage of NVCC is correct, or whether there is a bug in my setup.

Toy example:

cuda_kernel.cu:

#include "cuda_kernel.hpp"
__global__ void test_cuda(int *a, int b)
{
	*a += 1;
}

cuda_kernel.hpp:

#include <cuda.h>
#include <cuda_runtime.h>
__global__ void test_cuda(int *a, int b);

Calling function in master cpp (app_source.cpp):

int one = 1; 
test_cuda(&one, 1);

CMakeLists.txt:

cmake_minimum_required(VERSION 2.8.3)
cmake_policy(SET CMP0012 NEW)

project(test)

find_package(CUDA REQUIRED)
set(CUDA_NVCC_FLAGS 
	${CUDA_NVCC_FLAGS}; 
	-O3 -lineinfo
	-gencode=arch=compute_62,code=sm_62)		
set(CUDA_VERBOSE_BUILD ON CACHE BOOL "nvcc verbose" FORCE)
#set(CUDA_SEPARABLE_COMPILATION ON)	
#set(BUILD_SHARED_LIBS ON)
#set(CUDA_ATTACH_VS_BUILD_RULE_TO_CUDA_FILE OFF)

set (LIB_TYPE STATIC)
#set (LIB_TYPE SHARED)
	
CUDA_ADD_LIBRARY(cuda_functions ${LIB_TYPE} src/cuda_kernel.cu)
	
include_directories(... ${CUDA_INCLUDE_DIRS} ${catkin_INCLUDE_DIRS} ${Boost_INCLUDE_DIRS})
link_directories(... ${CUDA_CUFFT_LIBRARIES} ${CUDA_LIBRARIES})

set(CMAKE_CXX_FLAGS "-std=c++0x ${CMAKE_CXX_FLAGS} -march=native -lpthread -O3 -fopenmp")

add_executable(appname app_source.cpp)
add_dependencies(appname cuda_functions)

target_link_libraries(appname 		
	${catkin_LIBRARIES} 
	${CUDA_LIBRARIES} 
	${CUDA_CUFFT_LIBRARIES}
	cuda_functions)

I have tried compiling the CUDA code as both a static and a shared library, but both segfault whenever I call the test function. Any suggestions on how the compiled CUDA code could be better linked?

Hi,

  1. The GPU can’t access CPU memory directly.
    You need to allocate GPU-accessible memory via the CUDA API.

  2. Please use the CUDA launch syntax to launch your kernel,
    e.g. test_cuda<<<64, 64>>>(…);

Here is a simple example for your reference:
https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/
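
And here is a minimal sketch applying both fixes to your toy example (untested; run_test_cuda is a hypothetical wrapper name, introduced so that g++-compiled code never sees the <<<…>>> launch syntax):

// cuda_kernel.cu -- compiled by nvcc
#include <cuda_runtime.h>

__global__ void test_cuda(int *a, int b)
{
	*a += b;
}

// Plain C++ entry point; declare only this function in cuda_kernel.hpp
// so that g++ never sees __global__ or the launch syntax.
void run_test_cuda(int *host_value, int b)
{
	int *dev_a = nullptr;
	cudaMalloc(&dev_a, sizeof(int));                                    // GPU-accessible memory
	cudaMemcpy(dev_a, host_value, sizeof(int), cudaMemcpyHostToDevice);
	test_cuda<<<1, 1>>>(dev_a, b);                                      // CUDA launch syntax
	cudaMemcpy(host_value, dev_a, sizeof(int), cudaMemcpyDeviceToHost); // D2H copy also waits for the kernel
	cudaFree(dev_a);
}

app_source.cpp would then call run_test_cuda(&one, 1); instead of the kernel itself.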

Thanks for the tip.

Is that to say that custom kernels cannot access unified memory in the way that other CUDA libraries, such as cuFFT, can?
Since I am on Pascal, I was able to pass CUDA-enabled functions a pointer to system memory directly, and it worked as long as the memory was allocated with cudaMallocManaged; is this safely extensible to custom kernels as well?

Hi,

  1. Libraries such as cuFFT contain a C++ wrapper that handles the CUDA kernel launches internally.
    For your own CUDA kernel, please remember to launch it with test_cuda<<<64, 64>>>(…).

  2. Unified memory can be accessed by both the CPU and the GPU,
    but it must be allocated via cudaMallocManaged(…); see the sketch below.
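
For example, a minimal sketch of the toy kernel running on unified memory (untested; assumes a Pascal-class device such as the TX2 and CUDA 8 or newer):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void test_cuda(int *a, int b)
{
	*a += b;
}

int main()
{
	int *one = nullptr;
	cudaMallocManaged(&one, sizeof(int)); // visible to both CPU and GPU
	*one = 1;                             // CPU write

	test_cuda<<<1, 1>>>(one, 1);          // GPU read/write via the same pointer
	cudaDeviceSynchronize();              // wait before the CPU touches it again

	printf("%d\n", *one);                 // prints 2
	cudaFree(one);
	return 0;
}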