Dynamic Parallelism on TX1

I have written a simple program that uses dynamic parallelism and built it from the command line with "nvcc -arch=sm_35 -rdc=true hello_world.cu -o hello -lcudadevrt", which gives an nvlink error. What is the proper way to build it?

Looks like you're building for SM_35. The TX1's GPU is SM_53, so try -arch=sm_53 instead.
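
For reference, a sketch of the adjusted build command for hello_world.cu. Only the -arch value changes from the command in the question. The ARCH and BUILD_CMD variable names are just for illustration, and the command is executed only if nvcc and the source file are actually present:

```shell
# The TX1's GPU (GM20B) is compute capability 5.3, so target sm_53.
ARCH=sm_53

# Same command as in the question, with only the architecture changed.
# -rdc=true enables relocatable device code (required for dynamic parallelism);
# -lcudadevrt links the CUDA device runtime library.
BUILD_CMD="nvcc -arch=${ARCH} -rdc=true hello_world.cu -o hello -lcudadevrt"
echo "${BUILD_CMD}"

# Run the build only where the toolchain and the source file exist.
if command -v nvcc >/dev/null 2>&1 && [ -f hello_world.cu ]; then
    ${BUILD_CMD}
fi
```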

The same nvlink error occurs with a similar example from the CUDA toolkit samples (cdpSimplePrint) when it is built for the default set of architectures:

ubuntu@tegra-ubuntu:~/7.0.48/NVIDIA_CUDA-7.0_Samples/0_Simple/cdpSimplePrint$ make TARGET_ARCH=armv7l 
/usr/local/cuda-7.0/bin/nvcc -ccbin g++ -I../../common/inc  -m32    -dc -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53 -o cdpSimplePrint.o -c cdpSimplePrint.cu 
/usr/local/cuda-7.0/bin/nvcc -ccbin g++   -m32      -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53 -o cdpSimplePrint cdpSimplePrint.o  -lcudadevrt 
nvlink error   : Undefined reference to 'cudaGetParameterBufferV2' in 'cdpSimplePrint.o' 
nvlink error   : Undefined reference to 'cudaLaunchDeviceV2' in 'cdpSimplePrint.o' 
make: *** [cdpSimplePrint] Error 255

But it then works on the TX1 when building for SM_53 only:

ubuntu@tegra-ubuntu:~/7.0.48/NVIDIA_CUDA-7.0_Samples/0_Simple/cdpSimplePrint$ make TARGET_ARCH=armv7l SMS=53 
/usr/local/cuda-7.0/bin/nvcc -ccbin g++   -m32      -gencode arch=compute_53,code=sm_53 -gencode arch=compute_53,code=compute_53 -o cdpSimplePrint cdpSimplePrint.o  -lcudadevrt
ubuntu@tegra-ubuntu:~/7.0.48/NVIDIA_CUDA-7.0_Samples/0_Simple/cdpSimplePrint$ ./cdpSimplePrint 
starting Simple Print (CUDA Dynamic Parallelism) 
Running on GPU 0 (GM20B) 
*************************************************************************** 
The CPU launches 2 blocks of 2 threads each. On the device each thread will 
launch 2 blocks of 2 threads each. The GPU we will do that recursively 
until it reaches max_depth=2 
In total 2+8=10 blocks are launched!!! (8 from the GPU) 
*************************************************************************** 
Launching cdp_kernel() with CUDA Dynamic Parallelism: 
BLOCK 0 launched by the host 
BLOCK 1 launched by the host 
|  BLOCK 2 launched by thread 0 of block 0 
|  BLOCK 3 launched by thread 0 of block 1 
|  BLOCK 4 launched by thread 0 of block 0 
|  BLOCK 5 launched by thread 0 of block 1 
|  BLOCK 7 launched by thread 1 of block 0 
|  BLOCK 8 launched by thread 1 of block 1 
|  BLOCK 6 launched by thread 1 of block 0 
|  BLOCK 9 launched by thread 1 of block 1

==========================

I am trying to build my application using CMake.

My CMakeLists.txt:

#Include the folders containing OPENCV
include_directories(/usr/include/)
#set(CMAKE_CXX_FLAGS "-g -O3")
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS}; "-arch=sm_53; -rdc=true; -lcudadevrt" )

set(PROJECT_LINK_LIBS GL GLU X11 glut GLEW opencv_core opencv_imgproc opencv_video opencv_features2d opencv_calib3d opencv_objdetect opencv_flann opencv_stitching )
set(CUDA_VERBOSE_BUILD ON)
set(CUDA_SEPARABLE_COMPILATION ON)
set(CUDA_PROPAGATE_HOST_FLAGS OFF)
set(src kernel.cu
new.cpp
)

cuda_add_executable(out ${src} OPTIONS -gencode arch=compute_53,code=sm_53)
target_link_libraries(out ${PROJECT_LINK_LIBS} ${CUDA_LIBRARIES})

When I use this CMake file to build my application, I get a linking error:

link.stub:(.text+0x11c): undefined reference to `__fatbinwrap_54_tmpxft_00007a07_00000000_7_cuda_device_runtime_cpp1_ii_8b1a5d37'
link.stub:(.text+0x120): undefined reference to `__fatbinwrap_54_tmpxft_00007a07_00000000_7_cuda_device_runtime_cpp1_ii_8b1a5d37'
collect2: error: ld returned 1 exit status
make[2]: *** [out] Error 1
make[1]: *** [CMakeFiles/out.dir/all] Error 2
make: *** [all] Error 2

Kindly help.

Hi soni,

Sorry for the late reply.

The CMakeLists.txt posted in the forum isn’t complete and the generated nvcc commands are not posted.
It’s hard to tell without a complete repro case.
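
In the meantime, it may help to compare the verbose CMake output against a hand-written separable-compilation build. The sketch below shows how such a build is commonly laid out; it is not a diagnosis of your project. The file names kernel.cu and new.cpp come from your CMakeLists.txt. The toolkit path is taken from the sample build log above, the lib directory is an assumption, and the OpenCV/GL libraries are left out for brevity. The run helper is hypothetical: it only prints each command unless DRY_RUN=0 is set.

```shell
# Assumptions: toolkit location and library directory; adjust to your install.
CUDA_HOME=/usr/local/cuda-7.0
ARCH=sm_53   # TX1 GPU (GM20B)

# Hypothetical helper: print the command, execute it only when DRY_RUN=0.
run() {
    echo "$*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# 1. Compile the CUDA source to relocatable device code
#    (-dc is shorthand for -rdc=true -c).
run "${CUDA_HOME}/bin/nvcc" -arch="${ARCH}" -dc kernel.cu -o kernel.o

# 2. Device-link step: resolve device-runtime symbols against cudadevrt.
run "${CUDA_HOME}/bin/nvcc" -arch="${ARCH}" -dlink kernel.o -o kernel_dlink.o -lcudadevrt

# 3. Compile the host source, then do the final host link with both
#    object files plus cudadevrt and cudart.
run g++ -c new.cpp -o new.o
run g++ new.o kernel.o kernel_dlink.o -o out -L"${CUDA_HOME}/lib" -lcudadevrt -lcudart
```

The undefined `__fatbinwrap_..._cuda_device_runtime` symbol has the device runtime in its name, which suggests checking whether cudadevrt actually reaches the final host link line (step 3) in the generated commands, and not only the nvcc flags.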

Thanks