Dynamic parallelism with CUDA driver API

dc23 · January 4, 2014, 1:41am

Does dynamic parallelism even work with the CUDA driver API?

The examples I’ve seen have all the code (CPU and device) in a .cu file, and are compiled and linked straight to an executable, never creating a PTX file.

Can dynamic parallelism work when the device code containing parent and child kernels is compiled to PTX and then linked?

seibert · January 5, 2014, 7:55pm

It seems like it should, since dynamic parallelism just changes the allowed syntax inside __device__ and __global__ functions, which nvcc compiles the same way for both the driver and runtime API. You should try it and see.

dc23 · January 7, 2014, 1:57am

Thanks for the response, seibert.

Conclusion so far: whenever a parent kernel contains a call to a child kernel and the code is compiled to a .ptx file, loading that PTX at runtime fails with CUDA_ERROR_NO_BINARY_FOR_GPU.

As a sanity check, simply removing the call to the child kernel and recompiling produces a PTX file that loads and works fine.

My system has a GeForce TITAN (compute capability 3.5), and the simple dynamic parallelism examples do work correctly on this system.

SO, EVERYONE, STILL WAITING TO SEE AN EXAMPLE OF DYNAMIC PARALLELISM WORKING VIA A PTX FILE. I believe the solution to this will be of interest to a lot of people.

JaredHoberock · January 7, 2014, 5:36am

I’ve never attempted this before, and don’t know whether it’s possible or not, but did you remember to register the child kernel with the driver API? Unlike a driver API-only program, an nvcc-compiled .cu program should do that automatically for you.

dc23 · January 8, 2014, 12:05am

I have dynamic parallelism working on my GeForce TITAN system now.

This solution is actually based on the “cppIntegration” simple toolkit example.

I moved the host routines that launch the parent kernels into my .cu file. Those host routines are called from my host code in .cpp files.

nvcc is run with “-compile” instead of “-ptx”. The .cu.obj is “device linked” with “nvcc -dlink” into a .device-link.obj, and the linker then creates the final .exe.

I am able to put in calls to child kernels from my parent kernels and it builds and works correctly.
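
For readers trying to reproduce this, the layout described above might look roughly like the sketch below. It is not dc23’s actual code: the file name kernels.cu, the kernel names parent and child, the host wrapper run_parent, and the flags in the comments (-rdc=true, -lcudadevrt, -arch=sm_35) are assumptions based on how relocatable device code is normally built with nvcc.

```cuda
// kernels.cu: hypothetical file name; every identifier below is illustrative.
//
// Assumed build steps (Linux-style names; on Windows the objects would be .obj):
//   nvcc -arch=sm_35 -rdc=true -c kernels.cu -o kernels.o
//   nvcc -arch=sm_35 -dlink kernels.o -o kernels_dlink.o -lcudadevrt
//   g++ main.cpp kernels.o kernels_dlink.o -L<CUDA lib dir> -lcudadevrt -lcudart
#include <cuda_runtime.h>

// Child kernel: each thread writes its own global index.
__global__ void child(int *out, int base)
{
    out[base + threadIdx.x] = base + threadIdx.x;
}

// Parent kernel: each parent thread launches one child grid
// (device-side launches need compute capability 3.5 or higher).
__global__ void parent(int *out, int childThreads)
{
    int base = threadIdx.x * childThreads;
    child<<<1, childThreads>>>(out, base);
}

// Host routine kept in the .cu file so that .cpp code can call it
// without including any CUDA headers. Returns 0 on success.
extern "C" int run_parent(int *hostOut, int parentThreads, int childThreads)
{
    int *devOut = NULL;
    size_t bytes = sizeof(int) * parentThreads * childThreads;
    cudaError_t err = cudaMalloc((void **)&devOut, bytes);
    if (err != cudaSuccess) return (int)err;

    parent<<<1, parentThreads>>>(devOut, childThreads);
    err = cudaGetLastError();

    // The parent grid is not complete until all of its child grids finish,
    // so this blocking copy sees the children's writes.
    if (err == cudaSuccess)
        err = cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);

    cudaFree(devOut);
    return (int)err;
}
```

A .cpp file would only need the declaration extern "C" int run_parent(int *, int, int); to call this, which matches the split between .cpp host code and .cu launch routines described above.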

gyawai · January 24, 2014, 9:02am

I’m facing exactly the same problem. According to #5, you seem to have
found a solution with the runtime API and given up on the driver API.

In my case, however, the driver API is mandatory. My code serves as
a kernel launcher, which executes arbitrary kernels on the fly. The
kernels cannot be linked at build time since they do not exist at
that moment. Kernels must be provided as PTX (or any other) images,
to be loaded by cuModuleLoad() or cuModuleLoadData().

Any good ideas?
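
One approach that may fit this constraint, offered as a minimal sketch and not as a confirmed resolution from this thread: the driver API’s JIT linker (cuLinkCreate, cuLinkAddData, cuLinkAddFile, cuLinkComplete) can link a PTX image against the device runtime library at run time, and the resulting image can then be handed to cuModuleLoadData(). The helper name load_dp_module, the label "kernels.ptx", and the devrtPath argument (assumed to point at the toolkit’s libcudadevrt.a, or cudadevrt.lib on Windows) are hypothetical. The sketch also assumes the PTX was generated with relocatable device code enabled, for example nvcc -ptx -rdc=true -arch=sm_35.

```cuda
#include <cuda.h>
#include <cstddef>
#include <cstring>

// Hypothetical helper: links a NUL-terminated PTX string against the CUDA
// device runtime library and loads the linked image as a module.
// Assumes cuInit() has been called and a context is current on this thread.
static CUresult load_dp_module(CUmodule *module, const char *ptx,
                               const char *devrtPath)
{
    CUlinkState link;
    CUresult rc = cuLinkCreate(0, NULL, NULL, &link);
    if (rc != CUDA_SUCCESS) return rc;

    // Add the PTX that contains the parent and child kernels.
    rc = cuLinkAddData(link, CU_JIT_INPUT_PTX, (void *)ptx, strlen(ptx) + 1,
                       "kernels.ptx", 0, NULL, NULL);

    // Add the device runtime library that device-side launches depend on.
    if (rc == CUDA_SUCCESS)
        rc = cuLinkAddFile(link, CU_JIT_INPUT_LIBRARY, devrtPath,
                           0, NULL, NULL);

    void *image = NULL;
    size_t imageSize = 0;
    if (rc == CUDA_SUCCESS)
        rc = cuLinkComplete(link, &image, &imageSize);

    // The linked image is owned by the link state, so the module must be
    // loaded before cuLinkDestroy() is called.
    if (rc == CUDA_SUCCESS)
        rc = cuModuleLoadData(module, image);

    cuLinkDestroy(link);
    return rc;
}
```

After a successful call, the parent kernel would be looked up with cuModuleGetFunction() and launched with cuLaunchKernel() as usual; the child kernel is launched from the device and needs no separate host-side lookup.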

RianFlo · January 7, 2015, 11:37am

Any resolution on dynamic parallelism with driver API?

Thanks.
