Dynamic parallelism with the CUDA driver API

Does dynamic parallelism even work with the CUDA driver API?

The examples I’ve seen have all the code (host and device) in a .cu file, and are compiled and linked straight to an executable, never creating a PTX file.

Can dynamic parallelism work when the device code containing parent and child kernels is compiled to PTX and then linked?

It seems like it should, since dynamic parallelism just changes the allowed syntax inside __device__ and __global__ functions, and nvcc compiles those the same way for both the driver and runtime API. You should try it and see.

Thanks for the response, seibert.

Conclusion, so far: any time I include a call to a child kernel from a parent kernel and then compile to a .ptx file, loading that PTX at runtime fails with CUDA_ERROR_NO_BINARY_FOR_GPU.

As a sanity check, simply removing the call to the child kernel and recompiling creates a PTX file that loads and works fine.

My system has a GeForce TITAN (compute capability 3.5), and the simple dynamic parallelism examples do work correctly on this system.

So, everyone, I am still waiting to see an example of dynamic parallelism working via a PTX file. I believe the solution to this will be of interest to a lot of people.

I’ve never attempted this before, and don’t know whether it’s possible or not, but did you remember to register the child kernel with the driver API? Unlike a driver API-only program, an nvcc-compiled .cu program should do that automatically for you.

I have dynamic parallelism working on my GeForce TITAN system now.

This solution is actually based on the “cppIntegration” simple toolkit example.

I moved the host routines that launch the parent kernels into my .cu file. Those host routines are called from my host code in .cpp files.

nvcc is run with “-compile” instead of “-ptx”. The resulting .cu.obj is then “device linked” with “nvcc -dlink” into a .device-link.obj, and the host linker combines both object files into the final .exe.

I am able to put in calls to child kernels from my parent kernels and it builds and works correctly.
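For anyone trying to reproduce this, here is a minimal sketch of what such a .cu file might look like. The file name, kernel names, wrapper name and launch sizes are all made up for illustration, and the exact flags in the comments (in particular -rdc=true and -lcudadevrt) are my assumptions about what the build needs, not something taken from the post above.

```cuda
// kernels.cu (hypothetical file name)
//
// Assumed build steps, adjust -arch to your GPU:
//   nvcc -arch=sm_35 -rdc=true -c kernels.cu -o kernels.o
//   nvcc -arch=sm_35 -dlink kernels.o -o kernels_dlink.o -lcudadevrt
// The host linker then links kernels.o, kernels_dlink.o and the .cpp
// objects against cudadevrt and cudart.
#include <cuda_runtime.h>

// Child kernel: doubles each element of the array.
__global__ void childKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2;
    }
}

// Parent kernel: a single thread launches the child grid from the device.
__global__ void parentKernel(int *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(data, n);
    }
}

// Host wrapper with C linkage so that plain .cpp code can call it
// without seeing any CUDA launch syntax.
extern "C" cudaError_t launchParent(int *d_data, int n)
{
    parentKernel<<<1, 32>>>(d_data, n);
    cudaError_t err = cudaGetLastError();   // catches launch failures
    if (err != cudaSuccess) {
        return err;
    }
    return cudaDeviceSynchronize();         // waits for parent and child grids
}
```

The .cpp side would only need a declaration such as extern "C" cudaError_t launchParent(int *, int); plus the usual cudaMalloc/cudaMemcpy calls around it.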

I’m facing exactly the same problem. According to #5, you seem to have found a
solution with the runtime API, and gave up on using the driver API.

In my case, however, the driver API is mandatory. My code serves as
a kernel launcher, which executes arbitrary kernels on the fly. The
kernels cannot be linked at build time since they do not exist at
that point. Kernels must be provided as PTX (or some other) images,
to be loaded by cuModuleLoad() or cuModuleLoadData().

Any good ideas?

Any resolution on dynamic parallelism with driver API?

Thanks.
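One approach that nobody in this thread has confirmed, but that the driver API does provide calls for, is to JIT-link the PTX against the device runtime library (libcudadevrt.a on Linux, cudadevrt.lib on Windows) with cuLinkCreate / cuLinkAddFile / cuLinkComplete, and load the linked image instead of the raw PTX. The sketch below assumes the PTX was generated with relocatable device code (something like nvcc -arch=sm_35 -rdc=true -ptx kernels.cu) and that the parent kernel is declared extern "C" so its name is not mangled. The file paths and the kernel name come from the command line and are placeholders.

```cuda
// loader.cpp (hypothetical name); builds as plain host code, e.g.
//   nvcc loader.cpp -o loader -lcuda
// Usage: loader <kernels.ptx> <path to libcudadevrt.a> <kernel name>
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a driver API call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        CUresult r_ = (call);                                         \
        if (r_ != CUDA_SUCCESS) {                                     \
            const char *msg_ = nullptr;                               \
            cuGetErrorString(r_, &msg_);                              \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    msg_ ? msg_ : "unknown error");                   \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main(int argc, char **argv)
{
    if (argc < 4) {
        fprintf(stderr, "usage: %s <ptx> <cudadevrt lib> <kernel>\n", argv[0]);
        return EXIT_FAILURE;
    }
    const char *ptxPath = argv[1];
    const char *devrtPath = argv[2];
    const char *kernelName = argv[3];

    CHECK(cuInit(0));
    CUdevice dev;
    CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx;
    CHECK(cuDevicePrimaryCtxRetain(&ctx, dev));
    CHECK(cuCtxSetCurrent(ctx));

    // Link the PTX together with the device runtime library, which is
    // where the device-side launch support lives.
    CUlinkState link;
    CHECK(cuLinkCreate(0, nullptr, nullptr, &link));
    CHECK(cuLinkAddFile(link, CU_JIT_INPUT_PTX, ptxPath, 0, nullptr, nullptr));
    CHECK(cuLinkAddFile(link, CU_JIT_INPUT_LIBRARY, devrtPath, 0, nullptr, nullptr));

    void *cubin = nullptr;
    size_t cubinSize = 0;
    CHECK(cuLinkComplete(link, &cubin, &cubinSize));

    // The linked image is owned by the link state, so load it before
    // destroying the link state.
    CUmodule mod;
    CHECK(cuModuleLoadData(&mod, cubin));
    CHECK(cuLinkDestroy(link));

    CUfunction fn;
    CHECK(cuModuleGetFunction(&fn, mod, kernelName));
    printf("linked %zu bytes, found kernel '%s'\n", cubinSize, kernelName);

    // From here the parent kernel is launched with cuLaunchKernel as usual.

    CHECK(cuModuleUnload(mod));
    CHECK(cuDevicePrimaryCtxRelease(dev));
    return 0;
}
```

If this works on your setup, the launcher can keep accepting PTX images at run time; the only extra requirement is that the device runtime library is available on the machine where the linking happens.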