Calling a child kernel from a parent kernel doesn't work

I’m using a GeForce TITAN card with compute capability 3.5. I have CUDA Toolkit 5.0, and I am modifying the cudaDecodeGL sample program in Visual Studio 2010 Express. The sample is mostly .cpp files, containing the main program and all the decoder and OpenGL code, which I need. There is a single .cu file, used for the YUV to RGB color space conversion; that is where I have added some video processing kernels. This all works fine. The computer is an ASUS M51AC running Windows 8, with 16 GB of RAM and a 3.4 GHz Intel Core i7.

Now, I want to use dynamic parallelism, where parent kernel threads may conditionally launch child kernels if needed. I haven’t gotten it to work…

  1. First, I tried to duplicate the very simple examples that show launching a parent kernel from the main program using the <<<blocks,threads>>> syntax. This syntax is not recognized and won’t compile (this code is in the .cpp).

  2. So, working just in the .cu file, I added a child kernel (it does nothing) like this:
    __device__ void testkernel() {}

Then I called it from the parent kernel:
testkernel<<<1,1>>>();
This fails to compile, saying “cannot configure a device

  3. I then observed that all the examples of child kernels declare them as __global__ (even though they are on the device and called from the device). So I changed it to:
    __global__ void testkernel() {};

Now it compiles and creates a PTX file from the .cu file. But when I run it, the PTX fails to load: cuModuleLoadDataEx() returns the error CUDA_ERROR_NO_BINARY_FOR_GPU. As a sanity check, if I comment out the single line that launches the child kernel, it compiles fine, loads the PTX and runs normally.
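For reference, the parent/child launch pattern described in the steps above can be sketched as a small standalone .cu file. This is only a sketch under stated assumptions: it uses the CUDA runtime API rather than the driver-API PTX loading path that cudaDecodeGL uses, and the names childKernel, parentKernel, runParentAndCountChildren and the file name demo.cu are hypothetical. Dynamic parallelism needs a compute capability 3.5 target, relocatable device code, and the device runtime library, which is what the build line in the comment reflects. The host helper lives in the .cu file because the <<<...>>> launch syntax is only understood by nvcc, not by the host C++ compiler.

```cuda
// Build (assumption: nvcc from CUDA 5.0 or later is on the PATH):
//   nvcc -arch=sm_35 -rdc=true demo.cu -o demo -lcudadevrt
// A main() that calls runParentAndCountChildren() has to be supplied separately.
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: it must be __global__ to be launched with <<<...>>>,
// even when the launch happens on the device.
__global__ void childKernel(int *counter)
{
    atomicAdd(counter, 1);  // record that one child grid ran
}

// Parent kernel: every even-numbered thread launches one child grid.
__global__ void parentKernel(int *counter)
{
    if (threadIdx.x % 2 == 0) {
        childKernel<<<1, 1>>>(counter);
    }
}

// Hypothetical host helper, callable from a .cpp file through this
// extern "C" declaration. Returns the number of child grids that ran,
// or -1 if any CUDA call failed.
extern "C" int runParentAndCountChildren()
{
    int *dCounter = NULL;
    if (cudaMalloc((void **)&dCounter, sizeof(int)) != cudaSuccess) {
        return -1;
    }
    if (cudaMemset(dCounter, 0, sizeof(int)) != cudaSuccess) {
        cudaFree(dCounter);
        return -1;
    }

    // 1 block of 8 threads: threads 0, 2, 4 and 6 each launch a child.
    parentKernel<<<1, 8>>>(dCounter);

    // The parent grid is not complete until its children are, so this
    // host-side synchronize waits for all of them.
    cudaError_t err = cudaDeviceSynchronize();

    int hCounter = -1;
    if (err == cudaSuccess) {
        err = cudaMemcpy(&hCounter, dCounter, sizeof(int), cudaMemcpyDeviceToHost);
    } else {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }
    cudaFree(dCounter);
    return (err == cudaSuccess) ? hCounter : -1;
}
```

On a compute capability 3.5 device, the helper should return 4 (one child grid per even-numbered parent thread). Whether the same flags carry over to the PTX that the sample loads with cuModuleLoadDataEx() is not something this sketch demonstrates.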

Can someone see what I am doing wrong? Thanks in advance.