Dynamic parallelism child kernel monitoring

Hi All,

I am parallelizing an agglomerative clustering algorithm using dynamic parallelism. I am running and testing the program with a very small data size, but I am facing a problem that is difficult to debug: during program execution, child kernels terminate unexpectedly. Since the program compiles and runs fine for a few iterations, I suspect the cause is the number of child kernels that are active. I am also checking the number of blocks and grids that can be allocated, and the program shows that this number is currently being exceeded.

I was wondering if someone could point me in the right direction to go about debugging my problem.

Thank you,

To be more specific, I am getting the following error but do not know how to go about debugging the cause:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff5c0b700 (LWP 5150)]
[New Thread 0x7fffee3ff700 (LWP 5151)]
[Thread 0x7fffee3ff700 (LWP 5151) exited]
[Thread 0x7ffff7fd4780 (LWP 5132) exited]

I am not sure what is happening. I am basically using some variables to hold values computed in the kernels and used in subsequent ones. Please let me know if you need more information.

I don’t see anything that looks like an error in what you have posted. Perhaps you are seeing threads exiting and assuming that is an error. These are host threads, and host threads can spin up and down for various reasons. The CUDA runtime will often spawn threads which will exit upon application termination. That may be the explanation here.

In any event, I suggest rigorous and aggressive error checking, on parent kernel activity as well as child kernel activity. You can do API level error checking on child kernel launches the same way you do on parent kernel launches. You can also use cuda-memcheck to identify the type of error that occurred to cause unexpected kernel termination, and in some cases cuda-memcheck can localize the error down to a specific line of source code for you, using the -lineinfo compile switch.
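As a rough illustration of what error checking on both the parent and the child launch could look like, here is a minimal sketch. The kernel names, the `CUDA_CHECK` macro, and the `launchStatus` buffer are all made up for this example and are not taken from the original program; the only real API calls are `cudaGetLastError`, `cudaGetErrorString`, `cudaDeviceSynchronize`, and the usual memory functions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (an assumption, not from the original post):
// abort with file/line information on any host-side CUDA API error.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err_ = (call);                                 \
        if (err_ != cudaSuccess) {                                 \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                     \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

#define CHILD_THREADS 32  // illustrative child block size

// Hypothetical child kernel: each thread records which parent block launched it.
__global__ void child(int *out, int parentBlock) {
    out[parentBlock * blockDim.x + threadIdx.x] = parentBlock;
}

// Hypothetical parent kernel: one thread per block launches a child and
// checks the launch result with the device runtime API.
__global__ void parent(int *out, int *launchStatus) {
    if (threadIdx.x != 0) return;

    child<<<1, CHILD_THREADS>>>(out, blockIdx.x);

    // Same pattern as on the host: query the error state right after the launch.
    cudaError_t err = cudaGetLastError();
    launchStatus[blockIdx.x] = static_cast<int>(err);
    if (err != cudaSuccess) {
        printf("block %d: child launch failed: %s\n",
               blockIdx.x, cudaGetErrorString(err));
    }
}

int main() {
    const int numBlocks = 4;
    int *out = nullptr;
    int *launchStatus = nullptr;
    CUDA_CHECK(cudaMalloc(&out, numBlocks * CHILD_THREADS * sizeof(int)));
    CUDA_CHECK(cudaMalloc(&launchStatus, numBlocks * sizeof(int)));
    CUDA_CHECK(cudaMemset(launchStatus, 0, numBlocks * sizeof(int)));

    parent<<<numBlocks, 64>>>(out, launchStatus);
    CUDA_CHECK(cudaGetLastError());       // errors from the parent launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while parent/children ran

    // Copy back the per-block child launch status recorded on the device.
    int status[numBlocks];
    CUDA_CHECK(cudaMemcpy(status, launchStatus, sizeof(status),
                          cudaMemcpyDeviceToHost));
    for (int i = 0; i < numBlocks; ++i) {
        printf("parent block %d: child launch status %d (%s)\n", i, status[i],
               cudaGetErrorString(static_cast<cudaError_t>(status[i])));
    }

    CUDA_CHECK(cudaFree(out));
    CUDA_CHECK(cudaFree(launchStatus));
    return 0;
}
```

Dynamic parallelism needs relocatable device code and the device runtime library, so a build line would look something like `nvcc -arch=sm_XX -rdc=true -lineinfo cdp_check.cu -o cdp_check -lcudadevrt` (the file name is hypothetical, and `sm_XX` should match your GPU, compute capability 3.5 or higher). Running the result under `cuda-memcheck ./cdp_check` then applies the memcheck step described above.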

When the above methods have been exhausted, you can use in-kernel printf to do more deduction. As a last resort, you can use the debugger.

As a single example, there is a limit on the number of child kernels that can be launched (actually that can be currently executing), and if you exceed this limit, a specific error code will be returned from the launch. Are you checking for this? If you were, you wouldn’t need to wonder whether that is the source of the problem.
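Assuming the limit in question is the device runtime's pending launch count (`cudaLimitDevRuntimePendingLaunchCount`), a sketch of how to query it, raise it, and detect the corresponding error code (`cudaErrorLaunchPendingCountExceeded`) from inside the parent might look like the following. The kernel names, the `overflowCount` counter, and the value 4096 are illustrative assumptions only.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (an assumption): abort on any host-side CUDA API error.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical empty child kernel; only the launch itself matters here.
__global__ void child() {}

// Hypothetical parent: every thread launches a child and counts the launches
// that were rejected because too many launches were already pending.
__global__ void parent(int *overflowCount) {
    child<<<1, 1>>>();
    if (cudaGetLastError() == cudaErrorLaunchPendingCountExceeded) {
        atomicAdd(overflowCount, 1);
    }
}

int main() {
    // Query the current limit instead of assuming a value.
    size_t pending = 0;
    check(cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount),
          "cudaDeviceGetLimit");
    printf("pending launch limit: %zu\n", pending);

    // Raise the limit before launching the parent (4096 is an arbitrary example).
    check(cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096),
          "cudaDeviceSetLimit");

    int *overflowCount = nullptr;
    check(cudaMalloc(&overflowCount, sizeof(int)), "cudaMalloc");
    check(cudaMemset(overflowCount, 0, sizeof(int)), "cudaMemset");

    parent<<<8, 128>>>(overflowCount);  // 1024 child launches in total
    check(cudaGetLastError(), "parent launch");
    check(cudaDeviceSynchronize(), "parent execution");

    int rejected = 0;
    check(cudaMemcpy(&rejected, overflowCount, sizeof(int),
                     cudaMemcpyDeviceToHost), "cudaMemcpy");
    printf("child launches rejected for exceeding the pending limit: %d\n",
           rejected);

    check(cudaFree(overflowCount), "cudaFree");
    return 0;
}
```

If the rejected count is nonzero in your real program, that points to the pending launch limit as the cause; if it is zero, the limit can be ruled out and the other checks above become more relevant.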

If you’re unsure about CDP mechanics, I suggest you read the CDP (CUDA Dynamic Parallelism) section of the programming guide:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism