Currently active kernel on device...

My application has several different kernels that are called from the main program in a loop. A typical simulation runs for several days.

However, the program becomes unresponsive after a few hours. I can see that the process is still alive in the system, but it produces no output. The GPU is most likely still occupied by the process, but I can’t figure out how to check which of the kernels is keeping the device busy.

Is there a tool that lets me see the currently active kernel on the device? My card is a GeForce GTX 580 and the system is CentOS. It seems that nvidia-smi does not support diagnostics for this card.

I also tried checkpoint restarting right before the unresponsive phase, and the code does not crash at the same point twice. It is highly unpredictable, so a tool that lets me identify which kernel is causing the problem would be helpful.

Thanks,

You might consider attaching the CUDA debugger (cuda-gdb) to the unresponsive CUDA application: (docs here).
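
A rough sketch of the attach workflow, assuming the hung process’s PID is 12345 (substitute your own): cuda-gdb can attach to a running process the same way gdb does, and its info cuda kernels command lists the kernels currently resident on the device, which is exactly what you’re asking for.

```
# Find the PID of the hung process, then attach (12345 is a placeholder)
cuda-gdb -p 12345

# At the debugger prompt, list the kernels currently running on the device
(cuda-gdb) info cuda kernels
```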

If you simply want a log, then one approach might be to inject cudaStreamAddCallback()/cuStreamAddCallback() callbacks into your kernel-launching stream(s). At a minimum, the callback could log that the previous kernel completed and/or that the next kernel is about to be launched.

Just be aware that the callback blocks everything downstream in the stream until it completes, so it’s up to you to decide how lightweight the callback should be: an atomic increment on the host, say, versus a slow thread-safe printf().
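
Here’s a minimal sketch of that logging approach with the runtime API; the names myKernel and logKernelDone are placeholders for illustration, not anything from your code:

```cpp
// Minimal sketch: tag each kernel launch with a completion callback so the
// log shows how far the stream got before the hang.
#include <cstdio>
#include <cuda_runtime.h>

// Runs on a host thread once all prior work in the stream has finished.
// Keep it lightweight, and do not call CUDA API functions from inside it.
static void CUDART_CB logKernelDone(cudaStream_t stream, cudaError_t status,
                                    void *userData)
{
    fprintf(stderr, "kernel '%s' completed (status=%d)\n",
            static_cast<const char *>(userData), static_cast<int>(status));
}

__global__ void myKernel() { /* placeholder for one of your kernels */ }

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    myKernel<<<1, 1, 0, stream>>>();
    // Queue the callback right after the launch; it fires when myKernel
    // (and everything before it in the stream) has completed.
    cudaStreamAddCallback(stream, logKernelDone, (void *)"myKernel", 0);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```

If you add one of these after every launch in your main loop, then when the application goes unresponsive, the last kernel named in the log has finished, and the kernel launched after it in that stream is the likely culprit.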