My application has several different kernels that are called from the main program in a loop. A typical simulation runs for several days.
However, the program becomes unresponsive after a few hours. The process is still listed as active on the system, but it produces no output. The GPU is probably still occupied by the process, but I can't figure out how to check which of the kernels is keeping the device busy.
Is there a tool that shows which kernel is currently active on the device? My card is a GeForce 580, and the system runs CentOS. It seems that nvidia-smi does not support diagnostics for this card.
I also tried restarting from a checkpoint taken right before the unresponsive phase, and the code does not fail at the same point twice. The behavior is highly unpredictable, so a tool that lets me identify which kernel is causing the problem would be helpful.
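Failing such a tool, the only fallback I can think of is to instrument the launches myself: log the kernel name before each launch and synchronize right after it, so the last "launching" line without a matching "finished" line points at the kernel that never returned. Below is a minimal sketch of that idea, not my actual code. The kernel `stepA` and the macro `LAUNCH_LOGGED` are hypothetical names made up for illustration, and I realize the extra synchronization may slow the run or change the timing enough to hide the problem.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not from my real code): logs the kernel name,
// launches it, then blocks until the device is idle and checks for errors.
// The launch is passed as variadic arguments because the <<<...>>> syntax
// contains commas that would otherwise split the macro arguments.
#define LAUNCH_LOGGED(name, ...)                                          \
    do {                                                                  \
        fprintf(stderr, "launching %s\n", name);                          \
        fflush(stderr);                                                   \
        __VA_ARGS__;                                                      \
        cudaError_t err_ = cudaGetLastError();      /* launch errors */   \
        if (err_ == cudaSuccess)                                          \
            err_ = cudaDeviceSynchronize();         /* execution errors */\
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s failed: %s\n", name,                      \
                    cudaGetErrorString(err_));                            \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
        fprintf(stderr, "finished %s\n", name);                           \
        fflush(stderr);                                                   \
    } while (0)

// Placeholder kernel standing in for one of the real simulation kernels.
__global__ void stepA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_x = nullptr;
    if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) return EXIT_FAILURE;
    cudaMemset(d_x, 0, n * sizeof(float));

    // Stand-in for the main simulation loop.
    for (int iter = 0; iter < 3; ++iter) {
        LAUNCH_LOGGED("stepA", stepA<<<(n + 255) / 256, 256>>>(d_x, n));
    }

    cudaFree(d_x);
    return 0;
}
```

If the process hangs, the tail of the stderr log would end with a "launching" line for the kernel that is still running. I would still prefer a tool that can inspect the device directly, since this only works if the hang survives the added synchronization.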
Thanks,