breaking or interrupting GPU computation

jarjar · October 25, 2008, 4:32am

One bug I commit often is the infinite loop.

It mostly happens when I forget to update the loop variable or make
some mistake in calculating it.

When I have an infinite loop in my GPU kernel code, my whole
machine freezes and I have to restart the machine
and continue with the debugging.

Is there an elegant or better way to break or interrupt the GPU
computation?

FYI: My machine is running on WinXP with Visual Studio 2005 (CUDA 2.0),
and I have a GTX280 card.

Thanks…

_Big_Mac · October 25, 2008, 11:05am

Is this card your primary display adapter at the same time? In that case, shouldn’t a watchdog kick in after 5-10 seconds and kill this kernel?

jarjar · October 25, 2008, 11:30am

Yeah. The card is my primary display adapter…

The whole system just freezes, and I haven’t noticed the watchdog timer kicking in at any time…

_Big_Mac · October 25, 2008, 8:14pm

Now that you mention it, I remember launching huge kernels (for testing) twice: on the first try I received a nice failure message on the command line (something about a timeout, I believe), and on the second try I got a bluescreen. It seems to me that the watchdog can be hit and miss. I’ve also seen reports that a kernel failure can corrupt GPU memory to the point that you have to reboot your machine, even though in theory the driver should clean up in such cases.
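For reference, here is a minimal sketch of how the host side can check whether a run time limit applies to the card and whether a launch ended in a timeout. It assumes a CUDA toolkit newer than the 2.0 mentioned in this thread (the kernelExecTimeoutEnabled property and cudaDeviceSynchronize were added in later releases). The kernel busyKernel and the helper killedByWatchdog are made-up names for illustration, and the kernel is deliberately short so the example itself cannot hang.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel: a small, bounded amount of work.
__global__ void busyKernel(int *out) {
    out[threadIdx.x] = threadIdx.x * 2;
}

// Hypothetical helper: true if the status reports that the kernel
// exceeded the run time limit and was terminated.
bool killedByWatchdog(cudaError_t status) {
    return status == cudaErrorLaunchTimeout;
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no usable CUDA device\n");
        return 1;
    }
    // Nonzero when a run time limit is enforced on kernels for this device.
    printf("%s: kernel run time limit is %s\n", prop.name,
           prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");

    int *d_out = NULL;
    if (cudaMalloc(&d_out, 32 * sizeof(int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    busyKernel<<<1, 32>>>(d_out);

    // Launches are asynchronous, so a timeout is only reported once the
    // host waits for the kernel to finish.
    cudaError_t status = cudaDeviceSynchronize();
    if (killedByWatchdog(status)) {
        printf("kernel was stopped by the watchdog\n");
    } else if (status != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(status));
    } else {
        printf("kernel completed\n");
    }
    cudaFree(d_out);
    return status == cudaSuccess ? 0 : 1;
}
```

This only reports what happened after the fact; it does not make the watchdog any more reliable.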

Handling screwups on the device is not implemented very gracefully, I believe.

I can’t think of any elegant method for stopping infinite loops. Perhaps you could estimate the expected number of iterations and add a condition that stops the loop once it has run 10x as many? Something like

while (myConditions==true && iterations<10000)   // with iterations incremented inside the loop body

It’s not pretty, and it’s not always possible to make this estimate. I can’t come up with anything better, since our interactions with a running kernel are very limited (I/O is handled by the CPU).
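To make the idea concrete, here is a rough sketch of a kernel with such an iteration cap, plus a flag that tells the host whether the cap was what ended the loop. All names (guardedLoop, loopHitCap, MAX_ITERATIONS) and the halving loop itself are invented for illustration; the loop merely stands in for whatever condition your real kernel tests.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical cap: roughly 10x the number of iterations you expect.
#define MAX_ITERATIONS 10000

// Stand-in for a loop with a possibly buggy exit condition: x is halved
// until it equals target, which never happens for a negative target.
__global__ void guardedLoop(int start, int target, int *iterationsOut, int *hitCap) {
    int x = start;
    int iterations = 0;
    while (x != target && iterations < MAX_ITERATIONS) {
        x /= 2;
        ++iterations;
    }
    *iterationsOut = iterations;
    // 1 if the guard ended the loop, 0 if the real condition did.
    *hitCap = (x != target) ? 1 : 0;
}

// Hypothetical host helper: runs the kernel with a single thread and
// reports whether the loop had to be cut off by the cap.
bool loopHitCap(int start, int target, int *iterationsOut) {
    int *d_results = NULL;  // [0] = iteration count, [1] = cap flag
    int h_results[2] = {0, 0};
    cudaMalloc(&d_results, 2 * sizeof(int));
    guardedLoop<<<1, 1>>>(start, target, d_results, d_results + 1);
    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_results, d_results, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_results);
    *iterationsOut = h_results[0];
    return h_results[1] != 0;
}

int main() {
    int n = 0;
    bool capped = loopHitCap(1024, 0, &n);   // condition is eventually met
    printf("target 0:  %d iterations, cap hit: %s\n", n, capped ? "yes" : "no");
    capped = loopHitCap(1024, -1, &n);       // condition can never be met
    printf("target -1: %d iterations, cap hit: %s\n", n, capped ? "yes" : "no");
    return 0;
}
```

Checking the flag on the host turns a silent freeze into an ordinary, debuggable failure, at the cost of one extra comparison per iteration.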

Romant · October 26, 2008, 8:43am

The programmer has no way to stop a running kernel (for example, a “stop the kernel if it runs for too long” operation is not available). Build some triggers into your code (count iterations or something similar) to keep the kernel from running forever.
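As one possible reading of "or something", the trigger can also be a cycle budget instead of an iteration count. The sketch below assumes a device that supports clock64() (compute capability 2.0 or later, so newer than the GTX280 in this thread). The names spinWithBudget and spinTimedOut are hypothetical, as is the budget of one million cycles.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The loop waits for *flag to become nonzero. If that never happens, the
// kernel gives up once budgetCycles clock cycles have passed.
__global__ void spinWithBudget(volatile int *flag, long long budgetCycles, int *timedOut) {
    long long start = clock64();
    *timedOut = 0;
    while (*flag == 0) {
        if (clock64() - start > budgetCycles) {
            *timedOut = 1;  // tell the host that the trigger fired
            break;
        }
    }
}

// Hypothetical host helper: runs the kernel with the given flag value and
// returns true if the kernel bailed out because the budget was used up.
bool spinTimedOut(int flagValue, long long budgetCycles) {
    int *d_data = NULL;  // [0] = flag, [1] = timeout marker
    int h_data[2] = {flagValue, 0};
    cudaMalloc(&d_data, 2 * sizeof(int));
    cudaMemcpy(d_data, h_data, 2 * sizeof(int), cudaMemcpyHostToDevice);
    spinWithBudget<<<1, 1>>>(d_data, budgetCycles, d_data + 1);
    // The blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_data, d_data, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return h_data[1] != 0;
}

int main() {
    // Flag already set: the loop exits at once and no timeout is reported.
    printf("flag set:   timed out = %d\n", spinTimedOut(1, 1000000LL) ? 1 : 0);
    // Flag never set: the loop would spin forever without the budget.
    printf("flag unset: timed out = %d\n", spinTimedOut(0, 1000000LL) ? 1 : 0);
    return 0;
}
```

A cycle budget helps when the iteration count is hard to estimate, but the budget has to stay well below the watchdog limit to be of any use.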