Need a solution to the "kernel launch timeout" from NVIDIA

The 5-second limit on kernel run time is making CUDA development very inconvenient. :wacko:
Though some kinds of programs can be divided into small pieces, many programs can't be divided in an easy way!

Making CUDA development more difficult will simply cause fewer developers to use CUDA.

I just can't understand why a kernel launch MUST be synchronous. Can't there be an API that starts the kernel and returns instantly, and another API to poll whether the kernel has finished? A user-mode poll/wait would be much better than the kernel-mode poll/wait used in current CUDA.

By the way, most other parts of CUDA are very nice.

Thanks a lot

This is a limitation of Windows XP when the CUDA device is shared between computation and graphics. When your kernel is running, the graphics driver cannot update the GUI, and after a few seconds, the operating system decides something is wrong and aborts your kernel.

The solution to this is to have a CUDA device which is not running your main display. (Or to use Linux, where you can decide not to run a GUI at all.)

I don’t understand this part. From the perspective of the user code, this is exactly what happens. The kernel starts asynchronously, and the CPU continues executing your program. You can check on the status, or deliberately run a function to wait on the results.
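
For what it's worth, here is a minimal sketch of that pattern; the kernel, the data size, and the launch configuration are placeholders, not anything from this thread. The launch returns to the CPU immediately, and cudaEventQuery lets you poll for completion from user mode.

```cpp
// Minimal sketch of the asynchronous launch + user-mode polling pattern.
// longKernel, the data size and the launch configuration are placeholders.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void longKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;                    // stand-in for real work
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t done;
    cudaEventCreate(&done);

    // The launch itself returns immediately; the kernel runs in the background.
    longKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(done, 0);

    // Poll from user mode; the CPU is free to do other work inside this loop.
    while (cudaEventQuery(done) == cudaErrorNotReady) {
        // ... do useful host-side work here ...
    }

    printf("kernel finished\n");
    cudaEventDestroy(done);
    cudaFree(d_data);
    return 0;
}
```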

Getting rid of the watchdog timer on the primary display would require a way to swap a running kernel’s register file and shared memory out to global memory mid-execution. Then the graphics driver could be given a time slice periodically to avoid the watchdog. Not impossible, though I have no idea if the current hardware is capable of this.

The explanation makes things clear. So this limitation comes from Microsoft instead of NVIDIA. :blink:

But is there any way to disable this time limit on Windows? Windows does this kind of check to prevent a bad graphics call from hanging the entire system, but a well-behaved CUDA kernel can still need a long time to run. Breaking the task into small pieces can require a redesign of the CUDA kernel, and really does make CUDA development more difficult.

Use a dedicated compute card. There’s no way to turn it off on XP, nor should there be.

I just want to make sure I understand the situation…

As a practical matter, there is no way to program in CUDA if your nVIDIA card is the only graphics card in the system, right?

Is it time to dust off an old PCI graphics card to use to drive the desktop? This is kind of a pain as I only have one input on my monitor (Dell 30").

Thanks,

–Mark

No, there’s no way to run kernels longer than 5s on WinXP if you’re using that card for display. If you want to run kernels longer than 5s, your options are:

  • use Vista and turn off the TDR timeout, which is probably a bad idea

  • run Linux from a console (this is the right answer)

  • buy a dedicated compute card (this is also the right answer)

Note that this is not “program execution is longer than 5s” or “total time spent on the GPU longer than 5s,” it’s a single kernel invocation longer than 5s.

You can use CUDA even if there's only one NVIDIA card in the system, you just need to take extra care. Partition your workload into smaller chunks so that a single kernel invocation stays well within the limit, say under 2 s.
And if you want the Windows UI to stay responsive while performing computations on the card, you should partition your task into even smaller chunks, so that each runs for 50 ms or so; a rough sketch of that chunked approach follows below.
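
Here is a hypothetical sketch of what the chunking can look like; the kernel, names, and chunk size are made up for illustration, and each launch should stay far below the watchdog limit (tens of milliseconds keeps the UI smooth).

```cpp
// Hypothetical sketch of chunking one big job into many short kernel launches.
#include <cuda_runtime.h>

__global__ void processChunk(float *data, int offset, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count)
        data[offset + i] += 1.0f;           // stand-in for real work
}

void runInChunks(float *d_data, int total, int chunk)
{
    for (int offset = 0; offset < total; offset += chunk) {
        int count = (total - offset < chunk) ? (total - offset) : chunk;
        processChunk<<<(count + 255) / 256, 256>>>(d_data, offset, count);
        cudaDeviceSynchronize();            // give the driver a chance to service the display
    }
}
```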

Of course you can use CUDA with only one card in the system. It wouldn't be much use otherwise… tmurray gave you all the details you need, so I will just add my 2c of anecdotal information:
I've been developing CUDA applications for nearly 2 years now on single-GPU machines. Nearly all kernels I've ever written complete in milliseconds. In fact, I have never EVER seen the 5 s launch timeout in 2 years of development unless I explicitly tried to trigger it (or, umm, accidentally wrote an infinite loop).

I can probably break up my problem into sub-5-second chunks. What is the typical overhead of a kernel call? I couldn't find anything in a quick search of these forums.

Thanks,

–Mark

About 10 µs, I believe. When you get close to the 5 s limit, that overhead is truly negligible.
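
If you want to measure it on your own setup, one crude approach is to time a large number of empty-kernel launches and divide. This is only a sketch (the launch count and names are arbitrary), and the number will vary with driver, OS, and GPU:

```cpp
// Rough micro-benchmark sketch: average the cost of many empty-kernel launches.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main()
{
    const int launches = 10000;

    emptyKernel<<<1, 1>>>();            // warm up / create the context
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < launches; ++i)
        emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average launch overhead: %.1f us\n", ms * 1000.0f / launches);
    return 0;
}
```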

What about the data stored in device memory? Does it stay in memory after 5 sec?

Sure, why wouldn’t it? Am I somehow not explicit enough when I say that the only limitation is that a single kernel invocation cannot last more than 5s?
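
In other words, device allocations are tied to the lifetime of your CUDA context (your application), not to any single kernel launch. A minimal illustration, with a made-up buffer, kernel, and sizes:

```cpp
// Sketch: one cudaMalloc'd buffer reused across many short kernel launches.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    float h_out[1024];
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));      // allocated once
    cudaMemset(d_buf, 0, n * sizeof(float));

    for (int pass = 0; pass < 100; ++pass) {    // many short launches, same buffer
        addOne<<<(n + 255) / 256, 256>>>(d_buf, n);
        cudaDeviceSynchronize();
    }

    cudaMemcpy(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("h_out[0] = %.0f (expect 100)\n", h_out[0]);
    cudaFree(d_buf);                            // freed only when the program decides
    return 0;
}
```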