question about "launch timed out"

Hi guys, I am using CUDA to do a Kinetic Monte Carlo simulation, but when I run my code I get this error:
“the launch timed out and was terminated.”

I have checked the cubin file; the number of registers and the amount of shared memory are both OK.

And I have an observation about this error:
if I enlarge the input size and the number of loop iterations in the kernel at the same time, the error appears.
It seems there is a trade-off between the input size and the loop count.

Here is my testing result:

input size   loop times without error
400*400      100
600*600      10
1000*100     1

If I go beyond those loop counts, it reports this error.

So, can anybody help?

Driver calls (which are generally kernel-mode function calls) in most operating systems can only run for so long; the OS kernel will terminate any call that exceeds a certain time. This is commonly referred to as the ‘watchdog timer’ - it’s basically the OS kernel killing deadlocked/hanging kernel-mode calls.

From what I understand, nVidia’s driver has a function/thread associated with every launched kernel, which stays active until the kernel completes execution. So if your kernel takes longer than the OS watchdog timer allows, the OS kernel will terminate the driver call that’s running your kernel, your kernel will die, and ‘hopefully’ (though in Windows you generally get a BSOD) you get a timed-out error.
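For what it’s worth, here’s a sketch of how you might detect this condition in host code. The kernel is just a stand-in for the real KMC kernel; the error code `cudaErrorLaunchTimeout` is from the CUDA runtime API, and the timeout only surfaces once you synchronize, since the launch itself is asynchronous:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real KMC kernel.
__global__ void longRunningKernel(float *data, int n, int loops) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int l = 0; l < loops; ++l)
            data[i] = data[i] * 0.999f + 0.001f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    longRunningKernel<<<(n + 255) / 256, 256>>>(d_data, n, 100000);

    // The launch returns immediately; the watchdog kill shows up here.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaErrorLaunchTimeout)
        printf("the launch timed out and was terminated\n");
    else if (err != cudaSuccess)
        printf("CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```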

If I’m not mistaken, anything based on the NT kernel has a ~5 second watchdog timer (this varies for unknown reasons, but it’s ‘around about’ 5 seconds).
Anything based on Linux has a user-settable watchdog timer (I’m not sure of the default, if there is one - I’m guessing it depends on your kernel config options when compiling).

Side note: realistically it’s a softdog (a watchdog timer implemented in software, i.e. by the kernel) on most consumer PCs - proper watchdog timers are usually separate add-on cards which don’t rely on the OS to operate and have direct control over certain BIOS features (e.g. capable of resetting the system if the timer is triggered).

Edit: I should probably note there’s a lot more to it than that; not all kernel-mode/driver calls have this limitation. I’m not a device driver programmer by any means, but this only applies to certain types of drivers (I’m not sure whether that’s dictated by the OS or the driver author).
More specifically, in terms of CUDA: this watchdog timer allegedly only applies to CUDA code running on cards which are also display devices.
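You can actually query whether the watchdog applies to a given card: the CUDA runtime exposes a `kernelExecTimeoutEnabled` field in `cudaDeviceProp`. A quick sketch (error checking omitted for brevity):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // kernelExecTimeoutEnabled is 1 when the OS watchdog applies
        // (typically because the card is driving a display).
        printf("device %d (%s): watchdog %s\n", dev, prop.name,
               prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
    }
    return 0;
}
```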

So you can avoid this problem in two ways:

  1. Run your CUDA code on a separate CUDA card which isn’t used as a display device.
  2. Ensure your CUDA kernel calls execute in less than 5 seconds (eg: by breaking it up into smaller kernels).
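For option 2, the usual trick is to hoist the long loop out of the kernel and run it as a series of short launches. A sketch (the kernel body and names are made up for illustration, not from the original post):

```cuda
#include <cuda_runtime.h>

__global__ void kmcStep(float *state, int n, int stepsPerLaunch) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int s = 0; s < stepsPerLaunch; ++s)
            state[i] = state[i] * 0.999f + 0.001f;  // stand-in for one KMC step
}

// Instead of one kernel doing `totalSteps` iterations, launch a series of
// short kernels, sized so each launch stays well under the ~5 s watchdog.
void runSimulation(float *d_state, int n, int totalSteps, int stepsPerLaunch) {
    dim3 block(256), grid((n + 255) / 256);
    for (int done = 0; done < totalSteps; done += stepsPerLaunch) {
        int remaining = totalSteps - done;
        int steps = remaining < stepsPerLaunch ? remaining : stepsPerLaunch;
        kmcStep<<<grid, block>>>(d_state, n, steps);
        cudaDeviceSynchronize();  // give the display/driver a chance to run
    }
}
```

Since the state lives in device memory between launches, the per-launch overhead is just the launch latency, not any data transfer.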

My experience is that “only applies to CUDA code running on cards which are also display devices.” is true.

There are two computers at work on which cuda programming is done.

One is equipped with a CUDA-capable card plus an old 7000-something that runs the X server.

The other doesn’t have a monitor connected to it. All kernels are launched from the command line. If I start up the X server on that box, I get timeouts, but I don’t need it, so it’s a non-issue there.