cudaErrorLaunchTimeout and CUDA 2.0

While testing my kernel on extremely big data sets I’ve run into the following error: cudaErrorLaunchTimeout. It occurs about 14 seconds after the launch. If I reduce the data set so that the kernel finishes in about 12-13 seconds, everything works fine.
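For reference, this is roughly how the error shows up on my side (a minimal sketch; `myKernel` and the launch configuration are placeholders, not my actual code):

```cuda
// Sketch only: myKernel, grid/block sizes and d_data are illustrative.
myKernel<<<gridSize, blockSize>>>(d_data, n);

// Kernel launches are asynchronous, so the error is only reported
// once we synchronize and query the error state.
cudaError_t err = cudaThreadSynchronize();
if (err == cudaSuccess)
    err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```

With the big data set, the message printed corresponds to cudaErrorLaunchTimeout.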

My video card IS NOT connected to a monitor, so no watchdog timeouts or similar things should occur.

Also, I’ve noticed that this error happens with CUDA 2.0 but not with CUDA 1.1. CUDA 1.1 had another unpleasant problem: an infinite kernel could hang the system. It looks like CUDA 2.0 adds a monitor-independent watchdog that terminates the kernel after about 15 seconds. That is good for faulty kernels, but it is a disaster for kernels that legitimately run for a long time.
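If this watchdog really cannot be switched off, the only workaround I can think of is to split the work into several shorter launches, each safely under the time limit. A sketch of the idea (the kernel name and the chunking scheme are made up for illustration, and I have not confirmed this defeats the new watchdog):

```cuda
// Sketch: process the data in chunks so that each individual launch
// stays well under the ~14-second limit. processChunk, grid, block,
// numChunks and chunkSize are illustrative names.
int chunkSize = totalSize / numChunks;
for (int c = 0; c < numChunks; ++c) {
    processChunk<<<grid, block>>>(d_data + c * chunkSize, chunkSize);
    cudaThreadSynchronize();   // each launch is timed independently
}
```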

The programming guide does not mention this new feature, and there is no information on how to switch this extra watchdog off.

Does anybody know how to prevent forced kernel termination? NVIDIA guys, the question goes primarily to you…

Thanks in advance!

Well, I have not noticed any such thing; I have to hard-reset my machine when a kernel enters an infinite loop. I have let kernels run for 24 hours before killing the machine.

It might be that after 13 seconds you reach a point where you write past the end of an array.

I have commonly seen these errors occur with kernels that write past the end of allocated memory.
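For what it’s worth, the usual guard against this is to clamp every thread’s index against the array length, since the grid is typically rounded up past the data size. A minimal sketch (kernel and variable names are illustrative):

```cuda
__global__ void scale(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // threads past the end do nothing,
        out[i] = 2.0f * in[i];    // so nothing is written past the allocation
}
```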

I have also seen this behavior with a relatively simple kernel (normally executes in 1 millisecond) with no memory access errors. In that case, the kernel would execute normally ~50,000 times and then get into an infinite loop on the next call. As hard as this problem was to reproduce, I only have a vague idea what caused it: it seemed to be triggered either by many complicated warp divergences (for loops with different trip counts in each thread) or by too many __syncthreads() calls (in a different kernel). NVIDIA confirmed they could reproduce the issue, but the bug hasn’t been resolved yet. I worked around it by rewriting the kernels with small changes.

I see …

However, I have no __syncthreads() and absolutely no divergence - just too much data to compute in 10-12 seconds.

Is it possible that you are writing past the end of allocated memory? Sometimes it can be hard to know for certain. One way to check is to compile in emulation mode and run the program under valgrind (Linux only) or a similar memory bounds checking tool.
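In emulation mode the kernel runs as ordinary host code, so standard host tools apply; you can also add index assertions that are compiled only in emulation. A sketch, assuming the `__DEVICE_EMULATION__` macro that nvcc defines when building with `-deviceemu` (kernel and names are illustrative):

```cuda
#ifdef __DEVICE_EMULATION__
#include <assert.h>
#endif

__global__ void scale(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
#ifdef __DEVICE_EMULATION__
    assert(i < n);   // trips in emulation if the grid overruns the array
#endif
    out[i] = 2.0f * in[i];
}
```

Running the emulation build under valgrind will then flag any out-of-bounds reads or writes directly, even without the assertion.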