Filter Problem (the launch timed out and was terminated)


I have just started working on a quadratic filter to be applied to a 16 bits/channel image.
For now, all I am interested in is applying the filter definition directly, without resorting to SVD or FFT.

I launch my kernel three times in succession, once for each channel.

I am not using shared memory, just global and local memory.

When I set the filter dimension (the size of the square around the pixel in question, whose values are used in the computation) beyond a certain value, the first launch (for the red channel) does not give any errors but does not return anything either. The other two launches (green and blue) give the “the launch timed out and was terminated” error.

Is there a built-in watchdog timer mechanism in CUDA that is preventing proper execution and completion of my first launch?

Suggestions are welcome and appreciated.


It’s actually a watchdog defined by Windows, which should be 5 seconds.
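Whether a run-time limit applies to kernels on a given GPU can be checked from the CUDA runtime. The sketch below assumes a runtime version that exposes the `kernelExecTimeoutEnabled` field of `cudaDeviceProp`; the helper name `watchdogEnabled` is made up for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread).
// Returns 1 if the device enforces a kernel run-time limit,
// 0 if it does not, and -1 if the device properties could not be read.
int watchdogEnabled(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return -1;
    }
    return prop.kernelExecTimeoutEnabled ? 1 : 0;
}
```

Calling `watchdogEnabled(0)` from `main` reports the status of the first device. A GPU that also drives a display is the typical case where the limit is enabled.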

Your kernel probably has an infinite loop in it (or it’s very, very complex, and in that case you should break it into a few kernels) - probably has an infinite loop though :)

As for the watchdog, just search the forum and you’ll see a lot of references to it.


It cannot be an infinite loop:
a- The same code with the same data set runs properly in emu-debug, very slowly of course.
b- The same code with a smaller data set runs properly on the GPU.
It must be hitting the 5 second limit, then.
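One way to test the 5 second hypothesis is to time the launches that do complete and extrapolate to the failing filter dimension. The sketch below times one launch with CUDA events; `placeholderKernel`, `timeLaunchMs` and `projectMs` are hypothetical names, and the growth exponent passed to `projectMs` is an assumption (2 for a plain window sum, 4 if every pair of window pixels is multiplied).

```cuda
#include <cmath>
#include <cuda_runtime.h>

// Hypothetical placeholder; substitute the real filter kernel here.
// dOut must hold at least gridDim.x * blockDim.x elements.
__global__ void placeholderKernel(unsigned int* dOut, int tap)
{
    unsigned int acc = 0;
    for (int i = 0; i < tap * tap; ++i) {
        acc += i;
    }
    dOut[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Returns the GPU time of one launch in milliseconds, or -1.0f on error.
float timeLaunchMs(unsigned int* dOut, int blocks, int threads, int tap)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start);
    placeholderKernel<<<blocks, threads>>>(dOut, tap);
    cudaEventRecord(stop);

    float ms = -1.0f;
    // Waiting on the stop event also surfaces a failed or timed-out kernel.
    if (cudaEventSynchronize(stop) != cudaSuccess ||
        cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) {
        ms = -1.0f;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Scales a measured time from tapSmall to tapLarge, assuming the cost
// grows as tap^power. The exponent is an assumption, not a measurement.
double projectMs(double measuredMs, int tapSmall, int tapLarge, double power)
{
    return measuredMs * std::pow((double)tapLarge / tapSmall, power);
}
```

If the projected time for the failing dimension comes out well above 5000 ms, the watchdog explanation is plausible; if it is far below, something else is probably going on.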


That’s not correct - imagine you have a loop in the kernel and the loop iteration count is calculated in your kernel. In emu-debug the calculation might be correct, but when you run in release the value turns out to be garbage or very large.

In general, the fact that code runs perfectly in emulation doesn’t mean it runs properly in release on the GPU.

Smaller data sets might work fine because you’re not going out of bounds and reading/writing garbage values that affect the kernel time.

Maybe you can post the kernel code, or try to pinpoint the problem yourself by commenting out portions of your code until you find the offending part.
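Two checks can help narrow this down. Kernel launches are asynchronous, so a failure in one launch may only be reported by a later CUDA call; synchronizing and checking right after each launch shows which one actually failed. Clamping the window indices rules out out-of-bounds reads at the image border. This is only a sketch: `clampIndex`, `windowSum` and `checkLaunch` are hypothetical names, and the window sum stands in for the real filter.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Clamps index i into the valid range [0, n - 1].
__host__ __device__ inline int clampIndex(int i, int n)
{
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

// Stand-in kernel: sums a (2 * radius + 1)^2 window around each pixel,
// reading only from inside the image.
__global__ void windowSum(const unsigned short* in, unsigned int* out,
                          int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // guard threads past the edge

    unsigned int acc = 0;
    for (int dy = -radius; dy <= radius; ++dy) {
        int yy = clampIndex(y + dy, height);
        for (int dx = -radius; dx <= radius; ++dx) {
            acc += in[yy * width + clampIndex(x + dx, width)];
        }
    }
    out[y * width + x] = acc;
}

// Call right after a kernel launch. Returns true if the launch was
// accepted and the kernel ran to completion.
bool checkLaunch(const char* label)
{
    cudaError_t err = cudaGetLastError();      // launch configuration errors
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();         // errors during execution
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", label, cudaGetErrorString(err));
        return false;
    }
    return true;
}
```

With a `checkLaunch("red")` call after the first launch, a timeout in the red channel would be attributed to that launch instead of showing up on the green one.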

Hope that helps.


Ran the kernel as <<<1,1>>> with an otherwise offending filter dimension (i.e. applied the filter to pixel 0 only).
The loops in the code are controlled by a function argument, the filter dimension (or tap), not derived from blockIdx or threadIdx values.
The execution time reported by cutxxxx() was less than 0.065 ms.
I am now suspecting a severe case of serialization.
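As a rough check on that suspicion: at about 0.065 ms per pixel, fully serialized execution would reach a 5 second limit after roughly 5000 / 0.065, or about 77,000 pixels, which is far less than a typical image. One way to stay under the limit regardless of the cause is to process each channel in row bands, one launch per band. The sketch below assumes a window sum in place of the real filter; `filterBand`, `bandCount` and `filterInBands` are hypothetical names, and a suitable `rowsPerBand` has to be found by measurement.

```cuda
#include <cuda_runtime.h>

// Stand-in kernel: processes rows [rowStart, rowEnd) of one channel.
__global__ void filterBand(const unsigned short* in, unsigned int* out,
                           int width, int height, int radius,
                           int rowStart, int rowEnd)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = rowStart + blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= rowEnd) return;

    unsigned int acc = 0;
    for (int dy = -radius; dy <= radius; ++dy) {
        int yy = min(max(y + dy, 0), height - 1);      // clamp to image
        for (int dx = -radius; dx <= radius; ++dx) {
            int xx = min(max(x + dx, 0), width - 1);
            acc += in[yy * width + xx];
        }
    }
    out[y * width + x] = acc;
}

// Number of launches needed to cover the image height.
int bandCount(int height, int rowsPerBand)
{
    return (height + rowsPerBand - 1) / rowsPerBand;
}

// Runs the filter band by band, waiting for each launch to finish so
// that no single kernel has to cover the whole image.
cudaError_t filterInBands(const unsigned short* dIn, unsigned int* dOut,
                          int width, int height, int radius, int rowsPerBand)
{
    dim3 block(16, 16);
    for (int rowStart = 0; rowStart < height; rowStart += rowsPerBand) {
        int rowEnd = rowStart + rowsPerBand;
        if (rowEnd > height) rowEnd = height;

        dim3 grid((width + block.x - 1) / block.x,
                  (rowEnd - rowStart + block.y - 1) / block.y);
        filterBand<<<grid, block>>>(dIn, dOut, width, height, radius,
                                    rowStart, rowEnd);

        cudaError_t err = cudaGetLastError();
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
        if (err != cudaSuccess) return err;   // report the failing band
    }
    return cudaSuccess;
}
```

Smaller bands mean more launches and some overhead, but each launch finishes sooner, so the watchdog is less likely to fire.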