I have just started working on a quadratic filter to be applied to a 16-bit-per-channel image.
Currently, all I am interested in is applying the filter definition directly, without resorting to SVD or FFT.
I launch my kernel three times in succession, once for each channel.
I am not using shared memory, just global and local memory.
When I set the filter dimension (the size of the square neighborhood around the pixel in question, whose values are used in the computation) beyond a certain value, the first launch (for the red channel) does not report any errors, but it does not return anything either. The other two launches (green and blue) fail with the "the launch timed out and was terminated" error.
Is there a built-in watchdog timer mechanism in CUDA that is preventing my first launch from executing and completing properly?
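To illustrate the setup, here is a minimal host-side sketch of one channel's launch plus an error check (the kernel name, buffers, and arguments are placeholders, not my actual code). One thing worth noting: kernel launches are asynchronous, so a watchdog timeout in the red launch may only be reported by the next runtime call, which would match the red launch appearing to "succeed" silently while green and blue report the timeout.

```
// Placeholder names throughout (quadFilter, d_red, d_out, width, height, tap).
// Launches are asynchronous: an error in this launch may surface only at the
// next synchronizing runtime call, e.g. the green/blue channel launches.
quadFilter<<<grid, block>>>(d_red, d_out, width, height, tap);
cudaError_t err = cudaThreadSynchronize();  // old-API device sync
if (err != cudaSuccess)
    printf("red-channel launch failed: %s\n", cudaGetErrorString(err));
```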
It cannot be an infinite loop:
a- The same code with the same data set runs properly in emu-debug, very slowly of course.
b- The same code with a smaller data set runs properly on the GPU.
It must be hitting the 5-second limit, then.
I ran the kernel as <<<1,1>>> with an otherwise-offending filter dimension (i.e., applied the filter to pixel 0 only).
The loops in the kernel are bounded by a function argument, the filter dimension (or tap), and are not derived from blockIdx or threadIdx values.
cutxxxx() reported an execution time of less than 0.065 ms.
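For reference, the timing was done roughly like this (cutil timer names from the CUDA SDK; the kernel call is a placeholder). One caveat I should double-check: without a synchronize before stopping the timer, the measured 0.065 ms could be just the asynchronous launch overhead rather than the kernel's actual run time.

```
// Sketch of the timing harness; quadFilter and its arguments are placeholders.
unsigned int timer = 0;
cutCreateTimer(&timer);
cutStartTimer(timer);
quadFilter<<<1, 1>>>(d_red, d_out, width, height, tap);
cudaThreadSynchronize();  // without this, only launch overhead is measured
cutStopTimer(timer);
printf("kernel time: %f ms\n", cutGetTimerValue(timer));
cutDeleteTimer(timer);
```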
I am now suspecting a severe case of serialization.