Error launching a kernel: "the launch timed out and was terminated"

Update: Problem solved. It was caused by an infinite loop in the kernel :(

Hi,

My kernel function is called inside a loop. The code runs fine the first time the kernel is called. But when it is called again, cudaGetLastError() and cudaGetErrorString() report the error “the launch timed out and was terminated”.

I searched the forum, and it seems this is related to the watchdog timer, which triggers after about 5 seconds on Linux (I am using Ubuntu). I followed the suggestions and tried turning off the X server with “sudo /etc/init.d/gdm stop”. The error message went away, but the kernel then appeared to run forever on the second call.
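To see whether the watchdog applies to a given GPU, the runtime can be asked directly. The following is a minimal sketch, assuming a CUDA toolkit recent enough to provide cudaDeviceGetAttribute(); the helper describeTimeout() is a hypothetical name used only for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: turns the attribute value into readable text.
static const char *describeTimeout(int enabled) {
    return enabled ? "yes (watchdog active)" : "no";
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        int enabled = 0;
        // cudaDevAttrKernelExecTimeout is nonzero when kernels on this
        // device are subject to a run time limit.
        err = cudaDeviceGetAttribute(&enabled, cudaDevAttrKernelExecTimeout, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        std::printf("device %d: run time limit on kernels: %s\n",
                    dev, describeTimeout(enabled));
    }
    return 0;
}
```

If the attribute reads "no" after the X server is stopped, a kernel that never returns will simply hang instead of being killed, which matches the "running forever" behaviour described above.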

Next, I thought it would be better to profile the kernel and see how long it took when it ran correctly the first time. I tried timing it with CUDA events (following the Programming Guide), but that reported zero seconds. I also tried the C clock() function, and again it reported zero seconds. Finally, I used gettimeofday() in C to record the wall-clock time, which gave 14 seconds for the kernel. I have printf statements before and after the kernel call for the input and output, and the time between seeing the input and the output was close to 14 seconds, as expected. Occasionally, while the kernel was running, I pressed Ctrl+C at around 3 seconds to terminate it, and the outputs were the same as when I let the kernel complete normally in 14 seconds. This suggests the kernel had actually stopped executing much earlier than 14 seconds.
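The post does not show the timing code, so the following is only a guess at what went wrong. Kernel launches are asynchronous, so a host timer read right after the launch measures only the launch overhead, and cudaEventElapsedTime() fails if the stop event has not completed yet. A minimal sketch of timing with both CUDA events and a host clock is below; dummyKernel and toMs() are hypothetical stand-ins, not the original code.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: just burns some time.
__global__ void dummyKernel(float *out, int iters) {
    float x = static_cast<float>(threadIdx.x);
    for (int i = 0; i < iters; ++i) x = x * 1.000001f + 0.5f;
    out[threadIdx.x] = x;
}

// Hypothetical helper: converts a host clock duration to milliseconds.
static double toMs(std::chrono::steady_clock::duration d) {
    return std::chrono::duration<double, std::milli>(d).count();
}

int main() {
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, 32 * sizeof(float)) != cudaSuccess) return 1;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    auto t0 = std::chrono::steady_clock::now();
    cudaEventRecord(start, 0);
    dummyKernel<<<1, 32>>>(d_out, 1 << 20);
    cudaEventRecord(stop, 0);
    auto t1 = std::chrono::steady_clock::now();   // kernel may still be running

    // Wait for the stop event; the return value also reports kernel failures.
    cudaError_t err = cudaEventSynchronize(stop);
    auto t2 = std::chrono::steady_clock::now();   // kernel has finished here
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);    // valid only after the sync

    std::printf("host time, launch only:  %.3f ms\n", toMs(t1 - t0));
    std::printf("host time, after sync:   %.3f ms\n", toMs(t2 - t0));
    std::printf("GPU time from events:    %.3f ms\n", gpuMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```

The "launch only" figure is typically close to zero no matter how long the kernel runs, which would be consistent with the zero readings above.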

My questions are: how can I tell what is going on in the kernel between 3 seconds and 14 seconds? Why did the clock-based timers return zero?
What does “the launch timed out and was terminated” indicate?

I am using a Tesla C2050. For debugging purposes, the kernel is launched with 32 threads, but only one thread actually executes.

Thanks!

I just found that this error message is actually associated with the first run of the kernel function.

It appeared to come from the second call because I had mistakenly placed cudaGetLastError() before cudaThreadSynchronize(). So although the first run took longer than 5 seconds, cudaGetLastError() could not detect the timeout until the second run.
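The ordering issue described here can be sketched as follows. This is a minimal example, not the original code: stepKernel and checkCuda() are hypothetical names, and it uses cudaDeviceSynchronize(), which replaced the now-deprecated cudaThreadSynchronize() in later toolkits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the error and returns false on failure.
static bool checkCuda(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical stand-in for the real kernel.
__global__ void stepKernel(int *data, int step) {
    data[threadIdx.x] += step;
}

int main() {
    int *d_data = nullptr;
    if (!checkCuda(cudaMalloc(&d_data, 32 * sizeof(int)), "cudaMalloc")) return 1;
    if (!checkCuda(cudaMemset(d_data, 0, 32 * sizeof(int)), "cudaMemset")) return 1;

    for (int step = 0; step < 3; ++step) {
        stepKernel<<<1, 32>>>(d_data, step);

        // Catches launch problems only (bad configuration and similar);
        // the kernel has usually not finished at this point.
        if (!checkCuda(cudaGetLastError(), "kernel launch")) return 1;

        // Blocks until the kernel is done; execution errors such as a
        // watchdog timeout are reported here, in the same iteration.
        if (!checkCuda(cudaDeviceSynchronize(), "kernel execution")) return 1;
    }

    cudaFree(d_data);
    return 0;
}
```

With the synchronize call checked inside the loop, a timeout in the first launch is reported in the first iteration rather than surfacing at the next call.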

But I am still not sure why the kernel takes so long when it actually does the job in a much shorter time.