Launch Timeouts

I’m very close to having a working version of the program I’m trying to build, but for some reason I’m getting

Cuda error: Kernel execution failed in file ‘testbed.cu’ in line 367 : the launch timed out and was terminated.

when trying to run the kernel (in debug mode), but the weird thing is that it only happens sometimes. The program should invoke the kernel and exit if it needs more data to analyze, get more data, then repeat the process. Sometimes it will run the kernel once, sometimes more than once, without any problems.

What could cause this to time out? If you need me to paste in any code, I can do that.
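
In case it helps before I paste anything, here is a minimal sketch of the launch-and-check pattern I described above; the kernel name, buffer names, and sizes are placeholders, not my actual code:

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Placeholder kernel standing in for the real analysis step.
    __global__ void analyzeKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));

        analyzeKernel<<<n / 256, 256>>>(d_data, n);

        // Launches are asynchronous: check the launch itself, then synchronize
        // and check for execution errors such as the launch timeout.
        cudaError_t err = cudaGetLastError();
        if (err == cudaSuccess)
            err = cudaThreadSynchronize();
        if (err != cudaSuccess) {
            fprintf(stderr, "Cuda error: %s\n", cudaGetErrorString(err));
            return EXIT_FAILURE;
        }

        cudaFree(d_data);
        return 0;
    }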

I have a similar problem. I get the same error when I reach a certain number of threads. Above this number I always get the error; below it, I have yet to have any problems. At exactly this number, I get the error intermittently. I’ve also noticed that if I reduce the number of computations per thread (sometimes just by decreasing the size of a loop), I can increase the number of threads I can run without getting the error.

Any suggestions will be welcomed.

You two may actually be having different issues. The most common causes of problems like yours are:

tcullison:

  • too many registers per thread. You will not be able to launch if (registers per thread) x (threads per block) > 8192. Compile with the -keep option and check the .cubin file to see how many registers are being used. You can try to reduce that with the -maxrregcount flag to nvcc (check the nvcc documentation for details). Judging by your description, this is most likely your issue (see the sanity-check sketch after this list).

  • too much shared memory. A threadblock can use no more than 16K of shared memory, otherwise it fails to launch. Again, check the .cubin file, plus add whatever you’re allocating in shared mem dynamically.

nkohlmei:

  • your kernel runtime exceeds the time allowed by the watchdog mechanism. I believe that’s 5s in WinXP, not sure about the number in Linux.
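
If it helps, here is a quick sanity-check sketch along those lines. The register count and static shared-memory size are assumed values that you would copy by hand from the reg and smem entries of the .cubin produced with nvcc -keep; the rest uses the device limits reported by the runtime:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        const int REGS_PER_THREAD   = 16;   // assumed value from the .cubin "reg" entry
        const int SMEM_PER_BLOCK    = 56;   // assumed value from the .cubin "smem" entry, in bytes
        const int THREADS_PER_BLOCK = 512;

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        // Per-block register limit (8192 on G80-class hardware).
        if (REGS_PER_THREAD * THREADS_PER_BLOCK > prop.regsPerBlock)
            printf("too many registers: %d x %d > %d per block\n",
                   REGS_PER_THREAD, THREADS_PER_BLOCK, prop.regsPerBlock);

        // Per-block shared-memory limit (16 KB on G80-class hardware).
        if ((size_t)SMEM_PER_BLOCK > prop.sharedMemPerBlock)
            printf("too much shared memory: %d > %zu bytes per block\n",
                   SMEM_PER_BLOCK, prop.sharedMemPerBlock);

        return 0;
    }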

paulius:

Thank you for the advice.

I have since checked that (registers per thread)x(threads per block) is indeed less than 8192.

After checking the .cubin file, I also believe that the amount of shared memory I’m using is OK as well (56 bytes * 512 threads per block).

However, if I modify the code a little (so that I am not accessing as much global memory), the code works fine. I have used the -maxrregcount flag with the same register restriction when compiling both the modified and unmodified code; yet, the modified code executes while the unmodified code does not. For clarity, the modified code is the version that accesses global memory less, and its loops iterate as much as or more than the loops in the unmodified code.

Do you have any other suggestions, or insight?

I appreciate your help.

Can you post your code (or provide a link or attachment)? Are you allocating shared memory dynamically when calling the kernel?

Also, you don’t have to multiply the smem value in the .cubin file by the number of threads in a block - that value is for the entire block.
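
To make the per-block point concrete, here is an illustration only (not your code), using the 56-byte figure:

    // A kernel like this uses 56 bytes of static shared memory for the whole
    // block (14 floats x 4 bytes), well under the 16 KB per-block limit.
    // Multiplying 56 by 512 threads would wrongly suggest 28,672 bytes.
    __global__ void smemExample(float *out)
    {
        __shared__ float tile[14];   // 56 bytes total for the block
        if (threadIdx.x < 14)
            tile[threadIdx.x] = (float)threadIdx.x;
        __syncthreads();
        out[blockIdx.x * blockDim.x + threadIdx.x] = tile[threadIdx.x % 14];
    }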

Paulius

paulius:

Thank you for your help.

I’m not allocating any shared memory dynamically when calling the kernel. Unfortunately, I cannot post my code, although sometime this week I might be able to put together a small piece of code that shows the same problem and post that.

I will post if I find anything new.

-tcullison

You never said whether you solved your problem or not. I was having exactly the same issue: kernels that should execute in < 10 ms would randomly give the launch-timeout error, but only about once every ~10,000 launches. In my case, it turned out to be an incorrectly installed driver, which was causing other issues too. Run nvidia-bug-report.sh and check for any API mismatch errors to see if you have the same root cause.

See http://forums.nvidia.com/index.php?showtop…ndpost&p=231374

I am running a kernel that takes more than 8.7 seconds to finish, and I am now getting the same error:

the launch timed out and was terminated.

What should I do to avoid it?

With a smaller problem size, the same kernel runs for 7 seconds, and there is no launch error.

That seems odd. Are you running in Linux console mode, so that you don’t have the 5 s limitation to begin with?

And to clarify an old post of mine above: my particular problem was not solved by correctly installing the driver. It still persists even in the CUDA 1.1 beta and NVIDIA is working on a solution (crossing fingers that the solution will be in the 1.1 release version). The particular calling card of this problem is a kernel that normally executes in a very short time (ms), but if you call it 100,000 times in a row, even with the SAME DATA, it will only get to 10,000 or 20,000 calls before a launch takes 5s and then either gives “launch timeout” or “unspecified launch failure”. Running in the linux console (no 5s limitation) just causes the machine to hard lock when it reaches this point.
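
For anyone trying to reproduce it, the pattern is essentially the following; the kernel name, launch configuration, and buffer are placeholders, not my actual code:

    // Sketch: repeatedly launch a short kernel on the SAME data, checking every
    // launch. grid, block, d_data and n stand for whatever setup the real code
    // uses; at some point one launch takes ~5 s and fails.
    for (int i = 0; i < 100000; ++i) {
        shortKernel<<<grid, block>>>(d_data, n);
        cudaError_t err = cudaThreadSynchronize();
        if (err != cudaSuccess) {
            fprintf(stderr, "launch %d failed: %s\n", i, cudaGetErrorString(err));
            break;   // typically "launch timeout" or "unspecified launch failure"
        }
    }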

I have resolved these issues by putting in a second video card that works as the primary display only. How about this solution?

byung,

I reduced my block count by half, and the kernel now just barely runs, in a little under 6.7 seconds. I think the performance is lower, but at least it runs. I have heard about the 5-second limitation, but I don’t know why my computer/GPU can run a kernel for a bit more than 7 seconds. I am using Windows XP, running the CUDA 1.1 beta.

My computer is a Dell Precision 670 workstation with dual 3.0 GHz Xeon CPUs, about 2.5 years old. The GPU is a GeForce 8800 Ultra, which I just got a few weeks ago.
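
In case it is useful to anyone, the same block-count idea can be pushed further by splitting one long launch into several shorter ones, each finishing well under the watchdog limit. A rough sketch with made-up names, assuming the kernel can offset its work by a starting block index:

    const int totalBlocks  = 65536;
    const int blocksPerRun = 8192;   // tune so each launch stays well under 5 s

    for (int first = 0; first < totalBlocks; first += blocksPerRun) {
        int count = totalBlocks - first;
        if (count > blocksPerRun)
            count = blocksPerRun;

        workKernel<<<count, 256>>>(d_data, first);   // kernel offsets its work by 'first'
        cudaError_t err = cudaThreadSynchronize();
        if (err != cudaSuccess) {
            fprintf(stderr, "chunk at block %d failed: %s\n",
                    first, cudaGetErrorString(err));
            break;
        }
    }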

To byung:

I do have two GPU cards, but I have only one PCI-e slot in my computer, and buying a new computer is not an affordable solution for me at the moment. But thank you very much for the information.

Does anyone have any idea why a kernel cannot take more than 7 s to execute?

Hi,

Yes, I’m currently running into the same issue as well. I’m bumping this to see if anyone has learned more about it.

A colleague of mine mentioned that if the card isn’t set to COMPUTE ONLY mode then the card will want to do some display related update after some time. I don’t know if that’s in any way related.

thanks
j

So, what you are saying is that if I have just one card, which also does the rendering, I’m not able to write a kernel that needs more than 7 seconds to execute?

Do you know how to set the device to COMPUTE ONLY mode? I’ve never heard of anything like that before…

I’m not saying that; I just have a vague guess that it might be related. :)

I don’t think it’s called COMPUTE ONLY exactly, it’s got some other very similar name. Will search and get back to you…

EDIT: It’s called Compute-Exclusive mode https://www.wiki.ed.ac.uk/display/ecdfwiki/…-Exclusive+Mode

I’m not sure if this fixes our problem…
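
For what it’s worth, the runtime can at least report which compute mode a device is in (this only queries the mode, it doesn’t change it, and it assumes a toolkit new enough to expose computeMode in cudaDeviceProp):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        const char *mode =
            prop.computeMode == cudaComputeModeExclusive  ? "compute-exclusive" :
            prop.computeMode == cudaComputeModeProhibited ? "prohibited"        :
                                                            "default (shared)";
        printf("device 0 (%s) compute mode: %s\n", prop.name, mode);
        return 0;
    }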

Hmm, we need to figure out a way to avoid this problem when using just one card.

I’ll let you know if I find something.

Thanks

Anything found on this issue?

I’m actually having the same issue, and I can’t afford for all my users to have 2 graphics cards ;-)

Thanks for sharing anything you have found so far!

Stephane

Yeah, it’s easy to solve: 1) Install Linux. 2) Disable X Windows from starting (i.e., set the inittab default runlevel to 3, or remove xdm from the startup scripts). 3) Run your application without any launch timeouts!
