I’m very close to having a working version of the program I’m trying to build, but for some reason I’m getting
Cuda error: Kernel execution failed in file 'testbed.cu' in line 367 : the launch timed out and was terminated.
when running the kernel (in debug mode), and the weird thing is that it only happens sometimes. The program should invoke the kernel, exit if it needs more data to analyze, get more data, then repeat the process. Sometimes it will run the kernel once, sometimes more than once, without any problems.
What could cause this to time out? If you need me to paste in any code, I can do that.
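For anyone hitting this: a launch error is only reported by the next CUDA call, so it helps to check explicitly both right after the launch and after synchronizing. Below is a minimal sketch; dummyKernel and its launch configuration are placeholders, not the poster's actual code:

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real one in testbed.cu.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);

    // Catches configuration errors (too many registers, too much shared mem, ...).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    // Blocks until the kernel finishes; a watchdog kill shows up here as
    // "the launch timed out and was terminated".
    err = cudaThreadSynchronize(); // cudaDeviceSynchronize() on newer toolkits
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}
```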
I have a similar problem. I get the same error when I reach a certain number of threads. Above this number I always get the error; below it, I have yet to see any problems; at this exact number, I get the error intermittently. I’ve also noticed that if I reduce the number of computations per thread (sometimes just by decreasing the size of a loop), I can increase the number of threads I can run without getting the error.
You two may actually be having different issues. Most common causes of problems similar to yours:
tcullison:
too many registers per thread. You will not be able to launch if (registers per thread) × (threads per block) > 8192. Compile with the -keep option and check the .cubin file to see how many registers are being used. You can try to reduce that with the -maxrregcount flag to nvcc (check the nvcc documentation for details). Judging by your description, this is most likely your issue.
too much shared memory. A thread block can use no more than 16K of shared memory, otherwise it fails to launch. Again, check the .cubin file, and add whatever you’re allocating in shared memory dynamically. (The small device-query sketch after this list shows how to read these limits off the card.)
nkohlmei:
your kernel runtime exceeds the time allowed by the watchdog mechanism. I believe that’s 5s in WinXP, not sure about the number in Linux.
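As a quick way to check those limits on your own card, here is a small sketch (not from either poster's code) that queries the per-block register and shared-memory limits via cudaGetDeviceProperties, which you can compare against the numbers in your .cubin file:

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0); // device 0

    printf("registers per block : %d\n", prop.regsPerBlock);
    printf("shared mem per block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
    printf("max threads/block   : %d\n", prop.maxThreadsPerBlock);

    // A launch fails if (registers per thread) x (threads per block)
    // exceeds regsPerBlock, or if static + dynamic shared memory
    // exceeds sharedMemPerBlock.
    return 0;
}
```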
I have since checked that (registers per thread) × (threads per block) is indeed less than 8192.
After checking the .cubin file, I also believe the amount of shared memory I’m using is OK (56 * 512 threads per block).
However, if I modify the code a little (so that I am not accessing as much global memory), it works fine. I compiled both the modified and unmodified code with the -maxrregcount flag and the same register restriction; yet the modified code executes while the unmodified code does not. For clarity: the modified code accesses global memory less often, and its loops iterate as much as or more than the loops in the unmodified code.
I’m not allocating any dynamic shared memory when calling the kernel. Unfortunately, I cannot post my code, though sometime this week I might put together a small piece of code that reproduces the problem and post that.
You never said whether you solved your problem. I was having exactly the same issue: kernels that should execute in under 10 ms would randomly give the launch-timeout error, but only about once in every ~10,000 launches. In my case it turned out to be an incorrectly installed driver, which was causing other issues too. Run nvidia-bug-report.sh and check for any API mismatch errors to see whether you have the same root cause.
That seems odd. Are you running in Linux console mode, so that you don’t have the 5-second limitation to begin with?
And to clarify an old post of mine above: my particular problem was not solved by correctly installing the driver. It still persists even in the CUDA 1.1 beta, and NVIDIA is working on a solution (fingers crossed that it makes it into the 1.1 release). The calling card of this problem is a kernel that normally executes in a very short time (milliseconds), but if you call it 100,000 times in a row, even with the SAME DATA, it only gets through 10,000 or 20,000 calls before a launch takes 5 s and then gives either “launch timeout” or “unspecified launch failure”. Running on the Linux console (no 5-second limitation) just causes the machine to hard-lock when it reaches this point.
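For anyone trying to reproduce this, a stress loop along these lines should surface the intermittent failure; dummyKernel is a placeholder standing in for the real kernel:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *d, int n) // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // Same kernel, same data, 100,000 times; report the first failure.
    for (int i = 0; i < 100000; ++i) {
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaError_t err = cudaThreadSynchronize();
        if (err != cudaSuccess) {
            printf("launch %d failed: %s\n", i, cudaGetErrorString(err));
            break;
        }
    }
    cudaFree(d_data);
    return 0;
}
```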
I reduced my block count by half, and the kernel now just barely runs, taking a little under 6.7 seconds. Performance is lower, but at least it runs. I’ve heard about the 5-second limitation, but I don’t know why my computer/GPU can run up to a bit more than 7 seconds per kernel. I am using Windows XP, running CUDA 1.1 beta.
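One common workaround, assuming your blocks are independent of one another, is to split one long launch into several shorter ones so that no single launch approaches the watchdog limit. A sketch below; processChunk and the sizes are hypothetical, not the poster's code:

```
#include <algorithm>
#include <cuda_runtime.h>

// Hypothetical kernel: each block handles one tile, identified by its
// global block index (blockIdx.x plus the offset of this slice).
__global__ void processChunk(float *d, int firstBlock)
{
    int globalBlock = firstBlock + blockIdx.x;
    int i = globalBlock * blockDim.x + threadIdx.x;
    d[i] += 1.0f;
}

// d_data must hold totalBlocks * 256 floats.
void runInSlices(float *d_data)
{
    const int totalBlocks = 8192;     // full problem size
    const int blocksPerLaunch = 1024; // tuned so one slice stays well under 5 s

    for (int first = 0; first < totalBlocks; first += blocksPerLaunch) {
        int count = std::min(blocksPerLaunch, totalBlocks - first);
        processChunk<<<count, 256>>>(d_data, first);
        cudaThreadSynchronize(); // finish this slice before launching the next
    }
}
```

The trade-off is some launch overhead per slice, but each individual launch finishes quickly enough that the display driver never kills it.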
My computer is a Dell Precision 670 workstation with dual 3.0 GHz Xeon CPUs; it’s about 2.5 years old. The GPU is a GeForce 8800 Ultra that I got just a few weeks ago.
To byung:
I do have two GPU cards, but I have only one PCI-e slot in my computer, so for the time being buying a new computer is not an affordable solution for me. But thank you very much for the information.
Yes, I’m currently running into the same issue as well. I’m bumping this to see if anyone has more insight into it.
A colleague of mine mentioned that if the card isn’t set to COMPUTE ONLY mode, it will want to do some display-related update after a while. I don’t know if that’s related in any way.
So, what you are saying is that if I have just one card, which also does the rendering, I’m not able to write a kernel that needs more than 7 seconds to execute?
Do you know how to set the device to COMPUTE ONLY mode? I’ve never heard of anything like that before…
Yeah, it’s easy to solve: 1) install Linux; 2) disable X Windows from starting (i.e., set the inittab default runlevel to 3, or remove xdm from the startup scripts); 3) run your application without any launch timeouts!