I have a pretty straightforward array processing kernel for 3D finite difference calculations which randomly fails with an unspecified launch failure, but only when the kernel isn’t compiled for debugging.
In a marathon test run a couple of days ago, the test app I have been using to exercise it ran 20 times with the same input data each time without fault when built for debugging (8 straight hours and over 300,000 kernel launches). After recompiling the application without debugging symbols, it produced unspecified launch failures 10 times out of 20, never failing in the same place twice. There doesn’t seem to be any correlation between the failures and the kernel execution parameters or problem size either. It will crash as happily on runs with a single block and a few thousand array elements as it will with hundreds of blocks and millions of elements. And it will randomly crash while processing the same input data sets in a tight loop. When it doesn’t crash, the final results are correct.
And failure is pretty catastrophic. If the kernel runs on a card with an active display, the display is usually hosed, and the driver reports hard errors in the kernel ring buffer like this:
[46389.480119] NVRM: Xid (0001:00): 13, 0003 00000000 000050c0 00000368 00000000 00000100
The behaviour is the same under both CUDA 2.2 and 2.3 on Linux (right now 64 bit Ubuntu 9.04 with 190.18 drivers). I have built the test application both against ocelot and in emulation mode and run it with valgrind, and neither has ever detected any buffer overruns or memory errors in the kernel that fails. I am fast running out of ideas about where to look to solve this, given that none of the debugging tools at my disposal (cuda-gdb, ocelot and valgrind) can detect anything erroneous. The kernel itself contains almost no conditional code paths, all data lives in device memory for the entire duration of the test program, and it does absolutely nothing exotic with memory or device pointers. The test app is single threaded and just uses the runtime API. I can post code if someone wants to take a look at it, and I am open to any suggestions about where to look or what to do to try and pin down where things are going wrong.