Random Launch Failure

I have a suite of kernels that I call one after another. They all read their input from the same global memory, and each writes its output to its own global memory location. Sometimes I can run several hundred kernels before I get “unspecified launch failure”; sometimes I can only run 3. What is the most likely dumb thing I could be doing?

Thanks!

Probably an out-of-bounds memory access (the CUDA version of a segfault). Whether it proves fatal probably depends on the state of the device memory map at the time, or on some condition in your code that only arises rarely. Try something like GPU Ocelot, or valgrind in device emulation mode, and see whether it detects anything.
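To narrow down which launch is the one that fails, it can help to check for errors after every kernel while debugging. Here is a minimal sketch of that idea, not your actual code: the kernel `scale`, the helper `launch_checked`, and the sizes are all made up for illustration, and it assumes a runtime recent enough to have `cudaDeviceSynchronize`. It also shows the usual index guard, since a grid rounded up to a whole number of blocks is a common source of out-of-bounds writes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: reads a shared input buffer, writes to its own output.
__global__ void scale(const float* in, float* out, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The grid is rounded up to whole blocks, so some threads have i >= n.
    // Without this guard they would read and write past the end of the buffers.
    if (i < n) {
        out[i] = in[i] * factor;
    }
}

// Hypothetical helper: launch one kernel and report any error immediately.
cudaError_t launch_checked(const float* d_in, float* d_out, int n, float factor)
{
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // round up
    scale<<<blocks, threads>>>(d_in, d_out, n, factor);

    // Errors in the launch itself (e.g. a bad configuration) show up here.
    cudaError_t err = cudaGetLastError();
    // Errors during execution only show up once the kernel has finished.
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}

int main()
{
    const int n = 1000;  // deliberately not a multiple of the block size
    float* d_in = nullptr;
    float* d_out = nullptr;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return 1;
    }
    cudaMemset(d_in, 0, n * sizeof(float));

    int rc = 0;
    for (int k = 0; k < 100; ++k) {
        if (launch_checked(d_in, d_out, n, 2.0f) != cudaSuccess) {
            fprintf(stderr, "first failure at launch %d\n", k);
            rc = 1;
            break;
        }
    }

    cudaFree(d_in);
    cudaFree(d_out);
    return rc;
}
```

Synchronizing after every launch slows things down, so this is only meant for debugging. Without it, a failure is typically reported by some later call, which makes it look as if the wrong kernel is at fault.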

The other alternative would be flaky hardware, which can happen. But I would be looking in a lot of other places before pointing any fingers in that direction.

Thanks, I will go have a look!