I implemented a rather complex kernel that searches for stars in astronomical images. While it works perfectly on small images with a few thousand stars, it fails with cudaErrorLaunchFailure (reported by the subsequent cudaMemcpy call) on a 4k x 4k image with ~50k stars. Unfortunately, even after a week of experiments and googling (including this forum), I have not been able to determine the cause.
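Because kernel launches are asynchronous, an error raised while the kernel runs surfaces at the next synchronizing call, which is why it shows up at cudaMemcpy. A minimal sketch of checks that attribute the error to the kernel itself is below. The kernel findStars, the CUDA_CHECK macro, and the buffer names are placeholders made up for illustration, not the real code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real star-search kernel.
__global__ void findStars(const float* image, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = image[i] > 0.5f ? 1 : 0;
}

// Hypothetical helper: report the failing call and its error string, then exit main.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s failed: %s\n", #call,            \
                         cudaGetErrorString(err_));                   \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main()
{
    const int n = 1 << 20;
    float* dImage = nullptr;
    int* dOut = nullptr;
    CUDA_CHECK(cudaMalloc(&dImage, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&dOut, n * sizeof(int)));
    CUDA_CHECK(cudaMemset(dImage, 0, n * sizeof(float)));

    findStars<<<(n + 255) / 256, 256>>>(dImage, dOut, n);
    CUDA_CHECK(cudaGetLastError());       // errors detected at launch (bad configuration etc.)
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel was executing

    int first = -1;
    CUDA_CHECK(cudaMemcpy(&first, dOut, sizeof(int), cudaMemcpyDeviceToHost));
    std::printf("out[0] = %d\n", first);

    CUDA_CHECK(cudaFree(dImage));
    CUDA_CHECK(cudaFree(dOut));
    return 0;
}
```

With checks like these, a failure during execution is reported at cudaDeviceSynchronize rather than at the later copy.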
I believe the problem is not caused by unaligned access (all local data are 4-byte ints and floats).
There is a dependence on kernel execution time, but no clear threshold. The kernel has to run for more than ~2 s to fail, but sometimes it fails after ~2.1 s, while on a different image it works even though the execution time is ~2.8 s. When the execution time is ~3 s or more, it almost always fails.
There is a dependence on image dimensions, but again not a clear one. The full 4k x 4k image always fails. An approx. 3k x 3k crop of ANY portion of the image usually works. Smaller crops always work. When the image size is close to the limit, the kernel usually works several times and then fails.
There is no dependence on image content. Any portion of the 4k x 4k image can be processed, as long as it is cropped out of the full image.
There is no dependence on any single dimension; 4k x 2k and 2k x 4k images always work.
I tried to terminate the kernel prematurely by returning from various parts of the code, again without any observable pattern. When the execution time is short, the error never occurs. When the execution time grows because other parts of the algorithm are included, the error starts to occur on larger images. No specific portion of the code has to be executed to cause the error; every part of the code works on a smaller image.
I use CUDA v8, and the occurrence of the error depends on the hardware used. While the error always appears when the kernel is launched on the full 4k x 4k image on a GTX650, the same image is processed successfully on a GTX1060 in 9 out of 10 tries.
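Since the behaviour depends on both run time and hardware, the per-device properties may be worth comparing between the two cards. The sketch below lists each device together with the kernelExecTimeoutEnabled field of cudaDeviceProp, which reports whether a run-time limit applies to kernels on that device. Whether such a limit has anything to do with the failures described here is only an assumption, not something I have confirmed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties(%d) failed: %s\n",
                         dev, cudaGetErrorString(err));
            return 1;
        }
        // kernelExecTimeoutEnabled is nonzero when kernels on this device
        // are subject to a run-time limit.
        std::printf("device %d: %s, compute %d.%d, %zu MiB global memory, "
                    "run-time limit on kernels: %s\n",
                    dev, prop.name, prop.major, prop.minor,
                    prop.totalGlobalMem >> 20,
                    prop.kernelExecTimeoutEnabled ? "yes" : "no");
    }
    return 0;
}
```

Running this on both the GTX650 and the GTX1060 machines would show whether the two setups differ in this respect.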
Any hint would be much appreciated.