My kernel has a for-loop that exits early with a break statement.
Although CUDA discourages divergent code, the break statement gives a 2x speedup in my case.
The program has worked fine on many data sets for many days. Recently, however, it reported an “unspecified launch failure” error on some data sets.
The strangest thing is that if I comment out the “if(x==0) break” line, the program runs correctly.
So what could be the reason?