Hi all,

I wrote a piece of code to sort an array of arbitrary size. When I run it over different problem sizes, some of them (usually the large ones) are not sorted properly. I think it must be a data race, since the failure occurs on a different input every time. The problem is that when I run my code under cuda-memcheck, I don't get any errors and every input is sorted perfectly. I ran it several times on different inputs without a single error. So I was wondering: how does cuda-memcheck run executables such that I get a correct answer?

Have you synchronized the host and device, i.e., inserted cudaDeviceSynchronize() after the kernel launch to ensure that all work on the device has completed? I guess cuda-memcheck introduces some kind of implicit synchronization.
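
To make the suggestion concrete, here is a minimal sketch of what I mean. It is not your code: the kernel `oddEvenPhase`, the `CUDA_CHECK` macro, and the problem size are all made up for illustration, and I swapped in a simple odd-even transposition sort as a stand-in for whatever algorithm you use. It shows two things: checking for errors and calling cudaDeviceSynchronize() after the launches, and using one kernel launch per phase so that every block finishes a phase before the next one starts (launches in the same stream run in order, whereas __syncthreads() only synchronizes threads within one block).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): abort if a runtime call failed.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Illustrative kernel: one phase of odd-even transposition sort.
// Each thread owns one disjoint pair (i, i + 1), so no two threads
// touch the same element within a single phase.
__global__ void oddEvenPhase(int *data, int n, int phase)
{
    int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x) + (phase & 1);
    if (i + 1 < n && data[i] > data[i + 1]) {
        int tmp = data[i];
        data[i] = data[i + 1];
        data[i + 1] = tmp;
    }
}

int main()
{
    const int n = 1 << 12;  // assumed problem size for the sketch
    int *h = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; ++i) h[i] = n - i;  // reverse-ordered input

    int *d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(int)));
    CUDA_CHECK(cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (n / 2 + threads - 1) / threads;

    // One launch per phase: launches in the default stream execute in
    // order, so each phase sees the complete result of the previous one.
    for (int phase = 0; phase < n; ++phase) {
        oddEvenPhase<<<blocks, threads>>>(d, n, phase);
        CUDA_CHECK(cudaGetLastError());  // catches launch configuration errors
    }

    // Wait for all queued kernels; also reports errors raised during execution.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost));

    bool sorted = true;
    for (int i = 1; i < n; ++i)
        if (h[i - 1] > h[i]) { sorted = false; break; }
    printf("%s\n", sorted ? "sorted" : "NOT sorted");

    CUDA_CHECK(cudaFree(d));
    free(h);
    return sorted ? 0 : 1;
}
```

Note that a plain cudaMemcpy() already waits for earlier work in the default stream, so the explicit synchronize matters most if you use cudaMemcpyAsync(), multiple streams, or managed memory, or if you want execution errors reported at a known point. If your kernel relies on ordering between different blocks inside a single launch, that would explain why only the large sizes fail. You could also try `cuda-memcheck --tool racecheck`, keeping in mind that it only looks at shared memory hazards.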