Inconsistent behavior from CUDA

Hello all,

I’m experiencing some very unpredictable errors using the CUDA compiler with the SDK custom build rule and Visual C++ 2008. Simply adding or removing comments either breaks or fixes the kernel code (when it breaks, the CUDA kernel fails with the "unknown error" message). Furthermore, the comment can be in the .cu file OR in the .cpp file; changing comments in either place causes unpredictable results.

The predictability of the program seems linked to the executable. After compiling, the resulting executable will behave in one of these ways:

  1. It will never crash
  2. It will always crash on certain images, but never on others
  3. It will crash roughly 50% of the time on certain images

After compiling, it consistently behaves in one of the above three ways. Once the project starts producing an unpredictable exe, I start removing bits of code until it works again. Eventually I find some arbitrary line that seems to be the culprit, i.e., commenting out the line avoids the instability. I test this several times in a row, then add the culprit line back, and the problem has disappeared.

Because of this odd nature of the problem, I haven’t been able to make a test case…is anyone else having issues similar to this?
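For anyone trying to narrow down an "unknown error" like this, one thing that may help is checking the status right after each launch, so the failure is pinned to a specific kernel instead of showing up at some later call. A minimal sketch, assuming a recent toolkit where cudaDeviceSynchronize() is available; the scale kernel and the launch_checked helper are made-up names for illustration, not part of the original project:

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel used only for illustration: doubles each element.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Hypothetical helper: launches the kernel and returns the first error seen.
cudaError_t launch_checked(float *d_data, int n)
{
    assert(n > 0);  // a grid of zero blocks is an invalid configuration

    scale<<<(n + 255) / 256, 256>>>(d_data, n);

    // Errors in the launch itself (bad configuration, etc.) show up here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        return err;

    // Errors raised while the kernel runs are reported once the host waits for it.
    return cudaDeviceSynchronize();
}
```

If the returned value is not cudaSuccess, cudaGetErrorString() turns it into a readable message. This will not explain a hang caused by a barrier that never completes, but it does separate launch failures from failures inside the kernel.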

Uh, you have a segfault in your code? Doesn’t seem too unreasonable…

Edit - I’m not sure what was causing the apparent non-determinism, but I have solved what seems to be the major problem. Basically, I had a __syncthreads() call that came after a return which some threads (the ones that were out of range) had already taken. This caused the sync to wait indefinitely (I had just assumed it would only wait on threads that had not returned already).

The behavior of __syncthreads() is undefined when not all threads in a CTA participate, even if some warps have exited.

Ah, ok…so if I want to use __syncthreads() safely I basically have to do this?

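__global__ void test(int size)
{
   int i = threadIdx.x;

   if (i < size)
   {
      //stuff
   }

   // outside the conditionals, so every thread in the block reaches it
   __syncthreads();

   if (i < size)
   {
      //more stuff
   }
}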

Basically. This is the way that all the examples do it. What I believe __syncthreads() does is refuse to issue any additional instructions until all threads in the block reach the __syncthreads() point. If some threads never reach a __syncthreads(), as might happen when code branches, the threads that did reach it can end up waiting forever.
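To make the pattern concrete, here is a small self-contained sketch that uses shared memory, which is where the barrier actually matters. It reverses the first n elements of an array with a single block that has more threads than elements. The names reverse_block, reverse_on_gpu and BLOCK are made up for this example, and error checking is left out to keep it short:

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK 256   // hypothetical fixed block size for this sketch

// Reverses the first n elements of in into out using one block.
__global__ void reverse_block(const int *in, int *out, int n)
{
    __shared__ int tile[BLOCK];
    int i = threadIdx.x;

    // Out-of-range threads skip the work but do NOT return.
    if (i < n)
        tile[i] = in[i];

    // Reached by every thread in the block, in range or not.
    __syncthreads();

    if (i < n)
        out[i] = tile[n - 1 - i];
}

// Hypothetical host helper: runs the kernel on one block and returns the result.
std::vector<int> reverse_on_gpu(const std::vector<int> &host_in)
{
    int n = static_cast<int>(host_in.size());
    assert(n > 0 && n <= BLOCK);

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    reverse_block<<<1, BLOCK>>>(d_in, d_out, n);

    std::vector<int> host_out(n);
    // This blocking cudaMemcpy is ordered after the kernel on the default stream.
    cudaMemcpy(host_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return host_out;
}
```

Calling reverse_on_gpu with five elements launches 256 threads, of which 251 are out of range. Because they fall through the guards instead of returning early, all 256 arrive at the barrier.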