Race condition within warp

Hi,
I’m working on a simple parallel sorting network in shared memory and ran into a very unexpected problem. There seems to be a race condition occurring due to threads within a warp executing out of order. My program is a simple odd-even sorting network where each thread repeatedly does a compare-and-swap with pairs of elements. The input is 64 integers in shared memory and a single warp sorts them in-place. The sort function itself is below:

__device__ void sort64(int* data) {
  for(int i=0; i<32; i++) {
    swap(&data[2*threadIdx.x],&data[(2*threadIdx.x)+1]);
    swap(&data[(2*threadIdx.x)+1],&data[(2*threadIdx.x)+2]);
  }
}

Note that no threads attempt to access the same element at once. This should sort the list, but it does not give the correct result. However, putting __syncthreads() after each call to swap produces the correctly sorted result. Based on my understanding of warps and SIMD, this should not happen: __syncthreads() is irrelevant when running a single warp since all threads run in lockstep. Running cuda-memcheck --tool racecheck tells me there are a bunch of races between writes and reads. However, this is not possible in the above program if threads are running in lockstep.

I have tried this program on 3 different GPUs (2 Kepler and 1 Maxwell) with the same result. Below is my complete test program:

#include<stdio.h>

__device__ void print_list(int* data){
  if(threadIdx.x==0) {
    for(int i=0; i<64; i++) {
      printf("%d ", data[i]);
    }
    printf("\n\n");
  }
}

__device__ void swap(int* a, int* b) {
  int temp;
  if(*a > *b) {
    temp = *a;
    *a = *b;
    *b = temp;
  }
}

__device__ void sort32(int* data) {
  for(int i=0; i<16; i++) {
    swap(&data[2*threadIdx.x],&data[(2*threadIdx.x)+1]);
//    __syncthreads();
    swap(&data[(2*threadIdx.x)+1],&data[(2*threadIdx.x)+2]);
//    __syncthreads();
// These syncthreads cause the program to produce the correct result
  }
}

__global__ void testKernel(void) {
  __shared__ int data[65];

// Generate some un-sorted data...
  data[2*threadIdx.x] = threadIdx.x+20;
  data[(2*threadIdx.x)+1] = 100-threadIdx.x;
  data[64]=99999; // easier than dealing with edge case

  sort32(data);

  print_list(data); // Should be sorted at this point.
}

int main(void) {
  testKernel<<<1,32>>>(); // Just 1 warp!
  printf("%d\n", cudaDeviceSynchronize());
  return 0;
}

Any insight into what may be causing this would be greatly appreciated!

Thanks.
Ben

__syncthreads does 2 things (read the documentation on it!)

  1. execution barrier
  2. memory fence

Unless you mark a shared memory location as volatile, the compiler is free to optimize any shared memory location into a register, eliminating all reads and/or writes to that location, regardless of which threads may be doing these operations. (read the documentation about volatile also…) This “optimization” is (partially) breaking your code, and would even break a regular warp-synchronous reduction. Take a look at the slide 22 here:

http://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

Note that that is a warp-synchronous section of code. It would break without volatile for the same reason your code is (partially) breaking without volatile.

OTOH, __syncthreads() has a memory barrier effect, forcing such register-held values back into their “rightful” locations in shared memory.

Now regarding this:

Not for me it doesn’t. Tested with CUDA 7.5 on both Kepler and Fermi.

However, if you mark the shared memory declaration as volatile, and then go through and make the necessary code changes to accommodate this, I get the same result using application of volatile (without uncommenting __syncthreads()) as I do with only uncommenting __syncthreads().

So I believe there are multiple defects in this code.

Here’s my test case with __syncthreads() uncommented:

$ cat t1239.cu
#include<stdio.h>

__device__ void print_list(int* data){
  if(threadIdx.x==0) {
    for(int i=0; i<64; i++) {
      printf("%d ", data[i]);
    }
    printf("\n\n");
  }
}

__device__ void swap(int* a, int* b) {
  int temp;
  if(*a > *b) {
    temp = *a;
    *a = *b;
    *b = temp;
  }
}

__device__ void sort32(int* data) {
  for(int i=0; i<16; i++) {
    swap(&data[2*threadIdx.x],&data[(2*threadIdx.x)+1]);
    __syncthreads();
    swap(&data[(2*threadIdx.x)+1],&data[(2*threadIdx.x)+2]);
    __syncthreads();
// These syncthreads cause the program to produce the correct result
  }
}

__global__ void testKernel(void) {
  __shared__ int data[65];

// Generate some un-sorted data...
  data[2*threadIdx.x] = threadIdx.x+20;
  data[(2*threadIdx.x)+1] = 100-threadIdx.x;
  data[64]=99999; // easier than dealing with edge case

  sort32(data);

  print_list(data); // Should be sorted at this point.
}

int main(void) {
  testKernel<<<1,32>>>(); // Just 1 warp!
  printf("%d\n", cudaDeviceSynchronize());
  return 0;
}
$ nvcc -o t1239 t1239.cu
$ ./t1239
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 100 69 99 70 98 71 97 72 96 73 95 74 94 75 93 76 92 77 91 78 90 79 89 80 88 81 87 82 86 83 85 84

0
$

Note the final result is not properly sorted.

Here’s my test case with volatile applied, but without uncommenting __syncthreads():

$ cat t1238.cu
#include<stdio.h>

__device__ void print_list(volatile int* data){
  if(threadIdx.x==0) {
    for(int i=0; i<64; i++) {
      printf("%d ", data[i]);
    }
    printf("\n\n");
  }
}

__device__ void swap(volatile int* a, volatile int* b) {
  int temp;
  if(*a > *b) {
    temp = *a;
    *a = *b;
    *b = temp;
  }
}

__device__ void sort32(volatile int* data) {
  for(int i=0; i<16; i++) {
    swap(&data[2*threadIdx.x],&data[(2*threadIdx.x)+1]);
//    __syncthreads();
    swap(&data[(2*threadIdx.x)+1],&data[(2*threadIdx.x)+2]);
//    __syncthreads();
// These syncthreads cause the program to produce the correct result
  }
}

__global__ void testKernel(void) {
  volatile __shared__ int data[65];

// Generate some un-sorted data...
  data[2*threadIdx.x] = threadIdx.x+20;
  data[(2*threadIdx.x)+1] = 100-threadIdx.x;
  data[64]=99999; // easier than dealing with edge case

  sort32(data);

  print_list(data); // Should be sorted at this point.
}

int main(void) {
  testKernel<<<1,32>>>(); // Just 1 warp!
  printf("%d\n", cudaDeviceSynchronize());
  return 0;
}
$ nvcc -o t1238 t1238.cu
$ ./t1238
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 100 69 99 70 98 71 97 72 96 73 95 74 94 75 93 76 92 77 91 78 90 79 89 80 88 81 87 82 86 83 85 84

0
$

The outputs are identical between these two cases, meaning insofar as they have an effect on the output, the effect of volatile is the same as the effect of syncthreads. Therefore the benefit of syncthreads you are witnessing is not the execution barrier, but the avoidance of the unwanted memory->register “optimization”.

In my experience, cuda-memcheck never produces false positives for race conditions. From a cursory look, the code seems to have an inter-thread dependency here:

swap(&data[(2*threadIdx.x)+1],&data[(2*threadIdx.x)+2]);

Keep in mind that the compiler does not know anything about how threads are mapped to data, that is a runtime thing. So it can’t see the inter-thread dependency. The compiler optimizes based on a single-thread view of the world.

What it likely does here is unroll the loop completely, then pull as many loads as possible to the start of the code in order to cover load latencies. Note that swap() is not an atomic operation, so the load and the store parts can be separated by fairly large distances in the final instruction schedule. This can pull loads which you intended to have a dependency on a previous write by another thread ahead of those writes.

In other words: Instructions executing in lockstep are a necessary, but not sufficient, condition to make this warp-synchronous code work, and what you are encountering is a common fallacy about such code. A classical work-around (I personally consider it more of a hack), is to declare the shared data as volatile, which inhibits the movement of loads relative to stores. The “volatile” modifier basically indicates to the compiler that some agent outside of the scope of the code it sees can modify the data; as a result it inhibits “caching” that data in registers and the related early scheduling of loads.

CUDA also has __threadfence_block function:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions

In my experience, any of 3 - volatile, syncthreads, __threadfence_block may be faster, depending on algo. It’s pity that there is no __threadfence_WARP which should just force compiler to flush registers to memory

Thanks for the quick feedback! I didn’t realize that the compiler would be so cavalier to allow threads to cache shared memory in registers to the point where it would alter the program result. Defining the shared memory as volatile is a good solution, thanks!

Ben

I don’t think the use of the word “cavalier” is appropriate. C++ provides an abstract semantic model that describes program behavior. As long as the compiler does not violate the model, that is, the generated code behaves “as if” it is following the abstract model verbatim, it is free to apply any code transformation it wishes.

If code contains and relies on data or control dependencies that are not expressed in the source code, the compiler cannot be held accountable for resulting program failures. “Do what I mean, not what I wrote” is not a workable principle. Just like multi-threaded CPU programming requires synchronization primitives, this applies to CUDA as well. Various ways of taking care of dependencies have been discussed above, with __syncthreads() being the most basic one whose use I recommend to CUDA beginners until they have developed a deeper and more detailed understanding of the details of the CUDA execution model.

When it comes to warp-synchronous programming my philosophy can be summed up in these two rules (which I adapted from a source I can’t recall):

(1) Don’t do it
(2) [Experts only] Don’t do it yet

Why not? If you have full Control of CUDA toolkit version and which arch you are compiling for, I really don’t see what the risks are?

To my understanding, warp-synchronous code will only break down on potential new architectures?

in the future, for those who are considering warp-synchronous for new development, my recommendation would be to check on the status of cooperative groups for CUDA:

https://asc.llnl.gov/DOE-COE-Mtg-2016/talks/1-10_NVIDIA.pdf

not available in CUDA 8, but as near as I can tell, they’re coming.

With respect to the risks, this thread is an example of “the risks”.

i.e. the risk is lack of understanding of the relationship between various aspects of programming that are needed to make it work or may cause it to break. Many folks don’t really even grasp the concept of lockstep execution, and believe that having multiple threads in the same warp contend for independent locks (or the same lock !!!) is a good idea. I maintain that it is not a good idea. It can be done, perhaps “safely”, but requires such an intimate knowledge of what is going on, and extremely careful analysis to ensure that it was done correctly, that it is impractical (IMO) for safe, easy GPU code development. Even when done right, it’s what I call fragile code. The person who comes after you to maintain it may very well break it. warp-synchronous in general might not be as fragile, but it still has some gotchas as indicated in this thread.

As long as you are aware of all the issues, it should be possible to safely use warp-synchronous methods, in a controlled setting.

Thanks for the link! It overlaps with the details in the CUDA 8 blog post. I assume we’ll know more when CUDA 8 is released (since the new features and especially Pascal documentation are not in the current release candidate) I assume that in the linked PDF, when Cyril mentions “Pascal” he means P100 only, not GP102/GP104/GP106; the blog post was more consistent about identifying the changed memory, warp, and scheduling features as being for P100, not the whole Pascal series.

Cooperative groups are in the “beyond CUDA 8” section of that llnl presentation AFAICT (and not in the CUDA 8 blog), so I don’t expect significant additional details about it when CUDA 8 Production release docs go public.

I haven’t researched everything in the llnl presentation with respect to “Pascal” to discover which Pascal GPUs it applies to.