Hi,
I’m working on a simple parallel sorting network in shared memory and ran into a very unexpected problem. There seems to be a race condition occurring due to threads within a warp executing out of order. My program is a simple odd-even sorting network where each thread repeatedly does a compare-and-swap with pairs of elements. The input is 64 integers in shared memory and a single warp sorts them in-place. The sort function itself is below:
__device__ void sort64(int* data) {
for(int i=0; i<32; i++) {
swap(&data[2*threadIdx.x],&data[(2*threadIdx.x)+1]);
swap(&data[(2*threadIdx.x)+1],&data[(2*threadIdx.x)+2]);
}
}
Note that no threads attempt to access the same element at once. This should sort the list, but it does not give the correct result. However, putting __syncthreads() after each call to swap produces the correctly sorted result. Based on my understanding of warps and SIMD, this should not happen: __syncthreads() is irrelevant when running a single warp since all threads run in lockstep. Running cuda-memcheck --tool racecheck tells me there are a bunch of races between writes and reads. However, this is not possible in the above program if threads are running in lockstep.
I have tried this program on 3 different GPUs (2 Kepler and 1 Maxwell) with the same result. Below is my complete test program:
#include<stdio.h>
__device__ void print_list(int* data){
if(threadIdx.x==0) {
for(int i=0; i<64; i++) {
printf("%d ", data[i]);
}
printf("\n\n");
}
}
__device__ void swap(int* a, int* b) {
int temp;
if(*a > *b) {
temp = *a;
*a = *b;
*b = temp;
}
}
__device__ void sort32(int* data) {
for(int i=0; i<16; i++) {
swap(&data[2*threadIdx.x],&data[(2*threadIdx.x)+1]);
// __syncthreads();
swap(&data[(2*threadIdx.x)+1],&data[(2*threadIdx.x)+2]);
// __syncthreads();
// These syncthreads cause the program to produce the correct result
}
}
__global__ void testKernel(void) {
__shared__ int data[65];
// Generate some un-sorted data...
data[2*threadIdx.x] = threadIdx.x+20;
data[(2*threadIdx.x)+1] = 100-threadIdx.x;
data[64]=99999; // easier than dealing with edge case
sort32(data);
print_list(data); // Should be sorted at this point.
}
int main(void) {
testKernel<<<1,32>>>(); // Just 1 warp!
printf("%d\n", cudaDeviceSynchronize());
return 0;
}
Any insight into what may be causing this would be greatly appreciated!
Thanks.
Ben