Race Condition CUDA?

Hello All,

I am new to this forum, so I would like to say ‘hi’ and ask for help with what is likely a simple problem, but one that is giving me some trouble. I am writing some CUDA code that operates on an array, but it can return wrong results depending on the size of the problem. I think it could be a race condition but am unsure.

Can anyone look at the sample code and let me know if there is an obvious problem?

Thank you.

#define N 1024
#define TPB 512

__global__ void func(float4 *a, float *res, int n) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < n) {
    for (int j = idx + 1; j < n; ++j) {
      res[idx] += a[idx].w * (a[idx].x - a[j].x);
      res[j]   += a[j].w * (a[j].x - a[idx].x);
    }
  }
}

float4 *A;                     // host input, filled before calling the GPU kernel

float *result = new float[N];  // host output

// dev_A and dev_res are the device copies of A and result
func<<<N/TPB, TPB>>>(dev_A, dev_res, N);

Yes, you have a race condition. Consider the thread for which idx = 0 and the thread for which idx = 32.

Thread 32 can write to res[32] based on the first line of the body of the for-loop.

Thread 0 can write to res[32] based on the second line of the body of the for-loop.

There is no guarantee of the order in which threads execute, and in fact these two threads could be executing at the same time. Both could be trying to update res[32] simultaneously. The result will depend on who “wins” this race, and there is no way to tell or predict who will win.

Hence you may get varying results from run to run.
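To see the lost-update effect in isolation, here is a small self-contained sketch (my own illustrative kernels, not your code) where every thread adds 1.0f to the same location. The plain += is a non-atomic read-modify-write and typically loses most updates; atomicAdd serializes them. This assumes float atomicAdd, which requires compute capability 2.0 or later:

```cuda
#include <cstdio>

// Non-atomic read-modify-write: concurrent threads can read the same
// old value and overwrite each other's additions.
__global__ void racy_add(float *counter) {
  *counter += 1.0f;
}

// Hardware-serialized update: every addition is counted.
__global__ void safe_add(float *counter) {
  atomicAdd(counter, 1.0f);
}

int main() {
  float *d, h;
  cudaMalloc(&d, sizeof(float));

  cudaMemset(d, 0, sizeof(float));
  racy_add<<<32, 256>>>(d);
  cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);
  printf("racy:   %.0f (8192 expected, usually far less)\n", h);

  cudaMemset(d, 0, sizeof(float));
  safe_add<<<32, 256>>>(d);
  cudaMemcpy(&h, d, sizeof(float), cudaMemcpyDeviceToHost);
  printf("atomic: %.0f\n", h);  // prints 8192

  cudaFree(d);
  return 0;
}
```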

It is a race condition.

That’s what I thought.

Thank you.

P.S. Any suggestion as to how to mitigate this with a simple code change or should I just rethink the algorithm itself?

global atomics to the rescue!
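For the archive, a sketch of the kernel with both updates made atomic. Note that even the res[idx] update needs the atomic, because other threads also write res[idx] through their second update line. This assumes float atomicAdd (compute capability 2.0 or later):

```cuda
__global__ void func(float4 *a, float *res, int n) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < n) {
    for (int j = idx + 1; j < n; ++j) {
      // Both res[idx] and res[j] can be written by several threads
      // at once, so both accumulations must be atomic.
      atomicAdd(&res[idx], a[idx].w * (a[idx].x - a[j].x));
      atomicAdd(&res[j],   a[j].w  * (a[j].x  - a[idx].x));
    }
  }
}
```

Atomics serialize contended updates, so this is correct but can be slow under heavy contention; a reduction-style rewrite of the algorithm avoids the contention if performance matters.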

Thanks, global atomics worked.