Getting different results on every run with atomicAdd()

Hi,

We used the CUDA function atomicAdd() to implement the dense matrix-vector multiplication Y(N) = A(N*M) * X(M):

    int n = threadIdx.x + blockDim.x * blockIdx.x;  // row index into Y
    int m = threadIdx.y + blockDim.y * blockIdx.y;  // column index into X
    float tmp = 0.0f;
    if (n < N && m < M)
    {
        // Each thread computes one product; all threads with the same n
        // accumulate into the same Y[n] concurrently.
        tmp = X[m] * A[n * M + m];
        atomicAdd(&Y[n], tmp);
    }
    __syncthreads();  // has no effect here; the kernel ends right after

The kernel is called from

    dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE);
    dim3 blockNum(N / BLOCK_SIZE + 1, M / BLOCK_SIZE + 1);
    kernel<<<blockNum, blockSize>>>(X, Y, A, M, N);
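
For completeness, a minimal host-side sketch of this setup (assuming the names above, which may differ from our actual code): Y must be zeroed before each launch, since atomicAdd() accumulates into it, and ceiling division sizes the grid exactly:

    cudaMemset(Y, 0, N * sizeof(float));              // atomicAdd accumulates into Y,
                                                      // so it must start at zero
    dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE);
    dim3 blockNum((N + BLOCK_SIZE - 1) / BLOCK_SIZE,  // ceiling division covers N, M
                  (M + BLOCK_SIZE - 1) / BLOCK_SIZE); // without an always-extra block
    kernel<<<blockNum, blockSize>>>(X, Y, A, M, N);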

We get different results on every run with this kernel.

Naturally, the run-to-run variation could come from our own code, e.g., accumulated floating-point rounding error. However, the issue disappears if the kernel is replaced with cuBLAS, or with this other implementation:

    int n = threadIdx.x + blockIdx.x * blockDim.x;  // one thread per row of A
    float tmp = 0.f;
    if (n < N) {
        // Each thread accumulates its entire row serially, in a fixed order,
        // so no atomics are needed.
        for (int m = 0; m < M; m++)
            tmp += X[m] * A[n * M + m];
        Y[n] += tmp;
    }
    __syncthreads();  // again has no effect at the end of the kernel

we always get exactly the same results across runs.

Did we do anything wrong with atomicAdd()? Any comments on using atomicAdd()?

Thanks. /Jing

The order of operations in the atomicAdd example:

  1. Is not guaranteed to match the order in the non-atomic example.
  2. Is not guaranteed to be the same from run to run.

Therefore, it is possible for the atomic example to vary run-to-run. When multiple threads attempt an atomic operation on the same location, the order in which those operations are processed is undefined.
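
To make this concrete, here is a minimal, self-contained sketch (hypothetical code, not from your post): many threads atomicAdd() values of very different magnitudes into a single float. Running the program twice can print different totals, purely because the summation order is undefined:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void atomicSum(const float *vals, float *out, int n)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n)
            atomicAdd(out, vals[i]);  // the order of these adds is undefined
    }

    int main()
    {
        const int n = 1 << 20;
        float *vals, *out;
        cudaMallocManaged(&vals, n * sizeof(float));
        cudaMallocManaged(&out, sizeof(float));
        for (int i = 0; i < n; i++)             // mix tiny and huge values so the
            vals[i] = (i % 2) ? 1.0e8f : 1.0f;  // rounding depends on the add order
        *out = 0.0f;
        atomicSum<<<(n + 255) / 256, 256>>>(vals, out, n);
        cudaDeviceSynchronize();
        printf("sum = %.1f\n", *out);           // may differ from run to run
        cudaFree(vals);
        cudaFree(out);
        return 0;
    }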

The reason ordering can matter is that floating-point addition is not associative: in finite precision, (a + b) + c is not guaranteed to equal a + (b + c), so a different summation order can yield a slightly different result.
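
For example, with 32-bit floats:

    #include <cstdio>

    int main()
    {
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        // In the second expression the 1.0f is absorbed when added to -1.0e8f,
        // because a 32-bit float cannot represent -99999999 exactly.
        printf("(a + b) + c = %f\n", (a + b) + c);  // prints 1.000000
        printf("a + (b + c) = %f\n", a + (b + c));  // prints 0.000000
        return 0;
    }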

Thanks for the clarification. The information is very useful for our debugging. /Jing