Getting different results on every run with atomicAdd()

Hi,

We used the CUDA function atomicAdd() to implement the dense matrix-vector multiplication Y(N) = A(N*M) * X(M):

    int n = threadIdx.x + blockDim.x * blockIdx.x;  // row index into Y
    int m = threadIdx.y + blockDim.y * blockIdx.y;  // column index into X
    float tmp = 0.0f;
    if (n < N && m < M)
    {
        // Each thread computes one product; all threads with the same n
        // accumulate into the same Y[n] concurrently.
        tmp = X[m] * A[n * M + m];
        atomicAdd(&Y[n], tmp);
    }
    __syncthreads();  // has no effect here; the kernel ends right after

The kernel is called from

    dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE);
    dim3 blockNum(N / BLOCK_SIZE + 1, M / BLOCK_SIZE + 1);
    kernel<<<blockNum, blockSize>>>(X, Y, A, M, N);
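
For completeness, a minimal host-side sketch of this setup (assuming the names above, which may differ from our actual code): Y must be zeroed before each launch, since atomicAdd() accumulates into it, and ceiling division sizes the grid exactly:

    cudaMemset(Y, 0, N * sizeof(float));              // atomicAdd accumulates into Y,
                                                      // so it must start at zero
    dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE);
    dim3 blockNum((N + BLOCK_SIZE - 1) / BLOCK_SIZE,  // ceiling division covers N, M
                  (M + BLOCK_SIZE - 1) / BLOCK_SIZE); // without an always-extra block
    kernel<<<blockNum, blockSize>>>(X, Y, A, M, N);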

We get different results on every run with this kernel.

Naturally, the run-to-run variation could come from our own code, e.g., accumulated floating-point rounding error. However, the issue disappears if the kernel is replaced with cuBLAS, or with this other implementation:

    int n = threadIdx.x + blockIdx.x * blockDim.x;  // one thread per row of A
    float tmp = 0.f;
    if (n < N) {
        // Each thread accumulates its entire row serially, in a fixed order,
        // so no atomics are needed.
        for (int m = 0; m < M; m++)
            tmp += X[m] * A[n * M + m];
        Y[n] += tmp;
    }
    __syncthreads();  // again has no effect at the end of the kernel

we always get exactly the same results across runs.

Did we do anything wrong with atomicAdd()? Any comments on using atomicAdd()?

Thanks. /Jing

The order of operations in the atomicAdd example:

  1. Is not guaranteed to match the order in the non-atomic example.
  2. Is not guaranteed to be the same from run to run.

Therefore, it is possible for the atomic example to vary run-to-run. When multiple threads attempt an atomic operation on the same location, the order in which those operations are processed is undefined.
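
To make this concrete, here is a minimal, self-contained sketch (hypothetical code, not from your post): many threads atomicAdd() values of very different magnitudes into a single float. Running the program twice can print different totals, purely because the summation order is undefined:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void atomicSum(const float *vals, float *out, int n)
    {
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n)
            atomicAdd(out, vals[i]);  // the order of these adds is undefined
    }

    int main()
    {
        const int n = 1 << 20;
        float *vals, *out;
        cudaMallocManaged(&vals, n * sizeof(float));
        cudaMallocManaged(&out, sizeof(float));
        for (int i = 0; i < n; i++)             // mix tiny and huge values so the
            vals[i] = (i % 2) ? 1.0e8f : 1.0f;  // rounding depends on the add order
        *out = 0.0f;
        atomicSum<<<(n + 255) / 256, 256>>>(vals, out, n);
        cudaDeviceSynchronize();
        printf("sum = %.1f\n", *out);           // may differ from run to run
        cudaFree(vals);
        cudaFree(out);
        return 0;
    }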

The reason ordering can matter is that floating-point addition is not associative: in finite precision, (a + b) + c is not guaranteed to equal a + (b + c), so a different summation order can yield a slightly different result.
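
For example, with 32-bit floats:

    #include <cstdio>

    int main()
    {
        float a = 1.0e8f, b = -1.0e8f, c = 1.0f;
        // In the second expression the 1.0f is absorbed when added to -1.0e8f,
        // because a 32-bit float cannot represent -99999999 exactly.
        printf("(a + b) + c = %f\n", (a + b) + c);  // prints 1.000000
        printf("a + (b + c) = %f\n", a + (b + c));  // prints 0.000000
        return 0;
    }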

Thanks for the clarification. The information is very useful for our debugging. /Jing