Hi,

We used the CUDA function **atomicAdd()** to implement a dense matrix-vector multiplication Y(N) = A(N*M) * X(M):

```
int n = threadIdx.x + blockDim.x * blockIdx.x;  // row index
int m = threadIdx.y + blockDim.y * blockIdx.y;  // column index
float tmp = 0.0f;
if (n < N && m < M)
{
    tmp = X[m] * A[n * M + m];
    atomicAdd(&Y[n], tmp);  // M threads accumulate into the same Y[n]
}
__syncthreads();
```

The kernel is launched as:

```
dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE);
dim3 blockNum(N / BLOCK_SIZE + 1, M / BLOCK_SIZE + 1);  // covers all (n, m); out-of-range threads filtered by the bounds check
kernel<<<blockNum, blockSize>>>(X, Y, A, M, N);
```

With this kernel we get different results on every run.

Naturally the run-to-run variation could come from our own code, e.g. accumulated floating-point rounding error. However, the issue disappears if we replace the kernel with **cublas**, or with this other implementation:

```
int n = threadIdx.x + blockIdx.x * blockDim.x;  // one thread per row
float tmp = 0.0f;
if (n < N)
{
    for (int m = 0; m < M; m++)
        tmp += X[m] * A[n * M + m];  // fixed summation order per row
    Y[n] += tmp;
}
__syncthreads();
```

With this version we always get exactly the same results on every run.

Did we do anything wrong with atomicAdd()? Any advice on using atomicAdd() correctly?

Thanks. /Jing