Hi,
We used the CUDA function atomicAdd() to implement a dense matrix-vector multiplication Y(N) = A(N*M) * X(M):
int n = threadIdx.x + blockDim.x * blockIdx.x;  // row index into Y
int m = threadIdx.y + blockDim.y * blockIdx.y;  // column index into X
float tmp = 0.0f;
if (n < N && m < M)
{
    tmp = X[m] * A[n * M + m];
    atomicAdd(&Y[n], tmp);  // each thread adds one product into its row's sum
}
__syncthreads();  // not actually needed; the kernel ends right after this
The kernel is launched with
dim3 blockSize(BLOCK_SIZE, BLOCK_SIZE);
dim3 blockNum(N / BLOCK_SIZE + 1, M / BLOCK_SIZE + 1);  // launches one extra block when N or M divides evenly; harmless because of the bounds check
kernel<<<blockNum, blockSize>>>(X, Y, A, M, N);
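One assumption worth stating explicitly (it is not shown in the snippets): since the kernel accumulates into Y with atomicAdd, Y has to start at zero on every run, roughly like

```cuda
// Assumption: Y is a device pointer to N floats. Clear it before each
// launch, otherwise the atomics accumulate onto the previous result.
cudaMemset(Y, 0, N * sizeof(float));
```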
We get different results for every run with this kernel.
Naturally, the randomness could be related to our own code, for example accumulated floating-point rounding error. However, the issue does not occur if the kernel is replaced with cuBLAS, or with this other implementation:
int n = threadIdx.x + blockIdx.x * blockDim.x;
float tmp = 0.0f;
if (n < N)
{
    for (int m = 0; m < M; m++)   // each thread accumulates its entire row in a fixed order
        tmp += X[m] * A[n * M + m];
    Y[n] += tmp;                  // one ordinary write per row; no atomics needed
}
__syncthreads();  // likewise unnecessary at the end of the kernel
With it we always get exactly the same results for all runs.
Did we do anything wrong with atomicAdd()? Any comments on using atomicAdd()?
Thanks. /Jing