atomicAdd() gives unexpected values for most elements, correct values only for the last ones


I have a problem with the atomicAdd() function. My kernel looks like this:

int value = 0;
int i = blockIdx.x * blockDim.x + threadIdx.x;

if (i < N * M) {
    value = atomicAdd((int *)&old[0], 1);
    C[i] = value;
}


I expect the result to be:

C[0] = 0, C[1] = 1, …, C[i] = i, …, C[N*M-1] = N*M-1.

but the results match this only for some of the last elements of matrix C, such as C[N*M-1], C[N*M-2], …, C[N*M-k]; for C[0], C[1], …, C[N*M-k-1] the value stored in each element is not the expected one. How can I fix this so that my kernel works as expected?

The order of execution of CUDA threads is undefined.

You are adding 1 to a shared counter with an atomic function, expecting all the threads to be invoked in the order 0, 1, 2, …, N*M-1. But this is not the case: the threads are invoked in some other, unspecified order. atomicAdd() guarantees that each thread gets a unique value (the counter's value just before that thread's own increment), but it says nothing about which thread gets which value. This is the reason why you are getting unexpected results.

Maybe if you try this in device-emulation mode you will get the expected result, since the threads are executed in the order 0, 1, 2, …, N*M-1 there…

You might be interested in this thread, which discusses basically the same thing.