Result of atomicAdd in kernel

Hi, I’m trying to use atomicAdd in a kernel; the code is below:

#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__
void test(int* d_a) {
    int laneid = threadIdx.x;
    if(threadIdx.x == 0)
        atomicAdd(d_a, 1);
    printf("bid: %d threadIdx: %d d_a: %d\n", blockIdx.x, laneid, d_a[0]);
}

int main() {

    int* a = (int*)malloc(sizeof(int) * 16);
    for(int i = 0; i < 16; ++i) {
        a[i] = 0;
    }
    int* d_a;
    cudaMalloc((void**)&d_a, sizeof(int) * 16);
    cudaMemcpy(d_a, a, sizeof(int) * 16, cudaMemcpyHostToDevice);

    // test<<<1, 32>>>(d_a);
    for(int i = 0; i < 4; ++i) {
        test<<<1, 32>>>(d_a);
        cudaDeviceSynchronize();
    }
    return 0;
}

I expected the result to be 1, but it seems to depend on the number of loop iterations on the device. When I compile with nvcc -o test test.cu and run compute-sanitizer ./test, the result is 4 if the loop runs 4 times and 2 if it runs 2 times. Here is the output with 4 iterations:

========= COMPUTE-SANITIZER
bid: 0 threadIdx: 0 d_a: 1
bid: 0 threadIdx: 1 d_a: 1
bid: 0 threadIdx: 2 d_a: 1
bid: 0 threadIdx: 3 d_a: 1
bid: 0 threadIdx: 4 d_a: 1
bid: 0 threadIdx: 5 d_a: 1
bid: 0 threadIdx: 6 d_a: 1
bid: 0 threadIdx: 7 d_a: 1
bid: 0 threadIdx: 8 d_a: 1
bid: 0 threadIdx: 9 d_a: 1
bid: 0 threadIdx: 10 d_a: 1
bid: 0 threadIdx: 11 d_a: 1
bid: 0 threadIdx: 12 d_a: 1
bid: 0 threadIdx: 13 d_a: 1
bid: 0 threadIdx: 14 d_a: 1
bid: 0 threadIdx: 15 d_a: 1
bid: 0 threadIdx: 16 d_a: 1
bid: 0 threadIdx: 17 d_a: 1
bid: 0 threadIdx: 18 d_a: 1
bid: 0 threadIdx: 19 d_a: 1
bid: 0 threadIdx: 20 d_a: 1
bid: 0 threadIdx: 21 d_a: 1
bid: 0 threadIdx: 22 d_a: 1
bid: 0 threadIdx: 23 d_a: 1
bid: 0 threadIdx: 24 d_a: 1
bid: 0 threadIdx: 25 d_a: 1
bid: 0 threadIdx: 26 d_a: 1
bid: 0 threadIdx: 27 d_a: 1
bid: 0 threadIdx: 28 d_a: 1
bid: 0 threadIdx: 29 d_a: 1
bid: 0 threadIdx: 30 d_a: 1
bid: 0 threadIdx: 31 d_a: 1
bid: 0 threadIdx: 0 d_a: 2
bid: 0 threadIdx: 1 d_a: 2
bid: 0 threadIdx: 2 d_a: 2
bid: 0 threadIdx: 3 d_a: 2
bid: 0 threadIdx: 4 d_a: 2
bid: 0 threadIdx: 5 d_a: 2
bid: 0 threadIdx: 6 d_a: 2
bid: 0 threadIdx: 7 d_a: 2
bid: 0 threadIdx: 8 d_a: 2
bid: 0 threadIdx: 9 d_a: 2
bid: 0 threadIdx: 10 d_a: 2
bid: 0 threadIdx: 11 d_a: 2
bid: 0 threadIdx: 12 d_a: 2
bid: 0 threadIdx: 13 d_a: 2
bid: 0 threadIdx: 14 d_a: 2
bid: 0 threadIdx: 15 d_a: 2
bid: 0 threadIdx: 16 d_a: 2
bid: 0 threadIdx: 17 d_a: 2
bid: 0 threadIdx: 18 d_a: 2
bid: 0 threadIdx: 19 d_a: 2
bid: 0 threadIdx: 20 d_a: 2
bid: 0 threadIdx: 21 d_a: 2
bid: 0 threadIdx: 22 d_a: 2
bid: 0 threadIdx: 23 d_a: 2
bid: 0 threadIdx: 24 d_a: 2
bid: 0 threadIdx: 25 d_a: 2
bid: 0 threadIdx: 26 d_a: 2
bid: 0 threadIdx: 27 d_a: 2
bid: 0 threadIdx: 28 d_a: 2
bid: 0 threadIdx: 29 d_a: 2
bid: 0 threadIdx: 30 d_a: 2
bid: 0 threadIdx: 31 d_a: 2
bid: 0 threadIdx: 0 d_a: 3
bid: 0 threadIdx: 1 d_a: 3
bid: 0 threadIdx: 2 d_a: 3
bid: 0 threadIdx: 3 d_a: 3
bid: 0 threadIdx: 4 d_a: 3
bid: 0 threadIdx: 5 d_a: 3
bid: 0 threadIdx: 6 d_a: 3
bid: 0 threadIdx: 7 d_a: 3
bid: 0 threadIdx: 8 d_a: 3
bid: 0 threadIdx: 9 d_a: 3
bid: 0 threadIdx: 10 d_a: 3
bid: 0 threadIdx: 11 d_a: 3
bid: 0 threadIdx: 12 d_a: 3
bid: 0 threadIdx: 13 d_a: 3
bid: 0 threadIdx: 14 d_a: 3
bid: 0 threadIdx: 15 d_a: 3
bid: 0 threadIdx: 16 d_a: 3
bid: 0 threadIdx: 17 d_a: 3
bid: 0 threadIdx: 18 d_a: 3
bid: 0 threadIdx: 19 d_a: 3
bid: 0 threadIdx: 20 d_a: 3
bid: 0 threadIdx: 21 d_a: 3
bid: 0 threadIdx: 22 d_a: 3
bid: 0 threadIdx: 23 d_a: 3
bid: 0 threadIdx: 24 d_a: 3
bid: 0 threadIdx: 25 d_a: 3
bid: 0 threadIdx: 26 d_a: 3
bid: 0 threadIdx: 27 d_a: 3
bid: 0 threadIdx: 28 d_a: 3
bid: 0 threadIdx: 29 d_a: 3
bid: 0 threadIdx: 30 d_a: 3
bid: 0 threadIdx: 31 d_a: 3
bid: 0 threadIdx: 0 d_a: 4
bid: 0 threadIdx: 1 d_a: 4
bid: 0 threadIdx: 2 d_a: 4
bid: 0 threadIdx: 3 d_a: 4
bid: 0 threadIdx: 4 d_a: 4
bid: 0 threadIdx: 5 d_a: 4
bid: 0 threadIdx: 6 d_a: 4
bid: 0 threadIdx: 7 d_a: 4
bid: 0 threadIdx: 8 d_a: 4
bid: 0 threadIdx: 9 d_a: 4
bid: 0 threadIdx: 10 d_a: 4
bid: 0 threadIdx: 11 d_a: 4
bid: 0 threadIdx: 12 d_a: 4
bid: 0 threadIdx: 13 d_a: 4
bid: 0 threadIdx: 14 d_a: 4
bid: 0 threadIdx: 15 d_a: 4
bid: 0 threadIdx: 16 d_a: 4
bid: 0 threadIdx: 17 d_a: 4
bid: 0 threadIdx: 18 d_a: 4
bid: 0 threadIdx: 19 d_a: 4
bid: 0 threadIdx: 20 d_a: 4
bid: 0 threadIdx: 21 d_a: 4
bid: 0 threadIdx: 22 d_a: 4
bid: 0 threadIdx: 23 d_a: 4
bid: 0 threadIdx: 24 d_a: 4
bid: 0 threadIdx: 25 d_a: 4
bid: 0 threadIdx: 26 d_a: 4
bid: 0 threadIdx: 27 d_a: 4
bid: 0 threadIdx: 28 d_a: 4
bid: 0 threadIdx: 29 d_a: 4
bid: 0 threadIdx: 30 d_a: 4
bid: 0 threadIdx: 31 d_a: 4
========= ERROR SUMMARY: 0 errors

Could anyone help answer this question?

If you initialize a counter with 0, and add 1 N times, what result do you expect other than N?

Hi, I just run the kernel 4 times, and each time I pass d_a, which was initialized to 0, to the kernel. So I thought the result would not be affected by the number of loop iterations? Is there some special mechanism in CUDA, such as d_a being modified by each iteration, so that the next kernel launch receives the modified data?

No, it is not initialized each time. It is only initialized once. If you initialize it each time (at each loop iteration) before running the kernel in that loop iteration, then you will get an ending value of 1.

If you want to initialize it each time (at each loop iteration), then instead of doing this:

// test<<<1, 32>>>(d_a);
for(int i = 0; i < 4; ++i) {
    test<<<1, 32>>>(d_a);
    cudaDeviceSynchronize();
}

do this:

// test<<<1, 32>>>(d_a);
for(int i = 0; i < 4; ++i) {
    cudaMemcpy(d_a, a, sizeof(int) * 16, cudaMemcpyHostToDevice);
    test<<<1, 32>>>(d_a);
    cudaDeviceSynchronize();
}
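
Another option, just as a sketch: since the counter only needs to be zeroed, you could reset the device buffer directly with cudaMemset instead of copying from the host each iteration:

for(int i = 0; i < 4; ++i) {
    cudaMemset(d_a, 0, sizeof(int) * 16);  // zero the counter on the device before each launch
    test<<<1, 32>>>(d_a);
    cudaDeviceSynchronize();
}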

Got it, thanks for your reply~

Hi, BTW, if I want to profile the kernel execution time accurately, I’d like to loop the kernel N times so that the measurement error is smaller, like this:

time.Start();
for(int i = 0; i < N; ++i) {
    // cudaMemcpy(d_a, a, sizeof(int) * 16, cudaMemcpyHostToDevice);
    test<<<1, 32>>>(d_a);
    cudaDeviceSynchronize();
}
time.Stop();

double exe_time = time.Elapsed() / N;

what can I do to make sure the result in d_a is correct while also keeping the timing accurate?

One way to do it:

You could use a different memory array as input for each loop iteration.
If d_a is not too large, create an array of device pointers.
You can copy and initialize all of them in advance.

int* d_a[4];

And then call:

    test<<<1, 32>>>(d_a[i]);

within the loop body.
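
A minimal sketch of that idea, keeping the 16-int buffer and 4 iterations from the code above (the time object is the same placeholder timer as in your snippet):

int* d_a[4];
for (int i = 0; i < 4; ++i) {
    cudaMalloc((void**)&d_a[i], sizeof(int) * 16);
    cudaMemcpy(d_a[i], a, sizeof(int) * 16, cudaMemcpyHostToDevice);
}

time.Start();
for (int i = 0; i < 4; ++i) {
    test<<<1, 32>>>(d_a[i]);   // each launch gets its own freshly initialized buffer
    cudaDeviceSynchronize();
}
time.Stop();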

Is there any solution for large d_a?

For large d_a the kernel typically runs longer, so you do not necessarily need a loop for timing. Alternatively, you can use a loop to get several separate timings (each without the copy) and average them.
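
One possible sketch of the “several separate timings” idea, using CUDA events; N and the 16-int buffer are taken from the snippets above, and this is just an illustration, not the only way:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

float total_ms = 0.0f;
for (int i = 0; i < N; ++i) {
    // re-initialize the buffer outside the timed region
    cudaMemcpy(d_a, a, sizeof(int) * 16, cudaMemcpyHostToDevice);

    cudaEventRecord(start);
    test<<<1, 32>>>(d_a);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    total_ms += ms;
}
double avg_ms = total_ms / N;   // average kernel time over N launches

cudaEventDestroy(start);
cudaEventDestroy(stop);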

Got it, thanks for your help~
