atomicAdd not behaving as expected, atomicAdd_system not defined

Hello,

I have many blocks working on a small array of data. Each block computes its results in local memory and then writes them to the small global array using atomicAdd, to avoid race conditions. Every time I run my code, I get different outputs. So I tested the atomicAdd function with a small kernel:

__global__ void test(float *arr) {
    atomicAdd(&arr[0], threadIdx.x);
}

test<<<1000, 1024>>>(x_train);

So I start 1000 blocks with 1024 threads each. There is always the same number of threads, with IDs from 0 to 1023. Shouldn't the output always be the same?

If I try this:

__global__ void test(float *arr) {
    atomicAdd(&arr[0], 1);
}

The output is 1024000, which is correct. What is wrong with the first version?

Second, I've tried using __atomicAdd or atomicAdd_system exactly as in the documentation, but I receive: error: identifier "atomicAdd_system" is undefined. I'm using a Quadro RTX 6000.

Thanks for any help!

You’re exceeding what can be represented in a float variable with full accuracy.

The sum of threadIdx.x over a single block is 1023 × 1024 / 2 = 523776. Over 1000 such blocks, that is 523776000. But a float has only 24 significand bits (23 stored plus one implicit), so it can represent every integer exactly only up to 2^24 = 16777216. Once the running sum passes about 16 million, individual additions may be rounded, and the total is no longer guaranteed to be exact. 523776000 is larger than 16 million, whereas 1024000 is smaller.

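Try switching your arr variable to type double. A double has 53 significand bits, so every partial sum here is exactly representable. Note that atomicAdd on double is available only on devices of compute capability 6.0 and higher, which includes your GPU.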
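As a rough sketch of that suggestion, here is the test kernel from the question accumulating into a double instead of a float. The kernel name sum_thread_ids and the host helper run_double_sum are made up for illustration, and error handling is kept minimal.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same reduction as in the question, but accumulating into a double.
// atomicAdd on double needs compute capability 6.0+, so build with
// something like: nvcc -arch=sm_75 ...
__global__ void sum_thread_ids(double *acc) {
    atomicAdd(acc, static_cast<double>(threadIdx.x));
}

// Hypothetical helper: launches 1000 blocks of 1024 threads and returns
// the accumulated sum, or -1.0 if any CUDA call fails.
double run_double_sum() {
    double *d_acc = nullptr;
    double h_acc = -1.0;
    if (cudaMalloc(&d_acc, sizeof(double)) != cudaSuccess) return -1.0;
    // All-zero bytes are the bit pattern of 0.0 for an IEEE double.
    bool ok = cudaMemset(d_acc, 0, sizeof(double)) == cudaSuccess;
    if (ok) {
        sum_thread_ids<<<1000, 1024>>>(d_acc);
        // cudaMemcpy waits for the kernel on the default stream to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(&h_acc, d_acc, sizeof(double),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_acc);
    return ok ? h_acc : -1.0;
}

int main() {
    printf("%.0f\n", run_double_sum());
    return 0;
}
```

Every addend is a non-negative integer and every partial sum stays far below 2^53, so each addition is exact and the order in which threads arrive should not matter. When the CUDA calls succeed, this should print 523776000 on every run.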

Besides being "incorrect", the result varies from run to run because CUDA specifies no order of thread execution, and floating-point addition is not associative. Once an addition is limited by the resolution of the float significand, each intermediate result gets rounded, so the order in which the atomics happen to execute determines exactly which "incorrect" result you get.
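To make the ordering effect concrete, here is a small sketch that adds the same three float values in two different orders on the device. The kernel two_orders and the helper run_two_orders are hypothetical names, and the values are chosen to sit right at the 2^24 boundary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Adds big + small + small in two different groupings.
// The inputs are kernel arguments so the sums are computed at run time.
__global__ void two_orders(float big, float small, float *out) {
    out[0] = (big + small) + small;  // small values added one at a time
    out[1] = big + (small + small);  // small values combined first
}

// Hypothetical helper: fills results[0..1]; returns false on a CUDA error.
bool run_two_orders(float *results) {
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, 2 * sizeof(float)) != cudaSuccess) return false;
    two_orders<<<1, 1>>>(16777216.0f, 1.0f, d_out);  // big = 2^24
    bool ok = cudaGetLastError() == cudaSuccess &&
              cudaMemcpy(results, d_out, 2 * sizeof(float),
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);
    return ok;
}

int main() {
    float r[2] = {0.0f, 0.0f};
    if (!run_two_orders(r)) return 1;
    printf("%.0f %.0f\n", r[0], r[1]);  // prints "16777216 16777218"
    return 0;
}
```

In the first grouping each 1.0f is rounded away, because 16777217 is not representable as a float. In the second grouping the two small values combine into 2.0f first, which is large enough to register. With a million atomics arriving in an unspecified order, the same effect shows up as run-to-run variation.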

Second, I've tried using __atomicAdd or atomicAdd_system exactly as in the documentation, but I receive: error: identifier "atomicAdd_system" is undefined. I'm using a Quadro RTX 6000.

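Compile for the architecture that matches your GPU. For the Quadro RTX 6000 that would be something like -arch=sm_75. An atomic operation is only available when you compile for an architecture that supports it. The _system-scoped atomics require compute capability 6.0 or higher, so atomicAdd_system is undefined if nvcc is targeting an older architecture.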
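For reference, a minimal sketch of atomicAdd_system on a managed integer counter might look like the following. The file name in the build comment, the kernel count_threads, and the helper run_system_atomic are placeholders, not anything from your project.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Build with an explicit target, for example:
//   nvcc -arch=sm_75 atomic_system.cu -o atomic_system
// (atomic_system.cu is a placeholder file name.)
__global__ void count_threads(int *counter) {
    // System-scope atomic; requires compute capability 6.0 or higher.
    atomicAdd_system(counter, 1);
}

// Hypothetical helper: returns the final counter value, or -1 on error.
int run_system_atomic() {
    int *counter = nullptr;
    if (cudaMallocManaged(&counter, sizeof(int)) != cudaSuccess) return -1;
    *counter = 0;  // managed memory is writable from the host before launch
    count_threads<<<1000, 1024>>>(counter);
    int result = -1;
    // Synchronize before the host reads the managed counter again.
    if (cudaGetLastError() == cudaSuccess &&
        cudaDeviceSynchronize() == cudaSuccess) {
        result = *counter;
    }
    cudaFree(counter);
    return result;
}

int main() {
    printf("%d\n", run_system_atomic());
    return 0;
}
```

Because the counter is an integer, the result does not depend on execution order: with 1000 blocks of 1024 threads it should be 1024000. If the same file is built for a target below compute capability 6.0, the "identifier is undefined" error from the question is what I would expect to see.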


This makes sense, I’m stupid.
Thank you very much for your help!
