I have 2 questions:

  1. How do I use atomicAdd in my kernel? Do I need to include something? atomicAdd seems not to be defined.

  2. Do you think that using atomicAdd can speed up this code:

for (i = 0; i < N; i += M) {



transformed into

for (i = 0; i < N; atomicAdd(&i, M)) {





Atomics are only supported on compute 1.1 devices and later. You need to add “-arch sm_11” to the command line, as described in the documentation.
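As a minimal sketch (the kernel and variable names here are my own, not from the original post): atomicAdd is a built-in intrinsic, so nothing extra needs to be included; the kernel just has to be compiled for sm_11 or later.

```cuda
// Hypothetical example: every thread adds 1 to a counter in global memory.
// atomicAdd is a compiler intrinsic, so no extra header is required.
// Compile with: nvcc -arch sm_11 example.cu
__global__ void countThreads(int *counter)
{
    atomicAdd(counter, 1);   // read-modify-write with no lost updates
}
```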

To answer your question: no, atomicAdd is not likely to speed up that code. The “i” variable is stored in a per-thread register, so its performance is already optimal.

  1. You need to add a compiler option to enable compilation for sm_11 devices. I don’t remember the exact syntax: check the compiler help.

  2. Using atomicAdd in that situation will almost certainly slow your code down by a huge factor: with every thread contending for access to the variable i, execution will essentially be serialized.

I think you are misunderstanding what atomicAdd is for: ‘i’ looks like a local variable, and you can’t use atomicAdd on a local variable.

atomicAdd, like all atomic functions, is used to modify global memory without race conditions. It exists for “protection” (correctness), so don’t expect better performance than the non-atomic equivalents.
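To illustrate the intended use (a sketch with made-up names): atomicAdd is for when many threads must concurrently update a shared location in global memory, e.g. building a histogram.

```cuda
// Sketch of a legitimate atomicAdd use: a byte-value histogram.
// bins[] lives in global memory and is shared by all threads, so the
// concurrent increments must be atomic to avoid lost updates.
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&bins[data[idx]], 1u);
}
```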

Why does atomicAdd (or most of the other atomic functions) support ONLY integer types?

Because floating-point addition is not associative: (a+b)+c does not necessarily equal a+(b+c), since each intermediate result is rounded. This means that atomic floating-point operations, even though each one is individually atomic, could give a different final result depending on the order of operations resulting from the parallelism.

Of course you could use fixed point…