I have 2 questions:

  1. How do I use atomicAdd in my kernel? Do I need to include something? atomicAdd seems not to be defined.

  2. Do you think that using atomicAdd can speed up this code:

for (i = 0; i < N; i += M) {



transformed into

for (i = 0; i < N; atomicAdd(&i, M)) {





Atomics are only supported on compute 1.1 devices and later. You need to add “-arch sm_11” to the command line, as described in the documentation.
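As a minimal sketch (the kernel and variable names here are my own, not from the original post): atomicAdd is a built-in intrinsic, so nothing extra needs to be included; the kernel just has to be compiled for sm_11 or later.

```cuda
// Hypothetical example: every thread adds 1 to a counter in global memory.
// atomicAdd is a compiler intrinsic, so no extra header is required.
// Compile with: nvcc -arch sm_11 example.cu
__global__ void countThreads(int *counter)
{
    atomicAdd(counter, 1);   // read-modify-write with no lost updates
}
```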

To answer your question: no, atomicAdd is not likely to speed up that code. The “i” variable is stored in a per-thread register, so its performance is already optimal.

  1. You need to add a compiler option to enable compilation for sm_11 devices. I don’t remember the exact syntax: check the compiler help.

  2. Using atomicAdd in that situation will almost certainly slow your code down by a huge factor: with every thread contending for access to the variable i, execution will essentially be serialized.

I think you are misunderstanding what atomicAdd is for: ‘i’ looks like a local variable, and you can’t use atomicAdd on a local variable.

atomicAdd, like all atomic functions, is used to modify global memory without race conditions. It exists for “protection” (correctness), so don’t expect better performance than the non-atomic equivalents.
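To illustrate the intended use (a sketch with made-up names): atomicAdd is for when many threads must concurrently update a shared location in global memory, e.g. building a histogram.

```cuda
// Sketch of a legitimate atomicAdd use: a byte-value histogram.
// bins[] lives in global memory and is shared by all threads, so the
// concurrent increments must be atomic to avoid lost updates.
__global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&bins[data[idx]], 1u);
}
```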

Why does atomicAdd (or most of the other atomic functions) support ONLY integer types?

Because floating-point addition is not associative: (a+b)+c does not necessarily equal a+(b+c), since each intermediate result is rounded. This means that atomic floating-point operations, even though each one is individually atomic, could give a different final result depending on the order of operations resulting from the parallelism.

Of course you could use fixed point…