Complex addition as an atomic operation

Hi,
I need to add two complex double numbers atomically (all my threads compute a complex double value and add it to the same memory location). Unfortunately, it does not appear that CUDA provides an atomic routine for this (I am running on a Tesla C1060, but I don't think even Fermi has anything for it). Any suggestions on how I could get this done? Is there a way I could manually make an operation atomic?

Thank you
akjha

Just do atomic additions of real and imaginary parts separately.

Or store the result from each thread (or each block, after a reduction in shared memory) into an array and launch a separate reduction kernel. This has the added advantage of producing the same results on repeated invocations, which might help debugging.
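A minimal sketch of the first suggestion: keep the accumulator as two separate doubles and update each part with its own atomic add. The pair is not updated atomically as a whole, but since each component's sum is independent, the final result is still correct. The variable names here are just for illustration, and the code assumes an atomicAdd overload for double is available (native only on newer GPUs, or via the software version discussed further down in this thread).

```cuda
// Accumulator stored as two separate doubles so each component
// can be updated with its own atomic add.
__device__ double d_sum_re;
__device__ double d_sum_im;

__device__ void accumulateComplex(double re, double im)
{
    // Assumes atomicAdd for double exists (hardware support on newer
    // architectures, or a software replacement on older ones).
    atomicAdd(&d_sum_re, re);
    atomicAdd(&d_sum_im, im);
}
```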


Isn’t that the problem though? Fermi supports single precision floating point atomic add, but not double precision.


Ah ok. It’s not as fast as single precision atomicAdd(), but it can be done in software.
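For reference, the usual software approach is a compare-and-swap loop over the 64-bit bit pattern of the double, essentially the version given in the CUDA C Programming Guide. atomicCAS on unsigned long long requires compute capability 1.2 or higher, which the C1060 (1.3) has. A sketch:

```cuda
// Software double-precision atomic add built on a 64-bit compare-and-swap.
__device__ double atomicAddDouble(double* address, double val)
{
    unsigned long long int* address_as_ull = (unsigned long long int*)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        // Reinterpret the bits as a double, add, and try to swap the result in.
        // If another thread changed the value in the meantime, retry.
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}
```

Under heavy contention the loop may retry many times, which is why it is noticeably slower than a hardware atomicAdd().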


Thank you for the replies. I tried searching for how to do it in software, and am citing the link below in case anyone else needs to do it.

http://code.google.com/p/cusp-library/sour…10b7846e8540e7e

However, it was not very useful in my case, since each thread took slightly longer to run. Because I launch this set of threads many times, the overall program execution time increased considerably.


Oh, I didn’t pay enough attention to the fact that all threads add to the same memory location. In that case, you should definitely do a per-block reduction in shared memory before adding the block’s result to the global variable in order to reduce contention.
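A rough sketch of that idea, assuming a power-of-two block size and the atomicAddDouble helper from the previous post (both of which are my assumptions, not part of your code): each block reduces its threads' complex values in shared memory and issues only one atomic update per component per block.

```cuda
#define BLOCK_SIZE 256  // assumed power-of-two block size

__global__ void accumulateKernel(const double* re_in, const double* im_in,
                                 double* sum_re, double* sum_im, int n)
{
    __shared__ double s_re[BLOCK_SIZE];
    __shared__ double s_im[BLOCK_SIZE];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Each thread loads (or, in your case, computes) its complex value.
    s_re[tid] = (i < n) ? re_in[i] : 0.0;
    s_im[tid] = (i < n) ? im_in[i] : 0.0;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            s_re[tid] += s_re[tid + s];
            s_im[tid] += s_im[tid + s];
        }
        __syncthreads();
    }

    // One atomic update per block instead of one per thread,
    // using the software atomicAddDouble from the previous post.
    if (tid == 0) {
        atomicAddDouble(sum_re, s_re[0]);
        atomicAddDouble(sum_im, s_im[0]);
    }
}
```

With one atomic per block rather than one per thread, the retry loop in the software atomic should hit far less contention.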
