I’m trying to implement the classic dot-product kernel for double-precision arrays, with the final sum across the blocks computed atomically. I used the atomicAdd for double precision as given on page 116 of the Programming Guide. Probably I’m doing something wrong. The partial sums across the threads in every block are computed correctly, but afterwards the atomic operation doesn’t seem to work properly, since every time I run my kernel with the same data, I get different results. I’d be grateful if somebody could spot the mistake or suggest an alternative solution! Here is my kernel:
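(The original kernel isn’t reproduced in this post. A sketch of what a kernel of this shape typically looks like — per-block shared-memory reduction, then one atomic add of each block’s partial sum — is below. The names `dot`, `dot_res`, `cuda_atomicAdd` and `THREADS_PER_BLOCK` are illustrative, not the poster’s actual code.)

```cuda
#define THREADS_PER_BLOCK 256  // illustrative block size

// CAS-based double-precision atomic add from the Programming Guide,
// assumed to be defined elsewhere in the file.
__device__ double cuda_atomicAdd(double *address, double val);

// Sketch of the dot-product kernel: each block reduces its partial sum in
// shared memory, then thread 0 atomically adds it to *dot_res.
__global__ void dot(const double *a, const double *b, double *dot_res, int n)
{
    __shared__ double cache[THREADS_PER_BLOCK];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // Grid-stride loop: each thread accumulates its share of the products.
    double sum = 0.0;
    for (int i = tid; i < n; i += blockDim.x * gridDim.x)
        sum += a[i] * b[i];
    cache[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction within the block, down to cache[0].
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    // One atomic add per block.
    if (threadIdx.x == 0)
        cuda_atomicAdd(dot_res, cache[0]);
}
```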

Welcome to the wonderful world of floating-point arithmetic! The rounding error of a sum of more than two floating-point numbers depends on the order of the operations, because floating-point addition is not associative. And if you need atomic operations at all, that order is by definition undefined.

To get reproducible results, extend the reduction scheme all the way to the final result by either a) summing the block results on the CPU, b) launching a second kernel with a single block to sum the block results, or c) having the last finishing block do the sum. Option c) is more complicated than the others, but it is described in appendix B.5 of the Programming Guide.
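For option b), a sketch of the second kernel might look like this, assuming the first kernel has written one partial sum per block into `block_res` (all names here are illustrative, and the launch must use exactly 256 threads to match the shared array):

```cuda
// Second kernel, launched as sum_block_results<<<1, 256>>>(...):
// sums the per-block partial results in a fixed order, so the final
// rounding is the same on every run.
__global__ void sum_block_results(const double *block_res, double *dot_res,
                                  int num_blocks)
{
    __shared__ double cache[256];

    // Each thread accumulates a strided slice of the block results;
    // the order depends only on the (fixed) thread count.
    double sum = 0.0;
    for (int i = threadIdx.x; i < num_blocks; i += blockDim.x)
        sum += block_res[i];
    cache[threadIdx.x] = sum;
    __syncthreads();

    // Standard tree reduction down to cache[0].
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        *dot_res = cache[0];
}
```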

Thanks a lot for your fast reply! So the obvious reason I get unreproducible results is rounding error? But by using double-precision variables, haven’t I minimized the effect of rounding on my results? I mean, what’s the difference between adding the gridDim.x numbers on the CPU and doing it with an atomic operation on the GPU? With double precision, aren’t the additions on these 64-bit variables implemented with the same algorithm?

Floating-point addition gives the same results on the CPU and the GPU provided the order of operations is the same (which it often isn’t, because on the GPU you want to arrange the algorithm for maximal parallelism).

Well, there is still the possibility that something is wrong in your program. Do you clear *dot_res before invoking your kernel?

Yeah! I actually found my mistake… When calling the cuda_atomicAdd function, I was assigning the returned result back to the *dot_res variable, which was wrong and unnecessary, since the value of *dot_res is updated internally! The right code would be:

Kernel Call:
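(The original call isn’t shown in the post; the essential fix is that the kernel calls the atomic function only for its side effect, without writing the return value back. A hedged sketch, with illustrative names matching the kernel above:)

```cuda
// End of the dot-product kernel: thread 0 of each block publishes its
// partial sum. The return value of cuda_atomicAdd (the old value of
// *dot_res) is deliberately ignored -- assigning it back to *dot_res
// was the race that produced the irreproducible results.
if (threadIdx.x == 0)
    cuda_atomicAdd(dot_res, cache[0]);
```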

Atomic function:
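(The function itself isn’t shown in the post; the CAS-based double-precision atomic add from the Programming Guide looks like this. Note that it returns the old value of *address — callers should not write that return value back.)

```cuda
// Double-precision atomic add built on 64-bit atomicCAS, as given in the
// CUDA Programming Guide. Returns the value of *address before the add.
__device__ double cuda_atomicAdd(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // Retry if another thread changed *address in the meantime.
    } while (assumed != old);
    return __longlong_as_double(old);
}
```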

With this minor change, everything works flawlessly! Thanks anyway!