Question regarding device synchronization

Hello everyone,

I have a question about device synchronization. Here is the code I am working on:

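do it = 1, max_iters

    ! ... some kernels that write res_sqr_d ...

    ! Sum res_sqr_d on the device; the CUF kernel treats temp as a reduction variable
    temp = 0.0
    !$cuf kernel do <<< *, * >>>
    do i = 1, points
        temp = temp + res_sqr_d(i)
    end do

    sum_res_sqr = temp

    residue = dsqrt(sum_res_sqr)/points

    if (it .le. 2) then
        ! Store the reference residue during the first two iterations
        res_old = residue
        residue = 0.0d0
    else
        ! Report later residues as a log10 ratio to the reference
        residue = dlog10(residue/res_old)
    end if
    print*, it, residue

end do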

In the main loop (1 to max_iters), some kernels first produce res_sqr_d as output. I then use a reduction to sum all of its elements into temp. Finally, temp is transferred to the host, where I perform a few more operations and print the final residue value (all on the host).

For the first few iterations the values match the serial code, but later iterations produce incorrect results. I suspect this is due to asynchronous execution between the CPU and GPU. I tried calling cudaDeviceSynchronize, but that did not help either. Could you please give me some suggestions?

Thank you


I don’t see anything obvious here.
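Parallel reductions can certainly give you different answers than serial reductions, because the elements are added in a different order and floating-point addition is not associative. Could your differences be due to floating-point cancellation or round-off error?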
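
To see how much summation order alone can matter, here is a minimal host-only sketch in plain Fortran (no GPU needed). It is not your solver: the array size n, the fill values, and the helper tree_sum are all hypothetical, chosen only for illustration. It sums the same array once left to right, as a serial loop would, and once pairwise, which is roughly the order a tree-style parallel reduction uses.

```fortran
program reduction_order
    implicit none
    integer, parameter :: dp = kind(1.0d0)
    integer, parameter :: n = 1048576        ! hypothetical array size (2**20)
    real(dp), allocatable :: res_sqr(:)
    real(dp) :: serial_sum, pairwise_sum
    integer :: i

    allocate(res_sqr(n))

    ! Hypothetical data spanning many orders of magnitude
    do i = 1, n
        res_sqr(i) = 1.0d0 / real(i, dp)**2
    end do

    ! Serial, left-to-right accumulation (what the CPU loop does)
    serial_sum = 0.0d0
    do i = 1, n
        serial_sum = serial_sum + res_sqr(i)
    end do

    ! Pairwise accumulation (similar in spirit to a parallel tree reduction)
    pairwise_sum = tree_sum(res_sqr)

    print *, 'serial   :', serial_sum
    print *, 'pairwise :', pairwise_sum
    print *, 'rel diff :', abs(serial_sum - pairwise_sum) / abs(serial_sum)

contains

    ! Recursively split the array in half and add the two partial sums
    recursive function tree_sum(a) result(s)
        real(dp), intent(in) :: a(:)
        real(dp) :: s
        integer :: half

        if (size(a) == 1) then
            s = a(1)
        else
            half = size(a) / 2
            s = tree_sum(a(:half)) + tree_sum(a(half+1:))
        end if
    end function tree_sum

end program reduction_order
```

If the relative difference printed here is of the same order as the mismatch between your GPU and serial runs, round-off from the reduction order is a plausible explanation, and comparing the two codes with a relative tolerance rather than digit by digit would be more meaningful. If your mismatch is much larger than that, the cause is more likely elsewhere, for example in the kernels that fill res_sqr_d.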