I have a small question about device synchronization. Here is the code I am working on:
do it = 1, max_iters

   %%%%Some kernels outputting res_sqr_d%%%%

   temp = 0.0
   !$cuf kernel do <<< *, * >>>
   do i = 1, points
      temp = temp + res_sqr_d(i)
   end do
   sum_res_sqr = temp

   residue = dsqrt(sum_res_sqr)/points
   if (it .le. 2) then
      res_old = residue
      residue = 0.0d0
   else
      residue = dlog10(residue/res_old)
   end if
   print*, it, residue
end do
Inside the main loop (1 to max_iters), I first launch some kernels that produce res_sqr_d as their output. I then use a reduction to sum all the array elements into temp. Finally, I transfer temp to the host, perform a few more operations on it, and print the final residue value (all of this is done on the host).
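To make the host-side arithmetic explicit, here is the quantity the loop above computes in each iteration, written out as a formula. The symbols are my own shorthand and do not appear in the code: $N$ stands for points, $s_i$ for res_sqr_d(i), $r^{(k)}$ for the value of residue before the logarithm in iteration $k$, and $r_{\text{old}}$ for res_old (which, as the code is written, ends up holding $r^{(2)}$).

```latex
r^{(k)} = \frac{1}{N}\sqrt{\sum_{i=1}^{N} s_i},
\qquad
\text{printed value} =
\begin{cases}
0, & k \le 2 \quad (\text{and } r_{\text{old}} \leftarrow r^{(k)}),\\[4pt]
\log_{10}\!\left(\dfrac{r^{(k)}}{r_{\text{old}}}\right), & k > 2.
\end{cases}
```

So any error in the summed value of temp carries straight through to the printed residue.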
For the first few iterations the values match the serial code, but after that the results are incorrect. I suspect this is caused by asynchronous behavior between the CPU and GPU. I tried adding cudaDeviceSynchronize, but that did not help either. Could you please give me some suggestions?