Hello everyone,
I have a question about device synchronization. Here’s the code I am working on:
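do it = 1, max_iters
   ! ... some kernels that output res_sqr_d ...

   ! Reduction: sum all elements of res_sqr_d into temp
   temp = 0.0
   !$cuf kernel do <<< *, * >>>
   do i = 1, points
      temp = temp + res_sqr_d(i)
   end do
   sum_res_sqr = temp

   ! Host-side post-processing of the residue
   residue = dsqrt(sum_res_sqr)/points
   if (it .le. 2) then
      ! First two iterations: store the reference value, report zero
      res_old = residue
      residue = 0.0d0
   else
      ! Later iterations: log10 of the residue relative to the reference
      residue = dlog10(residue/res_old)
   end if
   print*, it, residue
end do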
In the main loop (1 to max_iters), I first have some kernels that produce res_sqr_d as their output. I then use a reduction to sum all of the array elements into temp. Finally, I transfer temp to the host, perform some more operations on it, and print the final residue value (all of this is done on the host).
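For comparison, the same residue calculation can be written as a plain serial loop on the host. This is only a minimal sketch under stated assumptions: the host array res_sqr, the sizes, and the made-up data that stands in for the kernels are all hypothetical and are not taken from the real code. It mirrors the steps above (sum, square root divided by points, reference value for the first two iterations, log10 ratio afterwards).

```fortran
program residue_reference
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Assumed sizes, for illustration only
  integer, parameter :: points = 4, max_iters = 4
  real(dp) :: res_sqr(points)          ! hypothetical host copy of res_sqr_d
  real(dp) :: temp, residue, res_old
  integer :: it, i

  res_old = 1.0_dp
  do it = 1, max_iters
     ! Stand-in for the kernels: made-up data that shrinks every iteration
     do i = 1, points
        res_sqr(i) = real(i, dp) / 10.0_dp**it
     end do

     ! Serial sum with a double-precision accumulator
     temp = 0.0_dp
     do i = 1, points
        temp = temp + res_sqr(i)
     end do

     ! Same post-processing as in the loop above
     residue = sqrt(temp) / points
     if (it <= 2) then
        res_old = residue
        residue = 0.0_dp
     else
        residue = log10(residue / res_old)
     end if
     print *, it, residue
  end do
end program residue_reference
```

Printing temp from both versions at each iteration, before the square root and the logarithm are applied, would show whether the two codes already differ in the sum itself or only in the later host-side steps.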
For the first few iterations the values match the serial code, but after that the results become incorrect. I suspect this is caused by asynchronous execution between the CPU and the GPU. I tried calling cudaDeviceSynchronize, but that did not help either. Could you please give me some suggestions?
Thank you
Srikanth