Synchronization in CUDA

Hello,

As far as I know, CUDA's built-in barrier (__syncthreads()) only synchronizes threads within a block. To synchronize across blocks, I tried adding a global variable that each block increments when it finishes its work; every thread checks this variable before continuing to the next step. However, this does not work: the results are incorrect. An example of the Fortran code I’m working with is below:

do i = 1, n
    A(i) = calcA(i)
end do
! need synchronization here before computing the minimum
minA = minval(A)

do i = 1, n
    ! the minimum is used here
    B(i) = calcB(i, minA)
end do
! need another synchronization here before computing array C

do i = 1, n
    ! C(i) is computed from different positions in B
    ! for example: C(100) uses B(47), B(68), B(54), etc.
    C(i) = calcC(i, minA, B)
end do

I solved this problem by splitting the work into three kernels, but this sequence is executed a huge number of times, so the kernel launch overhead is significant. Is there a way to do this correctly in a single kernel, using some form of synchronization across blocks in CUDA? Do you have any ideas?
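For reference, a minimal sketch of the multi-kernel structure described above might look like the following. It relies on the fact that kernels launched on the same stream run in order, so each kernel boundary acts as a barrier across all blocks. The bodies of calcA, calcB, and calcC are hypothetical placeholders (the real ones are not shown here), the array type is assumed to be float, and the minimum is shown as its own small single-block reduction kernel for clarity.

```cuda
#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

// Hypothetical stand-ins for the real calcA/calcB/calcC.
__device__ float calcA(int i) { return (float)((i * 37) % 101); }
__device__ float calcB(int i, float minA) { return (float)i - minA; }
__device__ float calcC(int i, float minA, const float* B, int n) {
    // Reads B at positions other than i, as in the loop for C.
    return B[(i + 1) % n] + B[(i + 7) % n] - minA;
}

// Phase 1: fill A.
__global__ void kernelA(float* A, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) A[i] = calcA(i);
}

// Minimum of A, computed by a single block.
// Assumes blockDim.x is a power of two.
__global__ void kernelMin(const float* A, int n, float* minA) {
    extern __shared__ float s[];
    float m = FLT_MAX;
    for (int i = threadIdx.x; i < n; i += blockDim.x) m = fminf(m, A[i]);
    s[threadIdx.x] = m;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] = fminf(s[threadIdx.x], s[threadIdx.x + stride]);
        __syncthreads();
    }
    if (threadIdx.x == 0) *minA = s[0];
}

// Phase 2: B depends on the global minimum of A.
__global__ void kernelB(float* B, int n, const float* minA) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) B[i] = calcB(i, *minA);
}

// Phase 3: C depends on arbitrary elements of B.
__global__ void kernelC(float* C, const float* B, int n, const float* minA) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = calcC(i, *minA, B, n);
}

int main() {
    const int n = 1 << 16;
    const int threads = 256;  // power of two, required by kernelMin
    const int blocks = (n + threads - 1) / threads;

    float *A, *B, *C, *minA;
    cudaMalloc(&A, n * sizeof(float));
    cudaMalloc(&B, n * sizeof(float));
    cudaMalloc(&C, n * sizeof(float));
    cudaMalloc(&minA, sizeof(float));

    // Launches on the default stream execute in order, so each launch
    // sees the complete results of the previous one.
    kernelA<<<blocks, threads>>>(A, n);
    kernelMin<<<1, threads, threads * sizeof(float)>>>(A, n, minA);
    kernelB<<<blocks, threads>>>(B, n, minA);
    kernelC<<<blocks, threads>>>(C, B, n, minA);
    cudaDeviceSynchronize();

    float c0;
    cudaMemcpy(&c0, C, sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", c0);

    cudaFree(A); cudaFree(B); cudaFree(C); cudaFree(minA);
    return 0;
}
```

Keeping minA in device memory avoids a host round trip between phases, so the remaining cost is the launch overhead itself.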

Thanks