Hi Henrique,
I think the problem here is that you really need global synchronization to make this work correctly. "syncthreads" only synchronizes threads within a single block, not across the whole grid. Also, not all blocks may be active at the same time, and blocks can run in any order. Hence the block that contains "idx=1" may not execute before the other blocks, so "flags" may still be uninitialized when those blocks read it.
Is it related to the update of the flag variable in cache?
Yes, this could be another issue in that you can't guarantee that updates to "flag" are visible to all threads. You might try adding a "threadfence_system" call and adding the "volatile" attribute to "flag" so it always reads and writes global memory rather than a cached copy. Though, you'll still have the global synchronization issue, and we currently have a bug which causes "volatile" to be dropped at "-O2" or above, so you'd need to compile at "-O1".
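For concreteness, here's a minimal sketch of the volatile-plus-fence pattern. The module and kernel names are made up for illustration, and I'm assuming "flag" is a module-level device variable (adapt to however it's declared in your code):

```fortran
module flag_mod
  ! "volatile" forces each access to flag to go to global memory rather
  ! than a cached copy (note the -O1 caveat above)
  integer, volatile, device :: flag
contains
  attributes(global) subroutine producer()
    integer :: idx
    idx = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (idx == 1) then
      flag = 1
      ! flush the store so threads in other blocks (and the host) can see it
      call threadfence_system()
    end if
  end subroutine producer
end module flag_mod
```

Keep in mind this only addresses visibility of the store; without grid-wide synchronization there's still no guarantee about the order in which blocks run.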
See: https://www.pgroup.com/resources/docs/19.10/x86/cuda-fortran-prog-guide/index.htm#cfref-dev-code-new-intrinsic-funcs
If you have a device with compute capability 7.0 or higher (i.e. Volta or newer), you may be able to achieve global synchronization by giving the kernel the "grid_global" attribute combined with cooperative groups using a "grid_group" type. This synchronizes across all threads in the grid. I haven't used it much myself, but understand it may have an adverse impact on performance. There's an example at the bottom of this post.
Details can be found at:
https://www.pgroup.com/resources/docs/19.10/x86/cuda-fortran-prog-guide/index.htm#cfpg-sub-func-attr-grid-global
https://www.pgroup.com/resources/docs/19.10/x86/cuda-fortran-prog-guide/index.htm#cfref-fort-mods-dev-mod-coopgr
Note that with "grid_global", only as many blocks as can run concurrently on the device will be launched. Hence, if your problem size is bigger than that, you'll need to make sure the kernel can process more than one element per thread given the fixed number of threads (e.g. a grid-stride loop, as in the example below).
If you can, probably the best route would be to remove the use of "flag" so that each block is independent. Failing that, I'd look at using cooperative groups.
Hope this helps,
Mat
Example using grid_global and cooperative groups:
module coop
contains
  ! grid_global kernel: launched with <<<*,tPB>>> so the runtime picks
  ! the number of blocks that can be resident on the device at once
  attributes(grid_global) subroutine g1(a, b, n, some_offset)
    use cooperative_groups
    real, intent(inout) :: a(n), b(n)
    integer, value :: n, some_offset
    type(grid_group) :: gg
    integer :: i, j
    gg = this_grid()
    ! grid-stride loop: each thread handles elements rank, rank+size, ...
    do i = gg%rank, n, gg%size
      a(i) = min(max(a(i), 0.0), 100.0) + 0.5
    end do
    ! synchronize all threads in the grid before reading the updated "a"
    call syncthreads(gg)
    do i = gg%rank, n, gg%size
      j = i + some_offset
      if (j .gt. n) j = j - n
      b(i) = a(i) + a(j)
    end do
  end subroutine g1
end module coop
program m
  use cudafor
  use coop
  implicit none
  integer, parameter :: n = 2048
  real, allocatable :: a(:), b(:)
  real :: a_exp(n), b_exp(n)
  real, device, allocatable :: a_d(:), b_d(:)
  integer :: tPB = 256
  allocate(a(n), b(n))
  allocate(a_d(n), b_d(n))
  a = 2.00
  b = 1.00
  a_exp = 2.5
  b_exp = 5.0
  a_d = a
  b_d = b
  ! "*" lets the runtime choose the grid size for the grid_global kernel
  call g1<<<*, tPB>>>(a_d, b_d, n, 1)
  a = a_d
  b = b_d
  call checkf(a, a_exp, n)
  call checkf(b, b_exp, n)
end program m