Sync all threads with threadfence?

syncthreads() only synchronizes threads within the same block. If I want to create a checkpoint for all threads on the device, should I use threadfence()? For example, in the following sample code:

attributes(global) subroutine MyTest_cuda
  integer :: s, i, j

  do s = 1, 10

    do i = blockidx%x, 20, griddim%x
      do j = threadidx%x, 20, blockdim%x
        ......
      enddo  ! j
    enddo  ! i
    call threadfence()
  enddo  ! s
end subroutine

If I call the subroutine with

call MyTest_cuda<<<30,30>>>

Since the i and j loops only go up to 20, some threads will never enter the loops. Are they going to wait at 'call threadfence()'? Thanks for the help, Mat.

Hi tty103,

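The only guaranteed method to achieve global synchronization is via multiple kernel launches. threadfence only makes sure that the calling thread's earlier memory writes are globally visible. It is a memory fence, not a barrier, so no thread waits there for the others.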
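As a rough sketch of the multiple-launch approach, assuming the "s" loop can be moved to the host: each launch then does one pass of the i/j loops, and because kernels launched into the same stream run in order, every thread of pass s has finished before pass s+1 starts. The names mytest_mod, MyTest_step, run_steps, a, n and nsteps are made up for illustration, and the increment of a(j, i) just stands in for the elided loop body.

```fortran
module mytest_mod
  use cudafor
  implicit none
contains

  ! One pass of the original i/j loops. The "s" loop now lives on the host.
  attributes(global) subroutine MyTest_step(a, n)
    integer, value :: n
    real :: a(n, n)          ! device array (hypothetical work data)
    integer :: i, j

    do i = blockidx%x, n, griddim%x
      do j = threadidx%x, n, blockdim%x
        a(j, i) = a(j, i) + 1.0   ! placeholder for the real loop body
      enddo  ! j
    enddo  ! i
  end subroutine MyTest_step

  ! Host driver: each kernel launch boundary acts as the global checkpoint.
  subroutine run_steps(a_h, n, nsteps)
    integer, intent(in) :: n, nsteps
    real, intent(inout) :: a_h(n, n)
    real, device, allocatable :: a_d(:, :)
    integer :: s, istat

    allocate(a_d(n, n))
    a_d = a_h                      ! host to device copy

    do s = 1, nsteps
      ! Launches in the same stream execute in order, so pass s
      ! completes on the whole device before pass s+1 begins.
      call MyTest_step<<<30, 30>>>(a_d, n)
    enddo

    istat = cudaDeviceSynchronize()  ! wait for the last launch
    a_h = a_d                        ! device to host copy
    deallocate(a_d)
  end subroutine run_steps

end module mytest_mod
```

With the <<<30,30>>> launch and n = 20, blocks and threads whose starting index is above 20 simply skip the loops and exit. Nothing has to wait inside the kernel, since the ordering comes from the launch boundary.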

You can try combining threadfence with atomics to build a global barrier, but it's my understanding that this can be very slow and has the potential for deadlocks. I have not attempted it myself since it's not recommended, but if you search the NVIDIA forums you can find several implementations in C.

  • Mat