I’m trying to copy data from one block to another via global memory. I know that threads can normally only be synchronised within a block, but I found a way to synchronise between blocks using a global memory barrier, which I’ve coded in my gpu_sync subroutine. It should work, because I took it from an academic paper (http://eprints.cs.vt.edu/archive/00001087/01/TR_GPU_synchronization.pdf) and I’ve tested it in a different situation. The problem is that it doesn’t appear to work in this particular situation, and I don’t know why.
Here’s my code:
!--------------------------------------------------------------------
! MODULES
!--------------------------------------------------------------------
module my_kernels
  use cudafor
  implicit none
  ! variables
  real,device :: d_var1
  real,device :: d_var2
  real,device :: d_test
  real :: h_test
  ! gpu_sync arrays
  integer,device,dimension(2) :: Arrayin
  integer,device,dimension(2) :: Arrayout
contains
!--------------------------------------------------------------------
! KERNEL
!--------------------------------------------------------------------
attributes(global) subroutine kernel1(iters)
  integer :: i, j, index
  integer, intent(in), value :: iters
  ! set up thread and block IDs
  i = threadidx%x
  j = blockidx%x
  ! use global memory multiple times in block #1 (loop over iters times)
  if(j == 1) then
    do index = 1,iters
      d_var2 = 1.0
    end do
  endif
  ! If block #1, then set d_var1 to 1.0
  if(j == 1) then
    d_var1 = 1.0
  end if
  ! if gpu_sync works the way it should, then it will
  ! wait for d_var1 (global memory) to be updated before carrying on
  call syncthreads()
  call gpu_sync(i, j)
  call threadfence()
  ! If block #2, copy value from d_var1 to d_test (which gets printed)
  if(j > 1) then
    d_test = d_var1 ! d_var1 should be 1.0, but instead it's still 0.0
  end if
end subroutine kernel1
!--------------------------------------------------------------------
! GPU_SYNC
!--------------------------------------------------------------------
attributes(device) subroutine gpu_sync(i, j)
  integer, intent(in), value :: i, j
  ! only thread 1 is used for synchronisation
  if(i == 1) Arrayin(j) = 1
  if(j == 1) then
    do while(Arrayin(i) /= 1)
      ! wait for condition to be true
    end do
    call syncthreads()
    Arrayout(i) = 1
  end if
  if(i == 1) then
    do while(Arrayout(j) /= 1)
      ! wait for condition to be true
    end do
  end if
  call syncthreads()
end subroutine gpu_sync
end module my_kernels
!--------------------------------------------------------------------
! MAIN PROGRAM
!--------------------------------------------------------------------
program test
  use cudafor
  use my_kernels
  implicit none
  integer :: i, j, iters
  print*, "set number of iterations for block 1 to loop through: "
  read(*,*) iters
  ! initialise variables and arrays
  d_var1 = 0.0
  d_var2 = 0.0
  d_test = 0.0
  h_test = 0.0
  Arrayin = 0
  Arrayout = 0
  ! -------------- invoke kernel ----------------
  ! (2 blocks of 2 threads each)
  ! ---------------------------------------------
  call kernel1 <<<2,2>>> (iters)
  ! Copy device data to host and print
  h_test = d_test
  print*,h_test ! if it works correctly, h_test should be 1.0 not 0.0
end program test
All variables with a d_ prefix are stored in global memory, and all variables with an h_ prefix are stored in host memory. Arrayin and Arrayout are the arrays used in the gpu_sync subroutine.
- I start by running the kernel with 2 blocks of 2 threads each.
- In block 1, I set d_var2 = 1.0 ‘iters’ times so that block 1 performs more work than block 2.
- Block 2 goes straight to gpu_sync and waits until block 1 has finished.
- After block 1 has finished, it sets d_var1 = 1.0 and then calls gpu_sync so that both block 1 and block 2 can carry on.
- Block 2 then copies d_var1 to d_test.
- After the kernel has finished, d_test gets copied to h_test and is then printed. If it works, then h_test = 1.0. If not, then h_test = 0.0.
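The last step above trusts whatever ends up in h_test, so it may be worth ruling out a failed or aborted kernel before reading the result. The following is a minimal sketch of that check, not part of the original program: the module check_demo, the kernel set_flag and the variable d_flag are hypothetical stand-ins, while cudaGetLastError, cudaDeviceSynchronize and cudaGetErrorString come from the cudafor module.

```fortran
module check_demo
  use cudafor
  implicit none
  real, device :: d_flag          ! hypothetical stand-in for d_test
contains
  ! Hypothetical trivial kernel: one thread writes a known value.
  attributes(global) subroutine set_flag()
    if (threadidx%x == 1 .and. blockidx%x == 1) d_flag = 1.0
  end subroutine set_flag
end module check_demo

program check_launch
  use cudafor
  use check_demo
  implicit none
  integer :: istat
  real :: h_flag

  d_flag = 0.0
  call set_flag<<<2,2>>>()        ! 2 blocks of 2 threads each

  ! Errors in the launch itself (e.g. a bad configuration) are reported here.
  istat = cudaGetLastError()
  if (istat /= cudaSuccess) then
    print *, "launch failed: ", trim(cudaGetErrorString(istat))
    stop
  end if

  ! Errors raised while the kernel runs are reported once the host
  ! has waited for the device to finish.
  istat = cudaDeviceSynchronize()
  if (istat /= cudaSuccess) then
    print *, "kernel failed: ", trim(cudaGetErrorString(istat))
    stop
  end if

  ! Only now is the device value copied back and printed.
  h_flag = d_flag
  print *, h_flag
end program check_launch
```

If both checks pass and the printed value is still wrong, the kernel ran to completion and the problem lies in the ordering of the memory operations rather than in the launch.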
The only thing I change here is ‘iters’, the number of iterations that block 1 goes through. If I set it to something low like 10, then everything works fine and I get h_test = 1.0. If I set it to something high like 10000, gpu_sync doesn’t appear to work and I get h_test = 0.0. Why does this happen?
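One possibility consistent with these symptoms, offered as a guess rather than a confirmed diagnosis, is that the compiler keeps d_var1 or the flag arrays in registers, or reorders the read of d_var1 relative to the barrier. The sketch below is one way to test that idea. It assumes the compiler accepts the Fortran volatile attribute on device variables, and it moves threadfence() so that it sits between the write to d_var1 and the barrier. The names sync_sketch, kernel_sketch and gpu_sync_sketch are hypothetical, and the work loop on d_var2 is left out for brevity.

```fortran
module sync_sketch
  use cudafor
  implicit none
  ! ASSUMPTION: volatile is honoured for device data, so that every access
  ! goes to global memory instead of reusing a value held in a register.
  real,    device, volatile :: d_var1
  real,    device           :: d_test
  integer, device, volatile, dimension(2) :: Arrayin, Arrayout
contains
  attributes(global) subroutine kernel_sketch()
    integer :: i, j
    i = threadidx%x
    j = blockidx%x
    if (j == 1 .and. i == 1) d_var1 = 1.0
    ! Make the write to d_var1 visible to the whole device BEFORE
    ! this block announces its arrival at the barrier.
    call threadfence()
    call syncthreads()
    call gpu_sync_sketch(i, j)
    ! After the barrier, block 2 reads the value written by block 1.
    if (j > 1 .and. i == 1) d_test = d_var1
  end subroutine kernel_sketch

  attributes(device) subroutine gpu_sync_sketch(i, j)
    integer, intent(in), value :: i, j
    ! Thread 1 of each block signals that its block has arrived.
    if (i == 1) Arrayin(j) = 1
    if (j == 1) then
      ! In block 1, thread i waits for block i (guarded by the block count).
      if (i <= griddim%x) then
        do while (Arrayin(i) /= 1)
        end do
      end if
      call syncthreads()
      ! All blocks have arrived: release each of them.
      if (i <= griddim%x) Arrayout(i) = 1
    end if
    ! Thread 1 of each block waits for its release flag.
    if (i == 1) then
      do while (Arrayout(j) /= 1)
      end do
    end if
    call syncthreads()
  end subroutine gpu_sync_sketch
end module sync_sketch

program sketch_main
  use cudafor
  use sync_sketch
  implicit none
  real :: h_test
  integer :: istat
  d_var1 = 0.0
  d_test = 0.0
  Arrayin = 0
  Arrayout = 0
  call kernel_sketch<<<2,2>>>()   ! both blocks must be resident at once
  istat = cudaDeviceSynchronize()
  h_test = d_test
  print *, h_test
end program sketch_main
```

Like the original, this kind of barrier relies on every block being resident on the device at the same time. That holds for 2 blocks, but it would deadlock for a grid larger than the hardware can run concurrently.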
Sorry if this code example is too long-winded and complicated. I’ve tried to make it as simple as possible while still illustrating the problem.
I’ve been stuck on this problem for months, so any help will be very much appreciated. Thanks.