I have a problem with inter-block block synchronisation in CUDA. I’m launching a kernel with 2 blocks of 2 threads. In block 1 I have set a global memory array ‘arr1’ to a certain value, which is then to be used in block 2. The problem is that block 2 is then using ‘arr1’ but instead of seeing the updated values it’s seeing the old values. So to counter this I thought I could use the subroutine threadfence() to wait for arr1 to be updated before block 2 reads it. This doesn’t appear to work. Am I doing something wrong or have I misunderstood how threadfence() works?
Here’s the full code:
!--------------------------------------------------------------------
! MAIN PROGRAM
!--------------------------------------------------------------------
program test
use cudafor
use my_kernels
implicit none
integer :: i, j
! initialise device arrays
arr1 = 0.0
arr2 = 0.0
d_test = 0.0
! invoke kernel
call kernel1<<<2,2>>>()
! copy device array to host and print
h_test = d_test
print*
do j= 1, 2
do i= 1, 2
write(*,'(F7.2)',advance='no') h_test(j,i)
end do
write(*,*)
end do
print*
end program test
!--------------------------------------------------------------------
! MODULES
!--------------------------------------------------------------------
module my_kernels
use cudafor
implicit none
! arrays
real,device,dimension(2) :: arr1
real,device,dimension(2) :: arr2
real,device,dimension(2,2) :: d_test
real, dimension(2,2) :: h_test
contains
!--------------------------------------------------------------------
! KERNEL
!--------------------------------------------------------------------
attributes(global) subroutine kernel1()
integer :: i, j
real :: var1, var2
real, shared, dimension(2) :: shared_mem
! setup
i = threadidx%x
j = blockidx%x
shared_mem(i) = 1.0
var1 = 0.0
! carry out miscellaneous calculations in block 1
if(j == 1) then
var2= log(1.0/sqrt(shared_mem(1)))
var1 = var1 - var2
arr2(i) = var1
endif
!--------------------------------------------------------------------
! PROBLEM OCCURS HERE
!--------------------------------------------------------------------
if(j == 1) then
arr1(i) = 1
d_test(j,i) = arr1(i)
end if
! if threadfence works the way I think it should work, then it will
! wait for arr1 to be updated before carrying on?
call syncthreads()
call threadfence()
if(j == 2) then
d_test(j,i) = arr1(i) ! arr1 should be 1.0, but instead its 0.0
end if
end subroutine kernel1
end module my_kernels
The way this works is that I have a dimension(2,2) global memory array d_test which loads the values for the dimension(2) arr1 from each block.
I then copy d_test into the host array, h_test, and print it.
The output I’m supposed to get is this:
1.00 1.00
1.00 1.00
But instead I get this:
1.00 1.00
0.00 0.00
The first row corresponds to block 1, and the second row corresponds to block 2. So when I update the global memory arr1, both blocks should see that it gets updated to 1.0. But in the second block despite the fact that I’m using threadfence() to enforce a specific order, the second block sees the old values of 0.0.
This problem is made even more confusing by the fact that if I comment out other unrelated pieces of code (miscellaneous calculations in block 1), then it works. Sometimes I run it and seperate runs will either work or not work, randomly. What’s going on here? Is it a race condition or something?
Sorry if this looks complicated, but I tried to simplify my program as much as possible to get down to the basic problem so I could post it on this forum. Thanks for any help