Different cuda blocks see different values for global memory

I have a problem with inter-block block synchronisation in CUDA. I’m launching a kernel with 2 blocks of 2 threads. In block 1 I have set a global memory array ‘arr1’ to a certain value, which is then to be used in block 2. The problem is that block 2 is then using ‘arr1’ but instead of seeing the updated values it’s seeing the old values. So to counter this I thought I could use the subroutine threadfence() to wait for arr1 to be updated before block 2 reads it. This doesn’t appear to work. Am I doing something wrong or have I misunderstood how threadfence() works?

Here’s the full code:


!--------------------------------------------------------------------
! MAIN PROGRAM
!--------------------------------------------------------------------  

program test
  use cudafor
  use my_kernels
  
  implicit none  
  integer :: i, j
  
  ! initialise device arrays
  arr1 = 0.0
  arr2 = 0.0
  d_test = 0.0

  ! invoke kernel
  call kernel1<<<2,2>>>()
  
  ! copy device array to host and print
  h_test = d_test
  print*
  do j= 1, 2
     do i= 1, 2
        write(*,'(F7.2)',advance='no') h_test(j,i)
     end do
     write(*,*)
  end do
  print*
  
end program test

!--------------------------------------------------------------------
! MODULES
!--------------------------------------------------------------------    

module my_kernels
  use cudafor
  implicit none

  ! arrays
  real,device,dimension(2)   :: arr1
  real,device,dimension(2)   :: arr2
  real,device,dimension(2,2) :: d_test
  real,       dimension(2,2) :: h_test

contains  
  
  !--------------------------------------------------------------------
  ! KERNEL
  !--------------------------------------------------------------------    

  attributes(global) subroutine kernel1()

    integer :: i, j
    real :: var1, var2
    real, shared, dimension(2) :: shared_mem
            
    ! setup
    i  = threadidx%x
    j  = blockidx%x
    shared_mem(i) = 1.0
    var1 = 0.0
    
    ! carry out miscellaneous calculations in block 1
    if(j == 1) then
       var2= log(1.0/sqrt(shared_mem(1)))
       var1  = var1 - var2
       arr2(i) = var1
    endif
    
    !--------------------------------------------------------------------
    ! PROBLEM OCCURS HERE
    !--------------------------------------------------------------------   
    if(j == 1) then
       arr1(i) = 1
       d_test(j,i) = arr1(i)
    end if

    ! if threadfence works the way I think it should work, then it will
    ! wait for arr1 to be updated before carrying on?
    call syncthreads()
    call threadfence()
    
    if(j == 2) then
       d_test(j,i) = arr1(i) ! arr1 should be 1.0, but instead its 0.0
    end if
        
  end subroutine kernel1  
  
end module my_kernels

The way this works is that I have a dimension(2,2) global memory array d_test which loads the values for the dimension(2) arr1 from each block.
I then copy d_test into the host array, h_test, and print it.

The output I’m supposed to get is this:

1.00 1.00
1.00 1.00

But instead I get this:

1.00 1.00
0.00 0.00

The first row corresponds to block 1, and the second row corresponds to block 2. So when I update the global memory arr1, both blocks should see that it gets updated to 1.0. But in the second block despite the fact that I’m using threadfence() to enforce a specific order, the second block sees the old values of 0.0.

This problem is made even more confusing by the fact that if I comment out other unrelated pieces of code (miscellaneous calculations in block 1), then it works. Sometimes I run it and seperate runs will either work or not work, randomly. What’s going on here? Is it a race condition or something?

Sorry if this looks complicated, but I tried to simplify my program as much as possible to get down to the basic problem so I could post it on this forum. Thanks for any help

Hi Tom,

All a threadfence does is flush the thread’s cache back to global memory. Threads do not block and the only guarantee is that reads made after the threadfence will get the new value. In your case, the threads in block 2 read arr1 and are retired long before the threads in block 1 reach the threadfence since block 1 has significantly more work.

You can increase in the chance that block 2 will see the new value if you can have block 2 match the same amount of work so that it’s read to arr1 is made after the threadfence. Though, this is not guaranteed and will depend upon how the blocks are scheduled.

You might consider running two separate kernels since this is currently the only way to guarantee global synchronization across blocks.

  • Mat

Thanks. I think I’ve found a solution that doesn’t require relaunching the kernel, however, using fast barrier inter-block synchronisation:
http://eprints.cs.vt.edu/archive/00001087/01/TR_GPU_synchronization.pdf
I’m going to give this a go

I read the paper and converted the gpu_sync() into Fortran. Here’s the code:


  attributes(device) subroutine gpu_sync(goalVal)

    integer :: tid, bid, nBlockNum
    integer, intent(in), value :: goalVal

    ! thread ID in a block
    tid = threadIdx%x
    bid = blockIdx%x
    nBlockNum = gridDim%x

    ! only thread 1 is used for synchronisation
    if(tid == 1) Arrayin(bid) = goalVal

    if(bid == 1) then
       if(tid <= nBlockNum) then
          do while(Arrayin(tid) /= goalVal)
             ! wait for condition to be true
          end do
       end if
       call syncthreads()

       if(tid <= nBlockNum) Arrayout(tid) = goalVal
    end if

    if(tid == 1) then
       do while(Arrayout(bid) /= goalVal)
          ! wait for condition to be true
       end do
    end if
    call syncthreads()

  end subroutine gpu_sync

So long as I call gpu_sync() followed by threadfence(), then it works (i.e the global memory is read after its updated). This is probably the fastest way to communicate between blocks.