Having trouble with inter-block communication

I’m trying to copy data from one block to another via global memory. I know you can normally only synchronise threads within a block, but I found a way to synchronise between blocks using a global memory barrier, coded in my gpu_sync subroutine. The method comes from an academic paper (http://eprints.cs.vt.edu/archive/00001087/01/TR_GPU_synchronization.pdf) and I’ve already tested it successfully in a different situation. The problem is that it doesn’t appear to work in this particular situation and I don’t know why.

Here’s my code:

!-------------------------------------------------------------------- 
! MODULES 
!--------------------------------------------------------------------    

module my_kernels 
  use cudafor 
  implicit none 

  ! variables 
  real,device :: d_var1
  real,device :: d_var2
  real,device :: d_test
  real        :: h_test

  ! gpu_sync arrays
  integer,device,dimension(2) :: Arrayin
  integer,device,dimension(2) :: Arrayout
  
contains  
  
  !-------------------------------------------------------------------- 
  ! KERNEL 
  !--------------------------------------------------------------------    

  attributes(global) subroutine kernel1(iters) 

    integer :: i, j, index
    integer, intent(in), value :: iters
            
    ! setup thread and block id's 
    i  = threadidx%x 
    j  = blockidx%x 
    
    ! use global memory multiple times in block #1 (loop over iters times) 
    if(j == 1) then
       do index = 1,iters
          d_var2 = 1.0
       end do
    endif 
    
    ! If block #1, then set d_var1 as 1.0
    if(j == 1) then 
       d_var1 = 1.0
    end if 

    ! if gpu_sync works the way it should work, then it will 
    ! wait for d_var1 (global memory) to be updated before carrying on
    call syncthreads()
    call gpu_sync(i, j)
    call threadfence() 

    ! If block #2, copy value from d_var1 to d_test (which gets printed)
    if(j > 1) then 
       d_test = d_var1 ! d_var1 should be 1.0, but instead its still 0.0
    end if
        
  end subroutine kernel1  

  !-------------------------------------------------------------------- 
  ! GPU_SYNC
  !--------------------------------------------------------------------    
  
  attributes(device) subroutine gpu_sync(i, j) 
 
    integer, intent(in), value :: i, j

    ! only thread 1 is used for synchronisation 
    if(i == 1) Arrayin(j) = 1 

    if(j == 1) then 
       do while(Arrayin(i) /= 1) 
          ! wait for condition to be true 
       end do
       call syncthreads() 

       Arrayout(i) = 1
    end if

    if(i == 1) then 
       do while(Arrayout(j) /= 1) 
          ! wait for condition to be true 
       end do
    end if
    call syncthreads() 

  end subroutine gpu_sync
  
end module my_kernels

!-------------------------------------------------------------------- 
! MAIN PROGRAM 
!--------------------------------------------------------------------  

program test 
  use cudafor 
  use my_kernels 
  
  implicit none  
  integer :: i, j, iters

  print*, "set number of iterations for block 1 to loop through: "
  read(*,*) iters
    
  ! initialise variables and arrays 
  d_var1   = 0.0
  d_var2   = 0.0
  d_test   = 0.0 
  h_test   = 0.0
  Arrayin  = 0
  Arrayout = 0
  
  ! -------------- invoke kernel ----------------
  ! --------------------------------------------- 
  call kernel1 <<<2,2>>> (iters) 

  ! Copy device data to host and print
  h_test = d_test
  print*,h_test ! if it works correctly, h_test should be 1.0 not 0.0
  
end program test

All variables with a d_ prefix are stored in global (device) memory and all variables with an h_ prefix are stored in host memory. Arrayin and Arrayout are the arrays used by the gpu_sync subroutine.

  1. I start by running the kernel with 2 blocks of 2 threads each.
  2. In block 1, I set d_var2 = 1.0 ‘iters’ times so that block 1 performs more work than block 2.
  3. Block 2 goes straight to gpu_sync and waits until block 1 has finished.
  4. After block 1 has finished, it sets d_var1 = 1.0 and then calls gpu_sync so that both block 1 and block 2 can carry on.
  5. Block 2 then copies d_var1 to d_test.
  6. After the kernel has finished, d_test gets copied to h_test and is then printed. If it works, then h_test = 1.0. If not then h_test = 0.0.

The only thing I change here is the number of iterations, ‘iters’, that block 1 loops through. If I set it to something low like 10, everything works fine and I get h_test = 1.0. If I set it to something high like 10000, gpu_sync doesn’t appear to work and h_test = 0.0. Why does this happen?

Sorry if this code example is too long-winded and complicated; I’ve tried to make it as simple as possible while still illustrating the problem.

I’ve been stuck on this problem for months, so any help will be very much appreciated. Thanks.

Hi Tom,

If I move the call to threadfence before the call to gpu_sync, then I get the expected answers. My guess is that block 2 is exiting gpu_sync before block 1 issues the threadfence, so block 2 is getting the old value.
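Roughly, I mean reordering the barrier section of your kernel like this, so the write is flushed to global memory before the inter-block barrier:

    ! If block #1, then set d_var1 as 1.0
    if(j == 1) then 
       d_var1 = 1.0
    end if 

    call syncthreads()
    call threadfence()   ! flush the write to global memory first
    call gpu_sync(i, j)  ! then do the inter-block barrier

    ! If block #2, copy value from d_var1 to d_test (which gets printed)
    if(j > 1) then 
       d_test = d_var1
    end if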

I’m highly skeptical this method will work unless you can guarantee all blocks are active. What happens if a block hasn’t started executing? Blocks aren’t swapped out, so you can get into a situation where the active blocks sit in a loop waiting for an inactive block that will never run, because there are no open multiprocessors. Also, since different devices have different numbers of multiprocessors, what works on one device may hang on another.

Granted, I didn’t read the paper so these issues may have been addressed. Hopefully, you can find a way to make it work beyond a few blocks and threads.

  • Mat


I tried moving the call to threadfence before gpu_sync, but that made no difference. I also tried putting threadfence both before and after gpu_sync, but it still didn’t work.

However, when I compiled with -ta=nvidia,nocache, it did work. So the problem looks like it has something to do with caching, which I assumed threadfence would sort out. That fixes this simple example but not the main code I’m working on, and it comes at the cost of disabling the cache.

When I look at various posts on the NVIDIA forums, most of them say you should never attempt inter-block communication on the GPU, because the only way to guarantee barrier synchronisation is to launch a new kernel. But that would be too slow, and the academic paper suggests that I can have an inter-block barrier WITHIN the kernel. Perhaps the method only works because they are using CUDA C, and it relies on features that aren’t available in CUDA Fortran yet?
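Just to spell out the alternative I’m trying to avoid: relaunching would mean splitting kernel1 at the barrier into two kernels (kernel1_before_barrier and kernel1_after_barrier below are just made-up names for the two halves), with the launch boundary acting as the device-wide barrier:

  ! hypothetical split of kernel1 at the barrier point; kernels
  ! launched on the default stream run in order, so the second
  ! launch sees the first one's global memory writes
  call kernel1_before_barrier <<<2,2>>> (iters)  ! block 1 sets d_var1 = 1.0
  call kernel1_after_barrier  <<<2,2>>> ()       ! block 2 copies d_var1 to d_test

That would work, but then I’d be paying the kernel launch overhead every time I need a barrier, which is exactly what the paper’s approach is supposed to avoid.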

By the way, thanks for warning me about the problem of guaranteeing that all the blocks are active. I’m guessing that problem only occurs if I run more blocks than there are multiprocessors?

Fixed it now! I had another look at the academic paper and found, in the footnotes, what the ‘…’ dots in the code example mean. It turns out that instead of leaving the do while loop empty, you need to put an atomic function inside it to guarantee that the compiler won’t “optimise out” the loop. In my code I replaced this:

       do while(Arrayin(i) /= 1) 
          ! wait for condition to be true 
       end do

with this:


       do while(Arrayin(i) /= 1) 
          d_count1(i,j) = atomicCAS(d_count1(i,j), 0, 1)
       end do

I think the compiler must have optimised the empty do while loop away completely because it had no body. But with the atomicCAS operation on a global memory array inside the loop, it works.
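For completeness, here’s roughly what my gpu_sync looks like now. I made the same change in both polling loops, and d_count1 is a scratch integer array I added to the module at device scope (sized here for 2 threads per block and 2 blocks; it would need to match your launch configuration):

  ! module-level addition: scratch array for the atomic operations
  integer,device,dimension(2,2) :: d_count1

  attributes(device) subroutine gpu_sync(i, j) 

    integer, intent(in), value :: i, j

    ! only thread 1 is used for synchronisation 
    if(i == 1) Arrayin(j) = 1 

    ! block 1 waits for every block to arrive, then releases them all
    if(j == 1) then 
       do while(Arrayin(i) /= 1) 
          ! the atomic read-modify-write stops the compiler from
          ! optimising the polling loop away
          d_count1(i,j) = atomicCAS(d_count1(i,j), 0, 1)
       end do
       call syncthreads() 

       Arrayout(i) = 1
    end if

    ! each block spins until block 1 has released it
    if(i == 1) then 
       do while(Arrayout(j) /= 1) 
          d_count1(i,j) = atomicCAS(d_count1(i,j), 0, 1)
       end do
    end if
    call syncthreads() 

  end subroutine gpu_sync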

Thanks Tom. Gotta love footnotes.

“I’m guessing that problem only occurs if I run more blocks than there are multiprocessors?”

I believe so. You’ll most likely want to set the number of blocks dynamically based on the device property “multiProcessorCount”. Also, will the algorithm allow you to increase the number of threads per block? Running with only as many threads per block as you have blocks will limit your performance.
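Roughly, something like this on the host side (nthreads here is just a placeholder for whatever thread count per block your algorithm allows; you’d also need to size Arrayin, Arrayout and d_count1 to match):

  ! size the grid to the number of multiprocessors so every block
  ! can be resident on the device at the same time
  type(cudaDeviceProp) :: prop
  integer :: istat, nblocks, nthreads

  istat    = cudaGetDeviceProperties(prop, 0)
  nblocks  = prop%multiProcessorCount
  nthreads = 64   ! placeholder: whatever your algorithm allows

  call kernel1 <<<nblocks, nthreads>>> (iters)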

  • Mat