CUDA: threads over 2 warps not synchronising correctly

I launch a kernel with 64 threads in 1 block, so I have 2 warps of 32 threads each. I have a shared memory variable called shared_test which should hold the same value no matter what the thread number is. The problem is that when I assign a value to this shared variable using just the first thread, the first 32 threads see that change but the last 32 threads don’t. Here is the kernel code:


  attributes(global) subroutine kernel_test()

    integer, shared :: shared_test
    integer :: tid

    ! thread id
    tid = threadIdx%x

    ! initialise shared_test
    shared_test = 0
    call syncthreads()
    
    !---------------------------------------------------------       
    ! problem
    !---------------------------------------------------------       

    if (tid.eq.1) then

       call syncthreads()
       shared_test = 1
       
    else
       ! do nothing
    endif

    !---------------------------------------------------------       
    ! sync, copy shared to global
    !---------------------------------------------------------           
    call syncthreads()    
    ! d_test is a module-level device array
    d_test(tid) = shared_test
    
  end subroutine kernel_test

When I copy the device array (d_test) into host memory and then print all 64 elements, the output I get is this:

dbg: test array
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0

So the first 32 threads see the correct value of shared_test, but the last 32 are wrong. I’ve tried putting syncthreads() everywhere, inside and outside the conditional branches, but it doesn’t work unless I comment out the syncthreads() BEFORE shared_test = 1; then I get 1s for all 64 threads. Why is this the case?

This code is a simple demonstration of the fundamental problem in the actual code that I’m using. I need to be able to have a syncthreads() before I set shared_test = 1, for reasons that go beyond this demonstration.

Thanks for the help

Hi Tom,

Can you please post the full example? I tried to recreate your issue, but it works for me, so I suspect there is something else going on. Note that I took out the extraneous syncthreads() in the if statement.

  • Mat
% cat test.cuf
module foo

use cudafor

integer, allocatable, dimension(:),device :: d_test
integer, allocatable, dimension(:) :: d

contains

  attributes(global) subroutine kernel_test()

    integer, shared :: shared_test
    integer :: tid

    ! thread id
    tid = threadIdx%x

    ! initialise shared_test
    shared_test = 0
    call syncthreads()

    if (tid.eq.1) then
       shared_test = 1
    endif
    call syncthreads()   
    d_test(tid) = shared_test
   
  end subroutine kernel_test

end module foo

program testme

use foo

allocate(d(64),d_test(64))
d=0
d_test=0
call kernel_test<<<1,64>>>()
d=d_test
print *, d

end program testme
 
% pgf90 test.cuf -V11.5 ; a.out
            1            1            1            1            1            1 
            1            1            1            1            1            1 
            1            1            1            1            1            1 
            1            1            1            1            1            1 
            1            1            1            1            1            1 
            1            1            1            1            1            1 
            1            1            1            1            1            1 
            1            1            1            1            1            1 
            1            1            1            1            1            1 
            1            1            1            1            1            1 
            1            1            1            1

But that’s exactly how the problem occurs: when I have the syncthreads() placed just before shared_test = 1. I need to know why this happens, as I can’t see any obvious explanation. I could just fix it by removing that syncthreads() statement, but it has to be there for my other code (which is too complicated to explain and post here) to work.

Hi Tom,

From NVIDIA’s CUDA Programming Guide:

__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.

CUDA Fortran follows the same rule. Unfortunately, you will need to change your algorithm, since what you want to do simply won’t work.
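For example, a barrier inside a conditional is fine when every thread in the block evaluates the conditional identically. A minimal sketch (the kernel argument n here is hypothetical):

  attributes(global) subroutine kernel_uniform(n)

    integer, value :: n
    integer, shared :: shared_test

    ! legal: n has the same value for every thread in the
    ! block, so either all threads reach the barrier or none do
    if (n > 0) then
       if (threadIdx%x .eq. 1) shared_test = n
       call syncthreads()
    endif

  end subroutine kernel_uniform

In your kernel the conditional is tid.eq.1, so only one thread out of 64 ever reaches the barrier inside the branch.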

  • Mat

Thanks, that very useful information has just explained what was wrong with the code that I’ve been trying to debug for about 4 months!

I found the solution on the official NVIDIA forums:

If you ever want to DIVIDE a block into 2 groups and SYNCHRONIZE within them, you can still do it (virtually) by DIVIDING the IF-ELSE construct into multiple IF-ELSE constructs (all dividing the block in the same fashion) with __syncthreads() in between each of them.

So my problem was fixed by taking the syncthreads() OUT of the single if-else statement, splitting up the if-else into multiple if-else statements, and putting the syncthreads() calls in between, as sketched below.
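Applied to the demonstration kernel, the pattern looks something like this (a sketch of the idea, not my actual code):

  if (tid.eq.1) then
     ! ... work that must complete before the barrier ...
  endif

  ! the barrier now sits outside the branch, so every
  ! thread in the block reaches it
  call syncthreads()

  if (tid.eq.1) then
     shared_test = 1
  endif

  call syncthreads()
  d_test(tid) = shared_test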

Great. I’ll try to remember that as well.

  • Mat