CUDA: Getting stuck in syncthreads() loop

I’m writing a kernel and have come across a rather unique situation. I have a logical variable in the global device memory called d_flag, which is initially set to ‘true’. I call this subroutine while inside the kernel:

  attributes(device) subroutine devicesub (tid)

    integer, intent(in), value :: tid
    
    if (tid.eq.1) then    

       d_flag = .false.
       call syncthreads()

    else

       do while (d_flag)
          call syncthreads()
       end do

    endif

  end subroutine devicesub

If I launch a kernel with 4 threads, for example, the first thread is supposed to go through the first conditional branch when I call this subroutine, while the other 3 threads go through the ‘else’ branch. These 3 threads are supposed to keep calling syncthreads() until d_flag is set to false. But this doesn’t happen: when I run the program, it just gets stuck in this ‘do while’ loop forever.

Please let me know where I’m going wrong, thanks.

Hi Tom,

I see a few issues. First, memory between threads is not always coherent. Hence the value of “d_flag” may not be consistent between threads. To force coherency, you need to use atomics. The warp vote functions should work as well.

The biggest issue with this code is your use of syncthreads. All threads in a thread block must reach and execute the same syncthreads call statement. You have different threads executing different statements and a different number of statements. Instead, the code should be more like:

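  attributes(device) subroutine devicesub (tid)

    integer, intent(in), value :: tid
    integer :: istat

    ! Assumes d_flag is an integer flag (1 = set, 0 = cleared)
    if (tid.eq.1) then
       ! XOR with 1 flips the flag from 1 to 0; the old value is returned
       istat = atomicxor(d_flag,1)
    else
       do while (d_flag.ne.0)
          ! SPIN WAIT
       end do
    endif
    ! Single barrier outside the branch, reached by every thread
    call syncthreads()

  end subroutine devicesub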

Granted I haven’t tested the above code, but this is the basic idea.

Hope this helps,
Mat

The atomicxor doesn’t work; the compiler says:

PGF90-S-0155-Could not resolve generic procedure atomicxor (synctest.f90: 43)

This might be an issue with the compute capability, as it’s 1.1 on this device.

The point of the do while loop is to carry out a new syncthreads() each time the first thread encounters syncthreads() in the other conditional branch. I thought that syncthreads() was global, so that it doesn’t matter at which point in the code (i.e. which conditional branch) the thread encounters it.

Why isn’t d_flag consistent between different threads? It’s stored in global memory, so shouldn’t all the threads see the same value for d_flag?

Thanks

The error is because d_flag needs to be an integer. I’m guessing in your case it’s a logical.


> This might be an issue with the compute capability, as it’s 1.1 on this device.

No, CC 1.1 is fine so long as the first argument is a global device variable. CC 1.2 allows the argument to be shared or global.
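
To make the two requirements above concrete (an integer flag, held in global device memory), here is a minimal, untested sketch. The module name flag_sketch, the kernel clear_flag, and the host program are hypothetical and only for illustration. It uses atomicexch rather than atomicxor, simply because an exchange sets the flag to 0 whatever its previous value was.

```fortran
module flag_sketch
  use cudafor
  implicit none
  ! Integer flag in global device memory
  ! (hypothetical stand-in for the logical d_flag in the question)
  integer, device :: d_flag
contains
  attributes(global) subroutine clear_flag()
    integer :: istat
    if (threadIdx%x .eq. 1) then
       ! atomicexch stores 0 in d_flag and returns the previous value
       istat = atomicexch(d_flag, 0)
    end if
  end subroutine clear_flag
end module flag_sketch

program main
  use cudafor
  use flag_sketch
  implicit none
  integer :: h_flag

  d_flag = 1                    ! host-to-device copy of the initial value
  call clear_flag<<<1,4>>>()    ! one block of 4 threads
  h_flag = d_flag               ! device-to-host copy
  print *, 'flag after kernel: ', h_flag
end program main
```

Storing the flag as an integer and treating 0 as "false" keeps the atomic call resolvable, at the cost of comparing against 0 wherever the logical was tested before.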


> The point of the do while loop is to carry out a new syncthreads() each time the first thread encounters syncthreads() in the other conditional branch. I thought that syncthreads() was global, so that it doesn’t matter at which point in the code (i.e. which conditional branch) the thread encounters it.

No. Every thread in a thread block must issue the same call to syncthreads.
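
As a rough illustration of that rule, the sketch below keeps the barrier outside the conditional, so every thread in the block reaches the same syncthreads call. Thread 1 publishes a value through shared memory and the others read it only after the barrier, so no spin loop is needed. The names sync_sketch, broadcast_kernel, s_val, and d_out are made up for this example, and the code has not been run on a CC 1.1 device.

```fortran
module sync_sketch
  use cudafor
  implicit none
contains
  attributes(global) subroutine broadcast_kernel(d_out, n)
    integer, value :: n
    integer :: d_out(n)
    integer, shared :: s_val
    integer :: tid

    tid = threadIdx%x

    ! Only thread 1 writes, and the branch contains no barrier
    if (tid .eq. 1) then
       s_val = 42
    end if

    ! One barrier, reached by every thread in the block
    call syncthreads()

    ! After the barrier, all threads can read what thread 1 wrote
    if (tid .le. n) d_out(tid) = s_val
  end subroutine broadcast_kernel
end module sync_sketch

program main
  use cudafor
  use sync_sketch
  implicit none
  integer, device :: d_out(4)
  integer :: h_out(4)

  call broadcast_kernel<<<1,4>>>(d_out, 4)
  h_out = d_out                 ! device-to-host copy
  print *, h_out
end program main
```

This only coordinates threads within a single block; syncthreads does not synchronize across blocks.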

> Why isn’t d_flag consistent between different threads? It’s stored in global memory, so shouldn’t all the threads see the same value for d_flag?

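Caching. A thread is not guaranteed to re-read d_flag from global memory on every loop iteration; it may keep using a stale copy held in a register or cache, so a write by another thread can go unseen.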

- Mat