program hangs when copying between host/device


I have a working version of a code using CUDA Fortran, but I made some changes to the kernel to try to improve the runtimes, and the new version hangs when copying from device to host. The working version also performs this copy, but it does not hang.

All variables are inside modules, and there are some syncthreads() calls inside the kernel. How can I identify why the program hangs? Is it a known problem?

! code on CPU host
A_device = A_host ! normal
call kernel<<<blocks,threads>>>()
A_host = A_device ! hangs


Hi Henrique Rennó,

and the new version hangs when copying from device to host.

I doubt it’s the copy that’s hanging, rather it’s more likely that it’s the kernel that’s hung since the host will block waiting for the kernel to end before performing the copy.

You’ll need to debug your code to determine the problem, either with a debugger like cuda-gdb, or using print statements.
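One quick host-side check (a sketch, assuming the standard cudafor runtime API; `kernel`, `blocks`, and `threads` are from your snippet) is to bracket the launch with explicit error checks so the failure point is unambiguous:

```fortran
use cudafor
integer :: istat

call kernel<<<blocks,threads>>>()
istat = cudaGetLastError()        ! catches launch-configuration errors
if (istat /= cudaSuccess) print *, trim(cudaGetErrorString(istat))
istat = cudaDeviceSynchronize()   ! blocks until the kernel finishes; a hung kernel hangs here
if (istat /= cudaSuccess) print *, trim(cudaGetErrorString(istat))
A_host = A_device                 ! copy only after a clean synchronize
```

If the program stops at the cudaDeviceSynchronize() line rather than the copy, that confirms the kernel itself never completes.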

If you can provide a reproducing example, I can take a look as well.


I think I found the problem, but I don’t know why it happens. Below is a simplified code that exemplifies how it works. There’s a while loop controlled by a flag variable that repeats while the flag condition is satisfied. One of the threads is responsible for setting the flag to end the loop.

By making each thread print its number after the loop, I saw that with 16/128 blocks/threads, only 128 threads printed their number, and with 8/256 blocks/threads, 1024 printed it. However, the total number of active threads is near 2000, so many threads are still executing the loop.

! all variables are inside a module, so they are global to all threads
! 'use module_name' is used in the actual code
integer, device :: flag
real*8, device :: total, inc

idx = blockDim%x*(blockIdx%x-1)+threadIdx%x
if(idx > N) return ! N is the problem size

if(idx == 1)then
    total = 0.0
    flag = 1
endif
call syncthreads()

do while (flag == 1)
    ! computations for threads here
    call syncthreads()
    if(idx == 1)then
        inc = ... ! compute some value from previous computations
        total = total+inc
        if(total >= TOTAL_LIMIT) flag = 0 ! all threads can leave the loop
    endif
    call syncthreads()
end do

print *, 'thread', idx

Is it related to the update of the flag variable in cache? Is there a flush-like command that updates the flag for all threads?

Hi Henrique,

I think the problem here is that you really need global synchronization to make this work correctly. “syncthreads” only synchronizes across threads within a single block, not globally. Also, not all blocks may be active at the same time, and blocks can run in any order. Hence the block that contains “idx == 1” may not execute before the other blocks, so “flag” may be uninitialized in those blocks.

Is it related to the update of the flag variable in cache?

Yes, this could be another issue in that you can’t guarantee that updates to flag are visible to all threads. You might try adding a “threadfence_system” call and adding the “volatile” attribute to flag so it is always read from global memory rather than a cached copy. Though you’ll still have issues with global synchronization, and we currently have a bug which causes volatile to be dropped at “-O2” or above, so you’d need to compile at “-O1”.
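As a sketch of that suggestion (a hypothetical fragment based on the snippet above; compile at -O1 because of the volatile bug mentioned):

```fortran
! device data in a module; 'volatile' forces re-reads from global memory
integer, volatile, device :: flag

! ... inside the kernel, the thread that owns the flag publishes its update:
if (idx == 1) then
   if (total >= TOTAL_LIMIT) flag = 0
   call threadfence()  ! ensure the store is visible device-wide before other blocks re-read flag
end if
call syncthreads()
```

Note that this only addresses visibility of the store; it does not synchronize blocks, so the ordering problem described above remains.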


If you have a device with compute capability 7.0 or higher (i.e. Volta), you might be able to achieve global synchronization by giving the kernel a “grid_global” attribute combined with using cooperative groups with a grid_group type. This can synchronize across all threads in a grid. I haven’t used it before myself, but understand it may have an adverse impact on performance. Example at the bottom of this post.

Details can be found at:

Note that with “grid_global” only the number of blocks that can run concurrently on the device will be launched. Hence, if your problem size is bigger than this, you’ll need to make sure that the kernel can process more than one element given the fixed number of threads.

If you can, the best route would probably be to remove the use of “flag” so that each block is independent. Sans that, I’d look at using cooperative groups.

Hope this helps,

Example using grid_global and cooperative groups:

module coop
contains
  attributes(grid_global) subroutine g1(a,b,n,some_offset)
    use cooperative_groups
    real, intent(inout) :: a(n), b(n)
    integer, value :: n, some_offset
    integer :: i, j
    type(grid_group) :: gg
    gg = this_grid()
    do i = gg%rank, n, gg%size
       a(i) = min(max(a(i),0.0),100.0) + 0.5
    end do
    call syncthreads(gg) ! grid-wide synchronization across all blocks
    do i = gg%rank, n, gg%size
       j = i + some_offset
       if (j > n) j = j - n ! wrap the offset index back into bounds
       b(i) = a(i) + a(j)
    end do
  end subroutine g1
end module coop

program m
  use cudafor
  use coop
  implicit none

  integer, parameter :: n = 2048
  real, allocatable :: a(:), b(:)
  real :: a_exp(n), b_exp(n)
  real, device, allocatable :: a_d(:), b_d(:)
  integer :: tPB = 256

  allocate(a(n), b(n))
  allocate(a_d(n), b_d(n))
  a = 2.00
  b = 1.00
  a_exp = 2.5
  b_exp = 5.0

  a_d = a
  b_d = b
  call g1<<<*, tPB>>>(a_d, b_d, n, 1)
  a = a_d
  b = b_d

  call checkf(a, a_exp, n) ! checkf (verification routine, not shown) compares results to expected values
  call checkf(b, b_exp, n)

end program m

Thanks a lot for your reply, I’ll take a look at everything. However, now I don’t understand why the working code that called the kernels with the same number of blocks and threads per block produced correct final results. If syncthreads() synchronizes only threads within a block, then I think the working code should not work due to the same problem.

After some debugging, I noticed that only threads within a block can get the value of a “shared” variable (in the example, the flag). What I need to do is to make all threads get the result of a reduction variable defined from values computed by each thread.

However, now I don’t understand why the working code that called the kernels with the same number of blocks and threads per block produced correct final results.

Not sure, though keep in mind I’m only making an educated guess here based on the small snippet of code you provided.

If possible, posting a full reproducing example would help better determine the issue.


The machines I’m using have Pascal and Volta GPUs with compute capability 6.1 and 7.0, respectively. So it’s not possible to use something like cooperative groups with the Pascal GPU?

No, sorry. Cooperative groups require hardware support, so they are only available on CC7.0 and later devices; Pascal is CC6.x.


The use of cooperative groups worked, but the performance was 2x worse than the performance achieved with the other code version.

Glad to hear that it worked. The performance degradation is not unexpected, but it’s better to have correct results rather than fast wrong results.

Again, the best route would be to rethink the algorithm so it doesn’t need global synchronization, but sans that, this is really the only way I know of to ensure correctness.
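For instance, one common restructuring (a sketch with hypothetical kernel and array names, not your actual code) is to split the work at the flag check: one kernel does an iteration’s computations plus a per-block partial reduction into a device array, the host combines the partials and tests the limit, and re-launches while needed. The kernel launch boundary then provides the grid-wide synchronization implicitly:

```fortran
! Hypothetical host-driven loop replacing the in-kernel flag loop.
! step_kernel, reduce_kernel, partial_d/partial_h are illustrative names.
total = 0.0d0
do
   call step_kernel<<<blocks,threads>>>()     ! one iteration of the per-thread computations
   call reduce_kernel<<<blocks,threads>>>()   ! per-block partial sums written to partial_d
   partial_h = partial_d                      ! device-to-host copy; synchronizes implicitly
   total = total + sum(partial_h)             ! host combines the block partials
   if (total >= TOTAL_LIMIT) exit             ! host decides when to stop iterating
end do
```

This keeps every block independent within a launch, so only syncthreads() is needed inside the kernels, at the cost of a small copy per iteration.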