CUDA Fortran - global synchronization

Hi all,

is it possible to synchronize the entire CUDA grid using CUDA Fortran? The documentation is very misleading about this. On the one hand, there is this new grid_global attribute that should allow doing just that. However, I've found the following statement, which seems to clearly contradict such functionality: "There is currently limited functionality for cooperative groups of size less than or equal to a thread block".

I'm confused. Syncing threads within a block was already possible before cc70. If one cannot go beyond that, what are grid_global kernels good for anyway?

Could someone clarify this issue for me?

Best
Bart

Yes, you can synchronize all the threads in a grid in CUDA Fortran. As shown below, if you launch a kernel that has the grid_global attribute, you can then create a cooperative-groups grid_group derived type and pass it to syncthreads, like this:

attributes(grid_global) subroutine g()
use cooperative_groups
type(grid_group) :: gg
gg = this_grid()
. . .
call syncthreads(gg)

Before this, there was no clean way to synchronize threads across different blocks. In part, the grid_global launch ensures that the number of blocks is limited such that they can all remain resident on the GPU and sync’ing them is possible.

Currently, we also support sync’ing all the threads in a block (equivalent to the traditional syncthreads() call), and all the threads in a warp (which was sometimes assumed via warp-synchronous programming).

What we don't currently support in CUDA Fortran is the ability to create a group of an arbitrary number of threads (like 4, for instance) and a way to sync just those 4. That may come in future releases; the support for that in CUDA C++ is still maturing as well.
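For example, a kernel can grab any of those three group levels and sync on them. A quick sketch (assuming the this_thread_block() and this_warp() entry points from the cooperative_groups module; treat it as illustration rather than tested code):

attributes(grid_global) subroutine sync_levels_demo()
   use cooperative_groups
   implicit none
   type(grid_group)      :: gg   ! all threads in the grid
   type(thread_group)    :: tb   ! all threads in this block
   type(coalesced_group) :: wg   ! all threads in this warp

   gg = this_grid()
   tb = this_thread_block()
   wg = this_warp()

   call syncthreads(gg)   ! grid-wide barrier (needs a grid_global launch)
   call syncthreads(tb)   ! equivalent to the traditional syncthreads()
   call syncthreads(wg)   ! warp-level sync
end subroutine sync_levels_demo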

Hi brentl,

thanks for the quick answer. Just to clarify: if I launch a 1-dimensional grid consisting of 2 blocks (each having only 1 thread), will this synchronize?

To put it differently, launching a kernel like call <<<1,2>>>(…) should produce the same result as call <<<2,1>>>(…).
The only difference between the two is that the latter call requires a global synchronization, while the former does not.

I’ve tested that and it does not work. The order does matter.

Best
Bart

Interesting. Well, I’d have to see your kernel. I haven’t done an experiment like that with only two threads.

Well, let's take a simple kernel like this:

attributes(grid_global) subroutine test_sync(y)
   use cooperative_groups, only : grid_group, this_grid
   implicit none
   integer, intent(inout) :: y(:)
   integer :: N, idx, i
   type(grid_group) :: gg

   N = size(y,1)
   gg = this_grid()
   idx = threadIdx%x + blockDim%x * (blockIdx%x-1)

   if (idx <= N) then
      y(idx) = idx
      do i = 1, N
         if (idx > 1) y(idx) = y(idx-1)
         call syncthreads(gg)
      end do
   end if

end subroutine test_sync

Now, you can either launch it with 1 block of N threads:

! There is only one block (no need for global sync)
t_block = dim3(N, 1, 1)
grid = dim3(1, 1, 1)
call test_sync<<<grid, t_block>>>(y_d)

or using N blocks, having 1 thread each:

! This requires global sync to work
t_block = dim3(1, 1, 1)
grid = dim3(N, 1, 1)
call test_sync<<<grid, t_block>>>(y_d)

It should not matter which one we choose. Unfortunately, it does. Only the first configuration produces the correct answer, which is y_d = (1, …, 1). That means the global sync does not work. Am I wrong?

Compilation option: pgfortran -Mcuda=cuda9.2,cc70
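For completeness, a minimal driver along these lines exercises both launches (the module name test_kernels and N = 32 are just placeholders for my actual setup):

program driver
   use cudafor
   use test_kernels              ! placeholder: module containing test_sync above
   implicit none
   integer, parameter :: N = 32  ! placeholder size
   integer, device :: y_d(N)
   integer :: y(N), istat
   type(dim3) :: grid, t_block

   ! configuration 1: one block of N threads (block-level sync would suffice)
   t_block = dim3(N, 1, 1)
   grid    = dim3(1, 1, 1)
   call test_sync<<<grid, t_block>>>(y_d)
   istat = cudaDeviceSynchronize()
   y = y_d
   print *, y                    ! expect all 1s

   ! configuration 2: N blocks of 1 thread (relies on the grid-wide sync)
   t_block = dim3(1, 1, 1)
   grid    = dim3(N, 1, 1)
   call test_sync<<<grid, t_block>>>(y_d)
   istat = cudaDeviceSynchronize()
   y = y_d
   print *, y                    ! should also be all 1s, but is not
end program driver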

Thanks
Bart

UPDATE:
The same problem persists even for a much simpler example:

    attributes(grid_global) subroutine test_sync_2(y, N)
        use cooperative_groups, only : grid_group, this_grid
        implicit none
        integer, value :: N
        integer, device, intent(inout) :: y(N)
        integer :: i, z
        type(grid_group) :: gg

        gg = this_grid()

        do i = gg%rank, N, gg%size
           y(i) = i
           call syncthreads(gg)
           z = y(i) - y(max(1,i-1))
           call syncthreads(gg)
           y(i) = z
        end do
    end subroutine test_sync_2

This kernel computes a discrete derivative of the array y (here y(i)=i), which should result in y being (0, 1, 1, …, 1). To make sure the threads are used in a persistent-thread fashion, one can launch the kernel like this

call test_sync_2<<<*, N>>>(y_d,N)

where a global sync is not necessary or

call test_sync_2<<<*, 1>>>(y_d,N)

where a global sync is in fact required. Again, the result should be the same, but it is not.

Best
Bart

Yes, interesting. Your behavior (which I can recreate) is not what I would have expected. I’m checking with some others here.

Oh, I see now.
The overloading of syncthreads(), to take a grid_group argument, is in the cooperative_groups module. When you use the module with an only list that leaves syncthreads out, the overload is not visible, and it looks like we call the normal syncthreads(). That makes sense (even if it is less than obvious). To get your tests to run, change the 2nd line in subroutine test_sync to
use cooperative_groups
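so the declaration part of test_sync reads like this (the rest of the kernel stays as you have it):

attributes(grid_global) subroutine test_sync(y)
   use cooperative_groups       ! whole module, so the syncthreads(gg) overload is visible
   implicit none
   integer, intent(inout) :: y(:)
   integer :: N, idx, i
   type(grid_group) :: gg
   . . .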

Hey brentl,

Thank you for your help in this matter. Indeed, that solves the problem. Using cooperative_groups like this

use cooperative_groups, only : syncthreads, grid_group, this_grid

also works. Now, at least, I understand the origin of the problem. What I do not understand is why the standard syncthreads does not report an error when called with an extra argument. I would classify this as a bug.

I have also encountered another issue which is equally disturbing. Compare the following two versions of the same piece of code. The first one:

        do i = gg%rank, N, gg%size
           y(i) = i
           call syncthreads(gg)
           z = y(i)-y(max(1,i-1))
           call syncthreads(gg)
           y(i) = z
        end do

which does not work (it generates an infinite loop or something similar), and

 
        do i = gg%rank, N, gg%size
           y(i) = i
        end do
        call syncthreads(gg)
        do i = gg%rank, N, gg%size
           z = y(i)-y(max(1,i-1))
        end do
        call syncthreads(gg)
        do i = gg%rank, N, gg%size
           y(i) = z
        end do

which synchronizes perfectly.

I have written similar code in C++ and, as expected, everything works just fine. All of this makes me think that the CUDA Fortran implementation of the global sync capabilities is far from optimal at this point. There is too much uncertainty to use CUDA Fortran for professional applications.

Best
Bart

In your first case, you are not guaranteed that every thread calls the grid sync, or calls it the same number of times (with the strided loop, threads can execute different numbers of iterations, and threads whose rank exceeds N never enter the loop at all).

In the second case, you are.

This is just not true! And again, I have the same code written in C++ and it works!

You can call the first kernel with exactly N threads, or even replace the strided loop with an if statement, and it still does not work.
Seriously, this is ridiculous.

Not sure what to tell you. This works for me:

attributes(grid_global) subroutine test_sync(y, N)
use cooperative_groups
implicit none
integer, value :: N
integer, intent(inout) :: y(*)
integer :: i, z
type(grid_group) :: gg
gg = this_grid()
do i = gg%rank, N, gg%size
y(i) = i
call syncthreads(gg)
z = y(i) - y(max(1,i-1))
call syncthreads(gg)
y(i) = z
end do
end subroutine test_sync

Well, it does not work for me. If it does for you, it would be nice to learn more details. For instance:

  1. your operating system
  2. the GPU you are using
  3. CUDA toolkit version
  4. pgfortran version
  5. compilation flags
  6. how you launch your kernel

If you could send me your entire Fortran program, that would definitely help me understand what is going on.

Thank you
Bart

I’ve run with 18.10 and with 19.1, on a V100.

brentl@epyc2:~/tmp> pgf90 uf2.cuf
brentl@epyc2:~/tmp> ./a.out
0 1 1 1
0 1 1 1
brentl@epyc2:~/tmp> v
/proj/pgi/linux86-64/18.10/bin/pgf90
brentl@epyc2:~/tmp> cat uf2.cuf
module m
contains
attributes(grid_global) subroutine test_sync(y, N)
use cooperative_groups
implicit none
integer, value :: N
integer, intent(inout) :: y(*)
integer :: idx, i, z
type(grid_group) :: gg
gg = this_grid()
do i = gg%rank, N, gg%size
y(i) = i
call syncthreads(gg)
z = y(i) - y(max(1,i-1))
call syncthreads(gg)
y(i) = z
end do
end subroutine test_sync
end module

program p
use m
use cudafor
integer, managed :: y(4)
call test_sync<<<4,1>>>(y, 4)
istat = cudaDeviceSynchronize()
print *,y
call test_sync<<<1,4>>>(y, 4)
istat = cudaDeviceSynchronize()
print *,y
end

OK, so now I understand what is going on. This

test_sync<<<N,1>>>(y,N)

works (as in your example). However,

test_sync<<<*,1>>>(y,N)

does not (as in my example). In particular, your code only works if the number of blocks is exactly N (i.e. when the loop is not really strided). For instance, this

test_sync<<<N+1,1>>>(y,N)

will not work. It makes me wonder if this is intended since similar code written in C++ works just fine.

The "*" syntax in the launch is basically shorthand for "call cudaOccupancyMaxActiveBlocksPerMultiprocessor and plug in the value it returns." That is probably a very large number for small kernels like this.

Maybe CUDA C++ knows something we don't know in the PGI launch code, so that extra blocks are not launched. Not sure how you would decide what counts as "extra", though.

If you launch with a “*” and just do this in your kernel:

print *,gg%rank, gg%size

I get 2560 threads launched.
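That number lines up with the device limits. Roughly (a host-side sketch with use cudafor in scope; blocksPerSM here is just a stand-in for what the occupancy calculation returns for this kernel):

integer :: istat, blocksPerSM, numBlocks
type(cudaDeviceProp) :: prop

istat = cudaGetDeviceProperties(prop, 0)
blocksPerSM = 32                 ! stand-in: a tiny kernel can hit the per-SM resident-block limit
numBlocks = blocksPerSM * prop%multiProcessorCount   ! 32 * 80 = 2560 on a V100
call test_sync<<<numBlocks, 1>>>(y, N)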

I still think that if not all threads call syncthreads(gg), you will get a hang, just like with the traditional syncthreads() for threads in a block.
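If you want the fused loop to tolerate whatever launch size "*" picks, one option (just a sketch, not something I have run, and it only addresses the barrier mismatch, not what happens to the values at the stride boundaries) is to give every thread the same trip count and guard the memory accesses:

attributes(grid_global) subroutine test_sync_safe(y, N)
   use cooperative_groups
   implicit none
   integer, value :: N
   integer, intent(inout) :: y(*)
   integer :: i, t, ntrips, z
   type(grid_group) :: gg

   gg = this_grid()
   ! same trip count for every thread in the grid, so every thread
   ! reaches every syncthreads(gg), even when gg%size > N
   ntrips = (N + gg%size - 1) / gg%size
   do t = 0, ntrips - 1
      i = gg%rank + t * gg%size
      if (i <= N) y(i) = i
      call syncthreads(gg)
      z = 0
      if (i <= N) z = y(i) - y(max(1, i-1))
      call syncthreads(gg)
      if (i <= N) y(i) = z
   end do
end subroutine test_sync_safe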

Oh, and speaking of the traditional syncthreads: we did open a bug report on passing an argument to syncthreads() when the module that overloads it is not in use. FS#26848.

Sure, with "*" I launch the number of blocks that can reside simultaneously on the GPU. I agree that this number, in all tested cases, was much bigger than N. But this alone should not be a problem. Moreover, something similar happens when the kernel is launched with fewer threads than N.

I agree, what the CUDA C++ code does remains to be understood… As to your second point, there is actually a simple way to test it. Namely, by changing

gg = this_grid()
do i = gg%rank, N, gg%size
...
end do

to

do i = gg%rank, N, gg%size
gg = this_grid()
...
end do

we ensure (don't we?) that all threads can reach syncthreads(gg). Thus, if you are right, this should work. Unfortunately, it does not. How do you explain this?

I just might need a little more context in your examples. I think:

  1. If you use "*" for grid and, as in my example, 2560 blocks are launched, each with 1 thread, I think you still need all 2560 threads to call syncthreads(gg) for it not to hang.
  2. If you use a cooperative group from a block (the traditional CUDA group), every thread in the block needs to call syncthreads().
  3. If you use a cooperative group that is a warp (coalesced group), currently all 32 threads need to call syncthreads(cg).

There might be some minor differences between CUDA Fortran and CUDA C++ in what happens with threads that exit the kernel, but that is AFAIK undefined behavior that you should not count on.