OpenMP teams distribute thread binding

I am using nvfortran and OpenMP to parallelize a series of sparse matrix-vector multiplications using the sliced ELLPACK (SELL) format described in the work of Kreutzer et al. (2014). The rows are distributed into chunks of length 32 to match the warp size.
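For reference, here is a minimal sketch of how the chunked arrays a(VL,nz,nc) and ja(VL,nz,nc) are laid out (my own illustration, not from the paper), assuming every row holds exactly nz nonzeros, np = nc*VL, and hypothetical helpers val(j,row) and col(j,row) that return the j-th coefficient and 0-based column index of a row:

   integer :: i, j, k, row

   ! Fill the SELL-32 arrays from a row-wise nonzero list (illustration only).
   ! Lane i of chunk k owns global row (k-1)*VL + i, and the j-th nonzeros of
   ! all 32 rows of a chunk sit contiguously in memory for coalesced access.
   do k = 1, nc
      do j = 1, nz
         do i = 1, VL
            row = (k-1)*VL + i        ! 1-based global row index
            a(i,j,k)  = val(j, row)   ! j-th coefficient of this row
            ja(i,j,k) = col(j, row)   ! 0-based column, matching x(0:np-1)
         end do
      end do
   end do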

The CPU and GPU versions of the kernel look as follows:

!
! LBM streaming via SpMV
!
! Evaluates the PDFs at the new timestep by semi-Lagrangian streaming
!
!  f^{t+1} = S( f^t )
!
! For each lattice direction, q \in {1,...,8}, we perform one 
! sparse_matrix vector multiplication. The algorithm can be summarized as:
!
! y_i0 = x_i0
!
! for q in 1, 8
!   y_iq = \sum_j A_ijq x_jq
!
! where y := f^{t+1}, x := f^t, and A_ijq, q \in {1,...,8} are the
! sparse matrices for different directions which share the same 
! sparsity pattern. Zero-based indexing is used.
!
subroutine sell_mv_cpu(nz,nc,np,a,ja,x,y)

   integer, parameter :: wp = kind(1.0d0)
   integer, parameter :: VL = 32

   integer, intent(in) :: nz, nc, np
   real(wp), intent(in) :: a(VL,nz,nc,0:8)
   integer, intent(in) :: ja(VL,nz,nc)
   real(wp), intent(in) :: x(0:np-1,0:8) ! 0:np-1
   real(wp), intent(out) :: y(np,0:8) ! 1:np

   real(wp) :: t(VL)
   integer :: i, j, k, q

   ! copy rest particles, q = 0
   y(:,0) = x(:,0)

   !$omp parallel default(private) shared(nc,a,ja,x,y) firstprivate(nz)
   !$omp do collapse(2) schedule(static) private(t)
   do q = 1, 8
      do k = 1, nc
         t = 0.0_wp
         do j = 1, nz
            !$omp simd
            do i = 1, VL
               t(i) = t(i) + a(i,j,k,q)*x(ja(i,j,k),q)
            end do
         end do
         y((k-1)*VL+1:k*VL,q) = t
      end do
   end do
   !$omp end do
   !$omp end parallel

end subroutine


subroutine sell_mv_gpu(nz,nc,np,a,ja,x,y)

   integer, parameter :: wp = kind(1.0d0)
   integer, parameter :: VL = 32

! nz = nonzeros per row
! nc = number of chunks
! np = number of rows, padded; np = nc * VL
   integer, intent(in) :: nz, nc, np
   
! a = sparse matrix coefficients
! ja = sparsity pattern
   real(wp), intent(in) :: a(VL,nz,nc,0:8)
   integer, intent(in) :: ja(VL,nz,nc)

! x = right-hand side vector
! y = left-hand side vector
   real(wp), intent(in) :: x(0:np-1,0:8)  ! 0:np-1
   real(wp), intent(out) :: y(np,0:8)     ! 1:np

   real(wp) :: t(VL)
   integer :: i, j, k, q

   ! copy rest particles, q = 0
   !$omp target teams loop
   do i = 1, np
      y(i,0) = x(i-1,0)
   end do

   !$omp target teams loop collapse(2) bind(teams) private(t,i,j)
   do k = 1, nc
      do q = 1, 8
         !$omp loop bind(parallel)
         do i = 1, VL
            t(i) = 0.0_wp
         end do
         do j = 1, nz
            !$omp loop bind(parallel)
            do i = 1, VL
               t(i) = t(i) + a(i,j,k,q)*x(ja(i,j,k),q)
            end do
         end do
         !$omp loop bind(parallel)
         do i = 1, VL
            y((k-1)*VL+i,q) = t(i)
         end do
      end do
   end do

end subroutine

Compiling with nvfortran -mp=gpu -Minfo=mp -c sell_mv.f90, I have verified that the thread binding is as desired:

     77, !$omp target teams loop
         77, Generating "nvkernel_sell_mv_gpu__F1L77_5" GPU kernel
             Generating NVIDIA GPU code
           78, Loop parallelized across teams collapse(2) ! blockidx%x
           79,   ! blockidx%x collapsed
           81, Loop parallelized across threads(32) ! threadidx%x
           84, Loop run sequentially 
           86, Loop parallelized across threads(32) ! threadidx%x
           91, Loop parallelized across threads(32) ! threadidx%x
         77, Generating Multicore code
           78, Loop parallelized across threads
     77, CUDA shared memory used for t
         Generating implicit map(to:x(:,1:)) 
         Generating implicit map(tofrom:y(:nc*32,1:)) 
         Generating implicit map(to:a(:,j,:nc,1:),ja(:,j,:nc)) 
     79, Generating implicit private(j)

The collapsed loops over q and k are distributed across grid blocks, and the inner loop over i is handled by the threads in a warp, in accordance with the SELL storage format.

This kernel appears to work well; on an RTX 2060 I’ve measured an effective bandwidth of 238 GB/s (about 70% of the theoretical peak of 336 GB/s).
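For context, a minimal sketch of how such an effective-bandwidth figure can be obtained (the traffic model is my assumption, not the post’s stated method): time the kernel and divide an estimated byte count, here one read each of a, ja and x plus one store of y, by the elapsed time:

   use omp_lib, only: omp_get_wtime
   real(wp) :: t0, t1, bw_gbs
   integer(8) :: bytes

   ! 8-byte reals for a, x and y; 4-byte integers for ja (illustrative model)
   bytes = 8_8*size(a, kind=8) + 4_8*size(ja, kind=8) &
         + 8_8*size(x, kind=8) + 8_8*size(y, kind=8)

   t0 = omp_get_wtime()
   call sell_mv_gpu(nz, nc, np, a, ja, x, y)
   t1 = omp_get_wtime()
   bw_gbs = real(bytes, wp) / (t1 - t0) / 1.0e9_wp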

When I try to use the teams distribute construct, as shown below, I don’t appear to be able to bind the threads in a warp to the i-loop, and the performance is much worse for some reason.

Is there a hint I could provide, or does NVIDIA’s OpenMP implementation favor the use of !$omp loop?

   !$omp target teams distribute collapse(2) private(t,i,j)
   do k = 1, nc
      do q = 1, 8
         !$omp parallel do
         do i = 1, VL
            t(i) = 0.0_wp
         end do
         do j = 1, nz
            !$omp parallel do
            do i = 1, VL
               t(i) = t(i) + a(i,j,k,q)*x(ja(i,j,k),q)
            end do
         end do
         !$omp parallel do
         do i = 1, VL
            y((k-1)*VL+i,q) = t(i)
         end do
      end do
   end do

“distribute” allows a wider array of constructs, such as metadirectives, which require delaying scheduling decisions until runtime. Hence the compiler must “outline” the region, which is then passed to the runtime for final scheduling. This adds a bit of overhead and reduces the compiler’s ability to fully optimize the code.

Since “loop” is more restrictive, the compiler can apply more optimization at compile time, which often results in more performant code. Our recommendation is to use “loop” where possible and only use “distribute” when needed.
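To make the distinction concrete, here is a hypothetical sketch (mine, not from the reply) of a metadirective with a dynamic user={condition(...)} selector; because the variant is only known at runtime, this is the kind of construct that forces the outlined, distribute-style code path (OpenMP 5.2 spells the fallback clause otherwise instead of default):

subroutine scale_vec(n, s, v)
   integer, value :: n
   real, value :: s
   real, intent(inout) :: v(n)
   integer :: i

   ! The condition depends on the runtime value of n, so the
   ! directive variant cannot be chosen at compile time.
   !$omp metadirective &
   !$omp&  when(user={condition(n > 100000)}: target teams distribute parallel do) &
   !$omp&  default(parallel do)
   do i = 1, n
      v(i) = s * v(i)
   end do
end subroutine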

For some reason I thought that loop is less restrictive, giving the implementation more freedom. Does this mean that !$omp loop can’t be used with metadirectives?

Not directly related to the loop nests above: I’ve read somewhere that for scalar parameters (which typically include array sizes) it is beneficial if they are passed by value (just like in CUDA kernels), and if not, they can be put in the firstprivate() clause. This saves the runtime the trouble of doing the implicit tofrom memory mapping. Is this information correct?

In the OpenMP 5.2 spec it says,

The data-sharing attributes of variables that are referenced in a region, but not in the corresponding construct, are determined as follows:

  • […]
  • Dummy arguments of called routines in the region that have the VALUE attribute are private.

Here are some examples to illustrate what I mean:

subroutine relax1(f,n,omega)
   integer, value :: n
   real, intent(inout) :: f(n)
   real, value :: omega
   integer :: i

   !$omp target teams loop map(tofrom: f)
   do i = 1, n
       f(i) = f(i) * (1.0 - omega)     ! omega is private, due to value attribute
   end do
end subroutine

subroutine relax2(f,n,omega)
   integer, value :: n
   real, intent(inout) :: f(n)
   real, intent(in) :: omega
   integer :: i

   !$omp target teams loop map(tofrom: f) firstprivate(omega)
   do i = 1, n
       f(i) = f(i) * (1.0 - omega)     ! omega is firstprivate
   end do
end subroutine

subroutine relax3(f,n,omega)
   integer, value :: n
   real, intent(inout) :: f(n)
   real, intent(in) :: omega
   integer :: i

   !$omp target teams loop map(tofrom: f)
   do i = 1, n
       f(i) = f(i) * (1.0 - omega)     ! omega is shared
   end do
end subroutine

The information shown is the same in all three cases:

$ nvfortran -c -mp=gpu -Minfo=all ../../gemm/test_relax.f90 
relax1:
      7, !$omp target teams loop
          7, Generating "nvkernel_relax1__F1L7_2" GPU kernel
             Generating NVIDIA GPU code
            8, Loop parallelized across teams, threads(128) ! blockidx%x threadidx%x
          7, Generating Multicore code
            8, Loop parallelized across threads
      7, Generating map(tofrom:f(:)) 
relax2:
     19, !$omp target teams loop
         19, Generating "nvkernel_relax2__F1L19_4" GPU kernel
             Generating NVIDIA GPU code
           20, Loop parallelized across teams, threads(128) ! blockidx%x threadidx%x
         19, Generating Multicore code
           20, Loop parallelized across threads
     19, Generating map(tofrom:f(:)) 
relax3:
     31, !$omp target teams loop
         31, Generating "nvkernel_relax3__F1L31_6" GPU kernel
             Generating NVIDIA GPU code
           32, Loop parallelized across teams, threads(128) ! blockidx%x threadidx%x
         31, Generating Multicore code
           32, Loop parallelized across threads
     31, Generating map(tofrom:f(:))

Metadirectives can be used with loop provided the conditions can be evaluated at compile time. For runtime conditions, you need to use distribute.
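For example, a selector that is resolvable at compile time, such as a device arch trait, could still be paired with loop; a hypothetical sketch (the arch string is my assumption), replacing the directive in the scale_vec sketch above:

   ! The target architecture is known when compiling with -mp=gpu,
   ! so the variant can be selected statically and "loop" is usable.
   !$omp metadirective &
   !$omp&  when(device={arch("nvptx")}: target teams loop) &
   !$omp&  default(parallel do)
   do i = 1, n
      v(i) = s * v(i)
   end do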

I’ve read somewhere that for scalar parameters (which typically include array sizes) it is beneficial if they are passed by value (just like in CUDA kernels), and if not, they can be put in the firstprivate() clause. This saves the runtime the trouble of doing the implicit tofrom memory mapping. Is this information correct?

Yes, but it does not apply here, given that the subroutines themselves are outside of the compute region.

For subroutines called within the compute region, it is better to use “value” so the compiler can implicitly privatize the scalar arguments.
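A hypothetical sketch of that advice (relax_pt is my own helper, assumed to be in the same module as its caller so an explicit interface is visible); the VALUE dummies are implicitly private when the function is invoked inside the target region, so they need no map or firstprivate clause:

real function relax_pt(fi, omega) result(r)
   !$omp declare target
   real, value :: fi, omega    ! VALUE dummies => private on the device
   r = fi * (1.0 - omega)
end function

subroutine relax4(f, n, omega)
   integer, value :: n
   real, intent(inout) :: f(n)
   real, value :: omega
   integer :: i

   !$omp target teams loop map(tofrom: f)
   do i = 1, n
      f(i) = relax_pt(f(i), omega)   ! scalars passed by value at the call
   end do
end subroutine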