I am using nvfortran and OpenMP to parallelize a series of sparse matrix-vector multiplications stored in the Sliced ELLPACK (SELL) format described in the work of Kreutzer et al. (2014). The rows are grouped into chunks of 32 to match the warp size.
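
For reference, here is a minimal sketch of the chunked storage layout assumed by the kernels below (hypothetical fill code with placeholder sizes; in practice the arrays are built from the lattice connectivity, and rows shorter than `nz` are padded with a zero coefficient and a harmless column index):

```
! Sketch of the SELL layout: chunk height VL = 32, nz nonzeros kept per row,
! nc chunks, np = nc*VL padded rows. The sizes here are placeholders.
program sell_layout_sketch
   implicit none
   integer, parameter :: wp = kind(1.0d0)
   integer, parameter :: VL = 32        ! chunk height = warp size
   integer, parameter :: nc = 4         ! number of chunks (example value)
   integer, parameter :: nz = 9         ! nonzeros stored per row
   integer, parameter :: np = nc*VL     ! padded number of rows
   real(wp) :: a(VL,nz,nc,0:8)          ! coefficients, one slice per direction q
   integer  :: ja(VL,nz,nc)             ! zero-based column indices, shared by all q
   integer  :: i, j, k, row

   a  = 0.0_wp                          ! padded entries contribute nothing
   ja = 0                               ! padded entries point at a valid dummy column
   do k = 1, nc
      do i = 1, VL
         row = (k-1)*VL + i             ! global (one-based) row index
         do j = 1, nz
            ! the j-th nonzero of 'row' goes into a(i,j,k,0:8),
            ! its zero-based column index into ja(i,j,k)
         end do
      end do
   end do
end program sell_layout_sketch
```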

The CPU and GPU versions of the kernel are as follows:

```
!
! LBM streaming via SpMV
!
! Evaluates the PDFs at the new timestep by semi-Lagrangian streaming
!
! f^{t+1} = S( f^t )
!
! For each lattice direction, q \in {1,...,8}, we perform one
! sparse matrix-vector multiplication. The algorithm can be summarized as:
!
! y_i0 = x_i0
!
! for q in 1, 8
! y_iq = \sum_j A_ijq x_jq
!
! where y := f^{t+1}, x := f^t, and A_ijq, q \in {1,...,8} are the
! sparse matrices for different directions which share the same
! sparsity pattern. Zero-based indexing is used.
!
subroutine sell_mv_cpu(nz,nc,np,a,ja,x,y)
   integer, parameter :: wp = kind(1.0d0)
   integer, parameter :: VL = 32
   integer, intent(in) :: nz, nc, np
   real(wp), intent(in) :: a(VL,nz,nc,0:8)
   integer, intent(in) :: ja(VL,nz,nc)
   real(wp), intent(in) :: x(0:np-1,0:8)   ! 0:np-1
   real(wp), intent(out) :: y(np,0:8)      ! 1:np
   real(wp) :: t(VL)
   integer :: i, j, k, q

   ! copy rest particles, q = 0
   y(:,0) = x(:,0)

   !$omp parallel default(private) shared(nc,a,ja,x,y) firstprivate(nz)
   !$omp do collapse(2) schedule(static) private(t)
   do q = 1, 8
      do k = 1, nc
         t = 0.0_wp
         do j = 1, nz
            !$omp simd
            do i = 1, VL
               t(i) = t(i) + a(i,j,k,q)*x(ja(i,j,k),q)
            end do
         end do
         y((k-1)*VL+1:(k-1)*VL+VL,q) = t
      end do
   end do
   !$omp end do
   !$omp end parallel
end subroutine sell_mv_cpu
subroutine sell_mv_gpu(nz,nc,np,a,ja,x,y)
   integer, parameter :: wp = kind(1.0d0)
   integer, parameter :: VL = 32
   ! nz = nonzeros per row
   ! nc = number of chunks
   ! np = number of rows, padded; np = nc * VL
   integer, intent(in) :: nz, nc, np
   ! a = sparse matrix coefficients
   ! ja = sparsity pattern
   real(wp), intent(in) :: a(VL,nz,nc,0:8)
   integer, intent(in) :: ja(VL,nz,nc)
   ! x = right-hand side vector
   ! y = left-hand side vector
   real(wp), intent(in) :: x(0:np-1,0:8)   ! 0:np-1
   real(wp), intent(out) :: y(np,0:8)      ! 1:np
   real(wp) :: t(VL)
   integer :: i, j, k, q

   ! copy rest particles, q = 0
   !$omp target teams loop
   do i = 1, np
      y(i,0) = x(i-1,0)
   end do

   !$omp target teams loop collapse(2) bind(teams) private(t,i,j)
   do k = 1, nc
      do q = 1, 8
         !$omp loop bind(parallel)
         do i = 1, VL
            t(i) = 0.0_wp
         end do
         do j = 1, nz
            !$omp loop bind(parallel)
            do i = 1, VL
               t(i) = t(i) + a(i,j,k,q)*x(ja(i,j,k),q)
            end do
         end do
         !$omp loop bind(parallel)
         do i = 1, VL
            y((k-1)*VL+i,q) = t(i)
         end do
      end do
   end do
end subroutine sell_mv_gpu
```
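
For completeness, here is a simplified, hypothetical sketch of how such a kernel could be driven with the arrays kept resident on the device via an enclosing target data region (the array names, time loop, and collision step are placeholders, not my actual code):

```
! Hypothetical driver: keep the matrix and both PDF buffers on the device and
! ping-pong between them across time steps (names are placeholders).
!$omp target data map(to:a,ja) map(tofrom:f1) map(alloc:f2)
do step = 1, nsteps, 2
   call sell_mv_gpu(nz, nc, np, a, ja, f1, f2)   ! stream f1 -> f2
   ! ... collision / boundary kernels acting on f2 would go here ...
   call sell_mv_gpu(nz, nc, np, a, ja, f2, f1)   ! stream f2 -> f1
   ! ... collision / boundary kernels acting on f1 would go here ...
end do
!$omp end target data
```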

Compiling with `nvfortran -mp=gpu -Minfo=mp -c sell_mv.f90`, I have verified that the loops are bound to teams and threads as desired:

```
77, !$omp target teams loop
77, Generating "nvkernel_sell_mv_gpu__F1L77_5" GPU kernel
Generating NVIDIA GPU code
78, Loop parallelized across teams collapse(2) ! blockidx%x
79, ! blockidx%x collapsed
81, Loop parallelized across threads(32) ! threadidx%x
84, Loop run sequentially
86, Loop parallelized across threads(32) ! threadidx%x
91, Loop parallelized across threads(32) ! threadidx%x
77, Generating Multicore code
78, Loop parallelized across threads
77, CUDA shared memory used for t
Generating implicit map(to:x(:,1:))
Generating implicit map(tofrom:y(:nc*32,1:))
Generating implicit map(to:a(:,j,:nc,1:),ja(:,j,:nc))
79, Generating implicit private(j)
```

The collapsed loops over q and k are distributed across thread blocks, and the inner loops over i are handled by the 32 threads of a warp, in accordance with the SELL storage format.

This kernel appears to work well; on an RTX 2060 I have measured an effective bandwidth of 238 GB/s (about 70% of the theoretical peak bandwidth of 336 GB/s).
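
For reference, an effective-bandwidth figure of this kind can be estimated from a simple traffic model. Below is a hypothetical helper (not part of my code, and not necessarily how the number above was obtained), assuming a and ja are read once and x and y are each touched once per direction and time step:

```
! Hypothetical helper for a back-of-the-envelope effective-bandwidth estimate:
! 8-byte reals, 4-byte indices; a and ja read once, x gathered and y written
! once per direction (q = 1..8), plus the q = 0 copy of np values in and out.
function sell_bw_gbs(nz, nc, np, seconds) result(bw)
   implicit none
   integer, parameter :: wp = kind(1.0d0)
   integer, parameter :: VL = 32
   integer, intent(in) :: nz, nc, np       ! sizes as in sell_mv_gpu
   real(wp), intent(in) :: seconds         ! measured kernel time per step
   real(wp) :: bw, bytes
   bytes = 16.0_wp*real(np,wp)             &  ! q = 0: read x(:,0), write y(:,0)
         + 64.0_wp*real(VL*nz*nc,wp)       &  ! a: 8 directions * 8 B per entry
         +  4.0_wp*real(VL*nz*nc,wp)       &  ! ja: shared by all directions
         + 128.0_wp*real(np,wp)               ! x and y: 8 directions * 8 B each
   bw = bytes/(seconds*1.0e9_wp)              ! GB/s
end function sell_bw_gbs
```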

When I try to use the `teams distribute` construct instead, as shown below, I don't appear to be able to bind the threads of a warp to the i-loops, and the performance is much worse for some reason.

Is there a hint I could provide, or does NVIDIA's OpenMP implementation favor the use of `!$omp loop`?

```
!$omp target teams distribute collapse(2) private(t,i,j)
do k = 1, nc
   do q = 1, 8
      !$omp parallel do
      do i = 1, VL
         t(i) = 0.0_wp
      end do
      do j = 1, nz
         !$omp parallel do
         do i = 1, VL
            t(i) = t(i) + a(i,j,k,q)*x(ja(i,j,k),q)
         end do
      end do
      !$omp parallel do
      do i = 1, VL
         y((k-1)*VL+i,q) = t(i)
      end do
   end do
end do
```