Time gaps between OpenACC kernels

Hi,

I’m porting a Fortran code to the GPU using OpenACC on GH200 (Linux), and I’ve noticed large delays between successive kernel launches. I’m using nvhpc 24.7. The Nsight Systems report is here: gaps.nsys-rep - Google Drive

Each kernel is contained in its own subroutine, and the subroutines are called one immediately after another, so there should be no data transfers to or from the host. I’m not sure what’s causing the gaps. I’m also compiling with -gpu=unified.

Following this answer (cuda - Reducing time to launch kernels in time stepping loop - OpenACC - Stack Overflow), I also added async(1) to each !$acc parallel loop directive, but it doesn’t seem to help much.

Hi n626,

I took a look at the profile and I’m not seeing any data movement that would account for this. There’s a small amount of launch overhead, but it seems more likely that something happening on the host is causing the gaps.

If you can provide a reproducing example, I can take a look to see if I spot anything.

My answer over on SO was for cases where the launch overhead is high. Using “async” can sometimes help hide that overhead, since the CPU can start launching the next kernel before the previous one has finished. Once you determine what on the host is causing the gaps, async might help a bit, but since launch overhead is only 5-9 us per launch on GH200, probably not much. It was more beneficial on older devices where the launch overhead was higher.
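To illustrate the pattern from that answer, here’s a minimal sketch (the array names and sizes are placeholders, not from your code): each launch is queued on async queue 1 so the host can keep issuing launches, and a single wait at the end synchronizes once instead of once per kernel.

program async_sketch
  implicit none
  integer, parameter :: n = 300000, nsteps = 1000
  integer :: step, j
  real, allocatable :: a(:), b(:)

  allocate (a(n), b(n))
  a = 0.0
  b = 1.0
!$acc data copy(a) copyin(b)
  do step = 1, nsteps
     ! queued on async queue 1; the host returns immediately and can
     ! issue the next launch while this kernel is still running
!$acc parallel loop gang vector async(1)
     do j = 1, n
        a(j) = a(j) + b(j)
     enddo
  enddo
  ! one synchronization at the end instead of one per kernel
!$acc wait(1)
!$acc end data
end program async_sketch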

-Mat

Hi Mat,

Thank you for looking at this. I’ve created a self-contained example from one of the kernels in the code here (main.f90): CUDA - Google Drive
The profile is in example.nsys-rep.

I’ve also created a CUDA Fortran version of the subroutine. I’m just calling each repeatedly in a loop without any host activity in between, but it looks like the OpenACC version still has larger gaps than the CUDA version, apparently due to ioctl calls.

This is more pronounced when the array size is smaller, e.g. nn = 3000: the time between launches (in the CUDA API row in nsys) is then about 30 us.

The only thing I’m seeing is about 2 us spent entering and exiting the implicit data region (add “openacc” to the nsys --trace options to see the OpenACC runtime calls). I can get rid of this by adding “default(present)” to the parallel loop construct so the compiler knows all of the data is already present on the device.
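If it helps, here’s a minimal standalone sketch (separate from main.f90) showing where the clause goes. Note that “default(present)” asserts the data is already on the device; in your case the data is accessible because of -gpu=unified, while in this sketch the enter/exit data directives are only there to make it self-contained without unified memory.

program present_sketch
  implicit none
  integer, parameter :: n = 300000
  integer :: j
  real, allocatable :: a(:)

  allocate (a(n))
  a = 0.0
  ! copy the array to the device up front so the present check succeeds
!$acc enter data copyin(a)
!$acc parallel loop gang vector default(present)
  do j = 1, n
     a(j) = a(j) + 1.0
  enddo
  ! copy the result back and remove the device copy
!$acc exit data copyout(a)
end program present_sketch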

All the remaining time appears to be loop overhead plus some subroutine call overhead on the host. This can be measured by instrumenting the code with NVTX.

Try using this version to see if the gaps remain; if they do, the NVTX ranges will give us an idea of where in the host code they’re coming from.

Note that I also added a “wait” directive so the CUDA Fortran calls don’t start before the OpenACC section has finished.

module util
  implicit none

  integer :: ne, dof, irk
  real, allocatable, dimension(:,:,:) :: ze,qx,qy,rhs_ze,rhs_qx,rhs_qy
contains

  subroutine init(n, d, r)
    implicit none

    integer, value :: n, d, r
    integer :: k, j

    ne = n
    dof = d
    irk = r

    if (.not. allocated(ze)) then
       allocate (ze(dof,ne,irk), qx(dof, ne,irk), qy(dof, ne,irk))
       allocate (rhs_ze(dof,ne,irk), rhs_qx(dof,ne,irk), rhs_qy(dof,ne,irk))
    endif

    ! some arbitrary initialization on host
    DO J = 1,NE
       DO K = 1,DOF
          ! use real division so the values aren't all truncated to zero
          ze(k,j,1) = real(j)/real(ne)
          qx(k,j,1) = ze(k,j,1) + real(k)/real(ne)
          qy(k,j,1) = qx(k,j,1) + real(j)/real(dof)

          rhs_ze(k,j,1) = real(j)/real(ne)
          rhs_qx(k,j,1) = ze(k,j,1) + real(k)/real(ne)
          rhs_qy(k,j,1) = qx(k,j,1) + real(j)/real(dof)
       enddo
    enddo


  end subroutine init


  subroutine rk_update()
    implicit none
    integer :: k, j, i

! async(1) queues the launch without blocking the host; default(present)
! tells the compiler the data is already on the device, so no implicit
! data region is created around the kernel
!$acc parallel loop gang vector async(1) default(present)
    DO J = 1,NE
       do i = 1,IRK-1
          DO K = 1,DOF
             ZE(K,J,i+1) = ZE(K,J,i)  + RHS_ZE(K,J,i)
             QX(K,J,i+1) = QX(K,J,i)  + RHS_QX(K,J,i)
             QY(K,J,i+1) = QY(K,J,i)  + RHS_QY(K,J,i)
          enddo
       enddo
    enddo

  end subroutine rk_update

  attributes(global) subroutine rk_update_cuda()
    implicit none
    integer :: k, j, i

    j = (blockidx%x-1)*blockdim%x + threadidx%x

    if (j <= ne) then
       do i = 1,IRK-1
          DO K = 1,DOF
             ZE(K,J,i+1) = ZE(K,J,i)  + RHS_ZE(K,J,i)
             QX(K,J,i+1) = QX(K,J,i)  + RHS_QX(K,J,i)
             QY(K,J,i+1) = QY(K,J,i)  + RHS_QY(K,J,i)
          enddo
       enddo
    endif

  end subroutine rk_update_cuda


end module util

program main
  use util
  use nvtx
  implicit none
  integer :: i, N, nn

  nn = 300000
  N = 1000
  call init(n=nn, d=3, r=2)
  call nvtxStartRange("Start ACC")
  do i = 1,N
     call nvtxStartRange("rk_update")
     call rk_update()
     call nvtxEndRange
  enddo
  ! make sure the OpenACC kernels have finished before the CUDA Fortran loop starts
  !$acc wait
  call nvtxEndRange

  ! cuda version
  call init(n=nn, d=3, r=2)
  call nvtxStartRange("Start CUF")
  do i = 1,N
     call nvtxStartRange("rk_update_cuda")
     call rk_update_cuda<<<nn/128+1, 128>>>()
     call nvtxEndRange
  enddo
  call nvtxEndRange

end program main

Thank you, adding default(present) solves it. In the full code I’m seeing a significant speedup, around 1.6x, which matches the CUDA version.
