OpenACC routine behavior nvfortran

Hello,

I encountered issues calling an acc routine within an acc kernels construct, and was able to create a reproducible example, included below. The routine performs a reduction, which the compiler correctly identifies. In short, I have observed two issues:

  • If the routine call requires the creation of an array temporary, the program fails at runtime with error 700 (illegal address)
  • If the routine performs a vector reduction, the wrong answer is returned

The routine behaves as desired if it is declared as sequential, called outside of a kernels construct, or written inline within the kernels code. The first two options are undesirable for performance reasons; the third makes for unwieldy code.

Is any of this behavior expected? I am using nvfortran 24.7 with CUDA 12.2 on an A100 GPU.

! A minimal nvhpc error case
module formulae
  implicit none

contains
  
!"""
! Integrate an equispaced grid, trapezoidal
!
!"""
real(kind=8) function integrate(y,dx,n)
  !$acc routine seq
  implicit none
  real(kind=8), intent(in) :: dx,y(:)
  integer, intent(in) :: n

  integrate=(sum(y(1:n))-(y(1)+y(n))*0.5)*dx
end function integrate

real(kind=8) function integrate_vec(y,dx,n)
  !$acc routine vector
  implicit none
  real(kind=8), intent(in) :: dx,y(:)
  integer, intent(in) :: n

  integrate_vec=(sum(y(1:n))-(y(1)+y(n))*0.5)*dx
end function integrate_vec

end module formulae

program test
  use openacc

  call test_cpu   ! Returns 9999.99
  call test_seq   ! Returns 9999.99
  call test_vec   ! Returns  312.49
  call test_inline ! Returns 9999.99
  call test_array_temporary ! error 700: Illegal address during kernel execution
  
end program test

subroutine test_array_temporary
  use formulae
  
  implicit none
  real(kind=8),allocatable :: a(:),b(:)
  real(kind=8) :: y,dx
  integer :: n
  
  n=1e6
  dx=0.01
  allocate(a(n),b(n))
  !$acc enter data create(a,b)
  !$acc kernels present(a,b)
  a=1.0
  b=1.0
  y=integrate(a*b,dx,n)
  !$acc end kernels
  !$acc exit data delete(a,b)
  deallocate(a,b)

  write(6,*) 'Test Array Temporary:', y
  
end subroutine test_array_temporary

subroutine test_seq
  use formulae
  
  implicit none
  real(kind=8),allocatable :: a(:),b(:),c(:)
  real(kind=8) :: y,dx
  integer :: n
  
  n=1e6
  dx=0.01
  allocate(a(n),b(n),c(n))
  !$acc enter data create(a,b,c)
  !$acc kernels present(a,b,c)
  a=1.0
  b=1.0
  c=a*b
  y=integrate(c,dx,n)
  !$acc end kernels
  !$acc exit data delete(a,b,c)
  deallocate(a,b,c)

  write(6,*) 'test seq:', y
  
end subroutine test_seq

subroutine test_cpu
  use formulae
  
  implicit none
  real(kind=8),allocatable :: a(:),b(:)
  real(kind=8) :: y,dx
  integer :: n
  
  n=1e6
  dx=0.01
  allocate(a(n),b(n))
  a=1.0
  b=1.0
  y=integrate(a*b,dx,n)
  deallocate(a,b)

  write(6,*) 'test cpu:', y
  
end subroutine test_cpu

subroutine test_vec
  use formulae
  
  implicit none
  real(kind=8),allocatable :: a(:),b(:),c(:)
  real(kind=8) :: y,dx
  integer :: n
  
  n=1e6
  dx=0.01
  allocate(a(n),b(n),c(n))
  !$acc enter data create(a,b,c)
  !$acc kernels present(a,b,c)
  a=1.0
  b=1.0
  c=a*b
  y=integrate_vec(c,dx,n)
  !$acc end kernels
  !$acc exit data delete(a,b,c)
  deallocate(a,b,c)

  write(6,*) 'test vec:', y
  
end subroutine test_vec

subroutine test_inline
  use formulae
  
  implicit none
  real(kind=8),allocatable :: a(:),b(:),c(:)
  real(kind=8) :: y,dx
  integer :: n
  
  n=1e6
  dx=0.01
  allocate(a(n),b(n),c(n))
  !$acc enter data create(a,b,c)
  !$acc kernels present(a,b,c)
  a=1.0
  b=1.0
  c=a*b
  y=(sum(c)-0.5*(c(1)+c(n)))*dx
  !$acc end kernels
  !$acc exit data delete(a,b,c)
  deallocate(a,b,c)

  write(6,*) 'test inline:', y
  
end subroutine test_inline

Hi Brett,

y=integrate(a*b,dx,n)

Yes, this is going to be problematic. The created temp array will be private and allocated on the stack, which can lead to stack overflows (the device stack is very small). Now I’m not 100% sure that this is the complete problem here, but I’d recommend avoiding this syntax and using the “c” array instead, especially since you don’t want the temp array to be private.

For the “vec” version, what’s happening is that the compiler splits each line of this section into its own kernel. However, the “integrate_vec” line can’t be parallelized, so a serial kernel is created. Given there’s only one thread when it enters the vector routine, which is expecting 32 threads, only 1/32 of the reduction gets computed. Better to move the kernels region inside “integrate_vec” so it can be parallelized.

Hope this helps,
Mat

Thanks Mat,

To clarify the second point: I think I understand. I see the following from -Minfo=accel:

integrate_vec:
     20, Generating NVIDIA GPU code
         26, !$acc loop vector ! threadidx%x
             Generating implicit reduction(+:y$r)
             Vector barrier inserted for vector loop reduction
     26, Loop is parallelizable
...
test_vec:
    142, Generating enter data create(b(:),c(:),a(:))
    143, Generating present(a(:),b(:),c(:))
    144, Loop is parallelizable
         Generating NVIDIA GPU code
        144, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    145, Generating present(a(:),c(:),b(:))
         Loop is parallelizable
         Generating NVIDIA GPU code
        145, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
    146, Loop is parallelizable
         Generating NVIDIA GPU code
        146, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
    147, Accelerator serial kernel generated
         Generating NVIDIA GPU code
    149, Generating exit data delete(c(:),b(:),a(:))

I saw the reduction in integrate_vec but did not notice or appreciate the serial kernel on line 147. What I really want is to “use all threads” in integrate_vec, but it doesn’t look like this can be done. Is this a limitation of the routine construct?

The CUDA kernel launch controls the schedule (i.e. the number of blocks/gangs and threads/vectors). In OpenACC this corresponds to the compute constructs, “kernels” and “parallel”.

With “kernels”, the compiler needs to find the parallelization itself. That’s why “kernels” works with array syntax, but here the parallelism is hidden away in the routine, so the compiler doesn’t know what to parallelize.

Now you can fake it by adding a DO loop with a trip count of one; then there will be something in the kernels construct that can be parallelized, but you’d still only be using 32 threads.
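A sketch of that trip-count-of-one workaround, using the names from the reproducer above (I have not compiled this; `j` is assumed to be an integer declared in the caller):

```fortran
! Dummy single-trip loop: gives "kernels" a loop it can map to the GPU,
! so the call to the vector routine runs with a 32-wide vector instead
! of in a fully serial kernel.
!$acc kernels present(c)
do j = 1, 1
  y = integrate_vec(c, dx, n)
end do
!$acc end kernels
```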

Alternatively, you could put the call in a “parallel” region, where you can express what parallelism to use, but again, there will only be 32 threads.
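For the “parallel” variant, a minimal sketch (again untested, with `c`, `dx`, `n`, and `y` as in the reproducer):

```fortran
! Explicit "parallel" region: the schedule is stated up front, but one
! gang with a 32-wide vector still limits the reduction inside the
! routine to 32 threads.
!$acc parallel present(c) num_gangs(1) vector_length(32)
y = integrate_vec(c, dx, n)
!$acc end parallel
```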

You’re much better off moving the kernels construct into the routine so it can parallelize across all million elements.
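Concretely, the restructured routine might look like the following sketch (my own rewrite of `integrate_vec` from the reproducer, not compiled or timed). The compute construct now lives inside the function, so the reduction loop can be scheduled across gangs and vectors; the endpoint correction is folded into the loop as a 0.5 weight to avoid touching device data from the host:

```fortran
real(kind=8) function integrate_acc(y, dx, n)
  ! Note: no "!$acc routine" directive. This is host code that launches
  ! its own kernel, so it must be called outside any compute construct.
  implicit none
  real(kind=8), intent(in) :: dx, y(:)
  integer, intent(in) :: n
  real(kind=8) :: s
  integer :: i

  s = 0.0d0
  ! Explicit loop with a reduction clause: parallelizes over all n
  ! elements instead of a single 32-wide vector.
  !$acc parallel loop present(y) reduction(+:s)
  do i = 1, n
    if (i == 1 .or. i == n) then
      s = s + 0.5d0 * y(i)   ! trapezoidal endpoint weight
    else
      s = s + y(i)
    end if
  end do
  integrate_acc = s * dx
end function integrate_acc
```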

If you set the environment variable NV_ACC_TIME=1, you’ll get a quick profile from the OpenACC runtime. On my system, adding the DO loop or using “parallel” takes about 6400 ms since there are only 32 threads, while moving the kernels construct into the routine drops the time to 12 ms given it’s now using about a million threads.
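For example, assuming the compiled binary is named ./test (a hypothetical name):

```shell
# One-off profile from the OpenACC runtime: per-region launch counts,
# kernel times, and data-transfer times are printed at program exit.
NV_ACC_TIME=1 ./test
```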

The other option is to use the flag “-Minline=reshape” so that integrate_vec gets inlined into the kernels region and the compiler can “see” the parallelization. But this is no different from your manually inlined version, and inlining can be fickle, so I don’t like relying on it.
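A sketch of the build line, assuming the source file is named test.f90 (hypothetical); the -Minfo=inline suboption reports which calls were actually inlined so you can verify it happened:

```shell
# -Minline=reshape permits inlining of routines whose array arguments
# require reshaping; -Minfo=accel,inline reports what the compiler did.
nvfortran -acc -Minfo=accel,inline -Minline=reshape -o test test.f90
```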

Thanks. I was hoping not to do that, since the call is surrounded by other kernels operations; that would mean I have to end the kernels region, make the function call, then start a new kernels region. It’s not the worst thing, but I was hoping the routine directive could be more transparent.