minloc, maxloc

Hi,


I read that minloc and maxloc are not supported in CUDA Fortran. Is there an update on this? Are there any plans to include these intrinsics?

Thanks, Jan

Hi Jand,

No, sorry, there has been no progress on TPR#17664. I pinged our engineers again. In the meantime, you can add a device routine:

module mtests
contains
  ! Serial device-side replacement for maxloc on a rank-1 integer array
  attributes(device) function mymaxloc(x)
    integer :: x(:)
    integer :: mymaxloc
    integer :: i, n, imax, imaxval
    n = size(x)
    imax = 1
    imaxval = x(1)
    do i = 2, n
      if (x(i) .gt. imaxval) then
        imaxval = x(i)
        imax = i
      end if
    end do
    mymaxloc = imax
    return
  end function

  ! Each thread stores the location of the largest element of ax
  attributes(global) subroutine testany( a )
    integer, device :: a(*)
    integer :: ax(4)
    integer :: i
    ax(1) = 4
    ax(2) = 6
    ax(3) = 3
    ax(4) = 2
    i = threadidx%x
    a(i) = mymaxloc(ax)
    return
  end subroutine
end module mtests

program t
  use mtests
  integer, allocatable, device :: n(:)
  integer :: m(5)
  integer :: i
  allocate(n(5))
  n = 0
  call testany<<<1,5>>> (n)
  m = n
  do i = 1, 5
    print *, i, m(i)
  end do
end
-Mat

Mat,

Any news on maxloc? I’m working on translating some of our CUDA Fortran code into easier-to-manage OpenACC code, but one chunk of it has a maxloc call.

With CUDA Fortran I was able to hack in an ugly, ifdef gpu function to do this, but I don’t think you can call CUDA Fortran from inside an accelerator region, right?

Matt

Hi Matt,

We have an engineer who’s working on it right now. He expects a few more days of work and then testing. We’re hoping for 13.5 if all goes well.

-Mat

Hi @MatColgrove
This discussion is quite old. Has there been any progress on supporting minloc and maxloc on the device?
I still get this error with nvfortran 25.1-0:

Call to NVHPC runtime function not supported - pghpf_minloc_i8

Thanks,
Baptiste

Hi Baptiste,

We added support for these back in the 13.5 release, so I suspect the issue is contextual. We need to inline these into the device code so the runtime routine isn’t called. There’s likely something preventing that.

Do you have a reproducing example so I can take a look?

Thanks,
Mat

Hi @MatColgrove,

I apologize for the delayed reply. I am actually facing two different issues.

The first one is related to the use of minloc. Here is a small program that reproduces the error I mentioned above:

module mod_tracki 
 
 implicit none 
 public track_particle
 contains
 subroutine track_particle(x,uf,ng,xp,np)   
   implicit none 
   real(kind=8), intent(in), dimension(0:) :: x, uf, xp
   integer, intent(in) :: ng, np
   integer :: p
   real(kind=8), allocatable, dimension(:) :: ufp 
   allocate(ufp(np)) 
   print *, 'Tracking particles ...'
   !$acc enter data create(ufp) 
   !$acc parallel loop default(present)
   do p=1,np
     call interp_vel(x,uf,ng,xp(p),ufp(p))
     ! use ufp(p) to compute other stuff 
   end do
   !$acc exit data copyout(ufp)
   print *, 'xp(1) = ', xp(1), ' ufp(1) = ', ufp(1)
 end subroutine
 !
 subroutine interp_vel(x,vel,ng,xp,intp_vel)
    !$acc routine seq
    implicit none
    real(kind=8), dimension(0:), intent(in) :: x,vel
    integer, intent(in) :: ng
    real(kind=8), intent(in)  :: xp
    real(kind=8), intent(out) :: intp_vel
    integer :: ip, i0, i1
    real(kind=8) :: del_x

    ip = minloc(abs(x(1:ng)-xp),dim=1)
    !ip = myminloc(abs(x(1:ng)-xp)) 
    if(xp > x(ip)) then
      i0 = ip
      i1 = ip+1
    else
      i0 = ip-1
      i1 = ip
    end if
    del_x    = (xp-x(i0))/(x(i1)-x(i0)) 
    intp_vel = (1-del_x)*vel(i0) + del_x*vel(i1)
    ! in the actual code, do some long stuff to calculate intp_vel
    ! --> not convenient to inline in track_particle
 end subroutine
 !
 function myminloc(x)
    !$acc routine seq
    implicit none
    real(kind=8), intent(in) :: x(:)
    integer :: i, imin, n, myminloc
    real(kind=8) :: xmin
    n = size(x)
    imin = 1
    xmin = x(1)
    do i = 2, n
      if (x(i) < xmin) then
        xmin = x(i)
        imin = i
      end if
    end do
    myminloc = imin 
 end function
end module

program test
  use mod_tracki
  integer :: np, ng
  real(kind=8), allocatable, dimension(:) :: x,uf
  real(kind=8), allocatable, dimension(:) :: xp, rn
  real(kind=8) :: dx
  integer(4), allocatable, dimension(:) :: seed
  integer :: i  
  np = 1000
  ng = 128
  allocate(x(0:ng+1),uf(0:ng+1))
  allocate(xp(np),rn(np))
  allocate(seed(128))
  seed(:) = 165489632
  !
  dx      = 1./ng
  x(0)    =  -.5*dx 
  x(ng+1) = 1+.5*dx
  do i=1,ng
    x(i) =x(i-1)+dx
    uf(i)=4*x(i)*(1-x(i)) ! some parabolic profile
  end do
  call random_seed(put=seed)
  call random_number(rn)
  xp(:) = rn(:)
  !$acc enter data copyin(ng,np,xp(:),x(:),uf(:)) 
  call track_particle(x,uf,ng,xp,np)
  print *, '... done !'
end program 

When I call my own myminloc function instead, the code runs correctly.

Here is my second problem: as I increase np (say, to 100000), I get an Illegal address during kernel execution error. I presume this is an out-of-memory error on the device, which I don’t understand, because all my arrays should be shared and no private copies should be made for the different threads.

Thank you very much for your help,

Baptiste


Thanks Baptiste,

The problem is that you’re calling minloc from within a device subroutine, so the compiler isn’t able to inline it. If you move the call into the outer parallel region, it will work as expected.
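
As a rough illustration of that suggestion, here is a minimal sketch with the minloc call placed directly in the body of the parallel loop rather than inside an `!$acc routine seq` subroutine. The program name, the small grid, and the sample particle positions are made up for the example, and whether the intrinsic is actually inlined may depend on the compiler version and surrounding code, so treat it as a starting point rather than a verified fix.

```fortran
! Hypothetical stand-alone example; names and sizes are illustrative only.
program minloc_in_loop
  implicit none
  integer, parameter :: ng = 8, np = 4
  real(kind=8) :: x(0:ng+1), xp(np)
  integer :: ip(np), i, p

  ! Cell-centred grid on [0,1] with one ghost point on each side
  do i = 0, ng+1
    x(i) = (i - 0.5d0) / ng
  end do
  xp = [0.1d0, 0.4d0, 0.6d0, 0.9d0]

  ! minloc is called in the parallel region itself, not in a device routine
  !$acc parallel loop copyin(x, xp) copyout(ip)
  do p = 1, np
    ip(p) = minloc(abs(x(1:ng) - xp(p)), dim=1)
  end do

  print *, ip
end program
```

Without OpenACC enabled, the directive is treated as a comment and the same source runs serially on the host, which is a convenient way to check the indices.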

“I get an Illegal address during kernel execution error”

It’s a heap overflow.

Passing “abs(x(1:ng)-xp)” requires the compiler to allocate a temp array to store the results. However, the default heap size on the device is rather small, so you exceed it as np gets large.

The workaround is to set the environment variable “NV_ACC_CUDA_HEAPSIZE” to a larger value. I set it to 256MB and it passes.

% a.out
 Tracking particles ...
Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
 File: /local/home/mcolgrove/test.F90
 Function: track_particle:6
 Line: 15

% setenv NV_ACC_CUDA_HEAPSIZE 256MB
% a.out
 Tracking particles ...
 xp(1) =    0.4558030910257003       ufp(1) =    0.9921541864783876
 ... done !

Note that in general it’s best to avoid device-side allocation. It works, but besides the small default heap size, it can hurt performance. Hence another option is to create another array on the host with the same size as “x”, add it to a “private” clause on the parallel loop so each thread gets a copy, and then pass it into the subroutine. Then in the subroutine set “xtmp = abs(x(1:ng)-xp)” and finally pass xtmp into your myminloc.
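
A minimal sketch of that private scratch-array approach might look like the following. The module and routine names (`mod_xtmp_sketch`, `nearest_point`), the small grid, and the sample positions are assumptions made for the example; only the idea of a private `xtmp` filled inside the subroutine and handed to a hand-written `myminloc` comes from the description above.

```fortran
! Hypothetical names and sizes throughout; a sketch, not the original code.
module mod_xtmp_sketch
  implicit none
contains
  ! Serial argmin over a rank-1 array (same idea as myminloc in the thread)
  function myminloc(x) result(imin)
    !$acc routine seq
    real(kind=8), intent(in) :: x(:)
    integer :: imin, i
    imin = 1
    do i = 2, size(x)
      if (x(i) < x(imin)) imin = i
    end do
  end function

  ! Finds the grid point nearest to xp using a caller-provided scratch
  ! array, so no temporary has to be allocated on the device heap.
  subroutine nearest_point(x, ng, xp, xtmp, ip)
    !$acc routine seq
    real(kind=8), intent(in)    :: x(0:)
    integer,      intent(in)    :: ng
    real(kind=8), intent(in)    :: xp
    real(kind=8), intent(inout) :: xtmp(:)
    integer,      intent(out)   :: ip
    xtmp(1:ng) = abs(x(1:ng) - xp)
    ip = myminloc(xtmp(1:ng))
  end subroutine
end module

program sketch
  use mod_xtmp_sketch
  implicit none
  integer, parameter :: ng = 8, np = 4
  real(kind=8) :: x(0:ng+1), xp(np), xtmp(ng)
  integer :: ip(np), i, p

  do i = 0, ng+1
    x(i) = (i - 0.5d0) / ng
  end do
  xp = [0.1d0, 0.4d0, 0.6d0, 0.9d0]

  ! private(xtmp) gives every thread its own copy of the scratch array
  !$acc parallel loop copyin(x, xp) copyout(ip) private(xtmp)
  do p = 1, np
    call nearest_point(x, ng, xp(p), xtmp, ip(p))
  end do

  print *, ip
end program
```

For these sample values the nearest grid points are 1, 4, 5 and 8. The scratch array is declared once on the host with a fixed size, so its memory use per thread is known up front and does not depend on the device heap setting.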

-Mat


It works great with the private xtmp strategy and myminloc. Thank you very much!