minloc, maxloc

jand · October 28, 2012, 8:02am

Hi,

I read that minloc and maxloc are not supported in CUDA Fortran. Is there an update on this? Are there any plans to include these intrinsics?

Thanks, Jan

MatColgrove · October 29, 2012, 5:01pm

Hi Jand,

No, sorry, there has been no progress on TPR#17664. I pinged our engineers again. In the meantime, you can add a device routine:

module mtests
contains
  ! Device-side replacement for maxloc on a 1-D integer array.
  attributes(device) function mymaxloc(x)
    integer :: x(:)
    integer :: mymaxloc
    integer :: i, n, imax, imaxval
    n = size(x)
    imax = 1
    imaxval = x(1)
    do i = 2, n
      if (x(i) .gt. imaxval) then
        imaxval = x(i)
        imax = i
      end if
    end do
    mymaxloc = imax
    return
  end function

  ! Each thread stores the location of the maximum of ax in a(i).
  attributes(global) subroutine testany( a )
    integer, device :: a(*)
    integer :: ax(4)
    integer :: i
    ax(1) = 4
    ax(2) = 6
    ax(3) = 3
    ax(4) = 2
    i = threadidx%x
    a(i) = mymaxloc(ax)
    return
  end subroutine
end module mtests

program t
  use mtests
  integer, allocatable, device :: n(:)
  integer :: m(5)
  integer :: i
  allocate(n(5))
  n = 0
  call testany<<<1,5>>> (n)
  m = n
  do i = 1, 5
    print *,i,m(i)
  end do
end

Mat

TheMatt · April 17, 2013, 5:37pm

Mat,

Any news on maxloc? I’m trying to translate some of our CUDA Fortran code into easier-to-manage OpenACC code, but one chunk of it has a maxloc call.

With CUDA Fortran I was able to hack in an ugly, ifdef gpu function to do this, but I don’t think you can call CUDA Fortran from inside an accelerator region, right?

Matt

MatColgrove · April 17, 2013, 6:20pm

Hi Matt,

We have an engineer that’s working on it right now. He expects a few more days of work and then testing. Hoping for 13.5 if all goes well.

Mat

b.hardy · February 19, 2025, 4:36pm

Hi @MatColgrove
This discussion is quite old. Has there been any progress on supporting minloc and maxloc on the device?
I still get this error with nvfortran 25.1-0:

Call to NVHPC runtime function not supported - pghpf_minloc_i8

Thanks,
Baptiste

MatColgrove · February 19, 2025, 5:58pm

Hi Baptiste,

We added support for these back in the 13.5 release, so I suspect the issue is contextual. We need to inline these into the device code so the runtime routine isn’t called. There’s likely something preventing that.

Do you have a reproducing example so I can take a look?

Thanks,
Mat

b.hardy · February 24, 2025, 4:31pm

Hi @MatColgrove,

I apologize for the delayed reply. I am actually facing two different issues.

The first one is related to the use of minloc. Here is a small program reproducing the error I mention above:

module mod_tracki 
 
 implicit none 
 public track_particle
 contains
 subroutine track_particle(x,uf,ng,xp,np)   
   implicit none 
   real(kind=8), intent(in), dimension(0:) :: x, uf
   real(kind=8), intent(in), dimension(:)  :: xp   ! indexed 1..np in the loop below
   integer, intent(in) :: ng, np
   integer :: p
   real(kind=8), allocatable, dimension(:) :: ufp 
   allocate(ufp(np)) 
   print *, 'Tracking particles ...'
   !$acc enter data create(ufp) 
   !$acc parallel loop default(present)
   do p=1,np
     call interp_vel(x,uf,ng,xp(p),ufp(p))
     ! use ufp(p) to compute other stuff 
   end do
   !$acc exit data copyout(ufp)
   print *, 'xp(1) = ', xp(1), ' ufp(1) = ', ufp(1)
 end subroutine
 !
 subroutine interp_vel(x,vel,ng,xp,intp_vel)
    !$acc routine seq
    implicit none
    real(kind=8), dimension(0:), intent(in) :: x,vel
    integer, intent(in) :: ng
    real(kind=8), intent(in)  :: xp
    real(kind=8), intent(out) :: intp_vel
    integer :: ip, i0, i1
    real(kind=8) :: del_x

    ip = minloc(abs(x(1:ng)-xp),dim=1)
    !ip = myminloc(abs(x(1:ng)-xp)) 
    if(xp > x(ip)) then
      i0 = ip
      i1 = ip+1
    else
      i0 = ip-1
      i1 = ip
    end if
    del_x    = (xp-x(i0))/(x(i1)-x(i0)) 
    intp_vel = (1-del_x)*vel(i0) + del_x*vel(i1)
    ! in the actual code, do some long stuff to calculate intp_vel
    ! --> not convenient to inline in track_particle
 end subroutine
 !
 function myminloc(x)
    !$acc routine seq
    implicit none
    real(kind=8), intent(in) :: x(:)
    integer :: i, imin, n, myminloc
    real(kind=8) :: xmin
    n = size(x)
    imin = 1
    xmin = x(1)
    do i = 2, n
      if (x(i) < xmin) then
        xmin = x(i)
        imin = i
      end if
    end do
    myminloc = imin 
 end function
end module

program test
  use mod_tracki
  integer :: np, ng
  real(kind=8), allocatable, dimension(:) :: x,uf
  real(kind=8), allocatable, dimension(:) :: xp, rn
  real(kind=8) :: dx
  integer(4), allocatable, dimension(:) :: seed
  integer :: i  
  np = 1000
  ng = 128
  allocate(x(0:ng+1),uf(0:ng+1))
  allocate(xp(np),rn(np))
  allocate(seed(128))
  seed(:) = 165489632
  !
  dx      = 1./ng
  x(0)    =  -.5*dx 
  x(ng+1) = 1+.5*dx
  do i=1,ng
    x(i) =x(i-1)+dx
    uf(i)=4*x(i)*(1-x(i)) ! some parabolic profile
  end do
  call random_seed(put=seed)
  call random_number(rn)
  xp(:) = rn(:)
  !$acc enter data copyin(ng,np,xp(:),x(:),uf(:)) 
  call track_particle(x,uf,ng,xp,np)
  print *, '... done !'
end program

When I call my own myminloc function instead, the code runs correctly.

Here comes my second problem: as I increase np (let’s say to 100000), I get an "Illegal address during kernel execution" error. I presume this is an out-of-memory error on the device, which I don’t understand, because all my arrays should be shared and no private copies should be made for the different threads.

Thank you very much for your help,

Baptiste

MatColgrove · February 24, 2025, 5:55pm

Thanks Baptiste,

The problem is that you’re calling minloc from a device subroutine, so the compiler is not able to inline it. If you move the call into the outer parallel region, it will work as expected.
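A minimal sketch of that suggestion, with minloc evaluated directly in the body of the parallel loop instead of inside an "acc routine seq" subroutine. The program name, the array sizes, and the sample values are illustrative assumptions and are not taken from the reproducer above. Whether the intrinsic is inlined on the device still depends on the compiler version and flags.

```fortran
! Sketch only: names and sizes are hypothetical.
! minloc is called in the parallel loop body, not in a device subroutine.
program minloc_in_loop
  implicit none
  integer, parameter :: ng = 8, np = 4
  real(kind=8) :: x(ng), xp(np)
  integer :: ip(np), i, p

  ! Cell centres of a uniform grid on [0,1]
  do i = 1, ng
    x(i) = (i - 0.5d0) / ng
  end do
  xp = [0.05d0, 0.30d0, 0.62d0, 0.99d0]

  !$acc parallel loop copyin(x, xp) copyout(ip)
  do p = 1, np
    ! Index of the grid point closest to particle p
    ip(p) = minloc(abs(x(1:ng) - xp(p)), dim=1)
  end do

  print *, ip   ! nearest grid indices: 1 3 5 8
end program
```

Without OpenACC enabled, the directive is treated as a comment and the loop runs on the host, which is a convenient way to check the indices first.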

I get an Illegal address during kernel execution error,

It’s a heap overflow.

Passing “abs(x(1:ng)-xp)” requires the compiler to allocate a temporary array to store the result. However, the default heap size on the device is rather small, so you exceed it as np gets large.

The workaround is to set the environment variable “NV_ACC_CUDA_HEAPSIZE” to a larger value. I set it to 256MB and it passes.

% a.out
 Tracking particles ...
Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
 File: /local/home/mcolgrove/test.F90
 Function: track_particle:6
 Line: 15

% setenv NV_ACC_CUDA_HEAPSIZE 256MB
% a.out
 Tracking particles ...
 xp(1) =    0.4558030910257003       ufp(1) =    0.9921541864783876
 ... done !

Note that in general it’s best to avoid device-side allocation. It works, but besides the small default heap size, it can hurt performance. Hence another option is to create another array on the host with the same size as “x”, add it to a “private” clause on the parallel loop so each thread gets its own copy, and pass it into the subroutine. Then, in the subroutine, set “xtmp = abs(x(1:ng)-xp)” and pass xtmp into your myminloc.
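The private scratch-array approach described above might look roughly like the sketch below. The module name mod_nearest, the subroutine nearest_index, and the small test sizes are hypothetical and chosen only for illustration. The myminloc function follows the one posted earlier in the thread.

```fortran
! Sketch only: mod_nearest and nearest_index are hypothetical names.
module mod_nearest
  implicit none
contains
  subroutine nearest_index(x, ng, xp, xtmp, ip)
    !$acc routine seq
    real(kind=8), intent(in)    :: x(0:)
    integer,      intent(in)    :: ng
    real(kind=8), intent(in)    :: xp
    real(kind=8), intent(inout) :: xtmp(0:)  ! caller-provided scratch space
    integer,      intent(out)   :: ip
    ! Fill the scratch array in place, so no device-side temporary is needed
    xtmp(1:ng) = abs(x(1:ng) - xp)
    ip = myminloc(xtmp(1:ng))
  end subroutine

  function myminloc(v) result(imin)
    !$acc routine seq
    real(kind=8), intent(in) :: v(:)
    integer :: imin, i
    imin = 1
    do i = 2, size(v)
      if (v(i) < v(imin)) imin = i
    end do
  end function
end module

program private_scratch
  use mod_nearest
  implicit none
  integer, parameter :: ng = 8, np = 4
  real(kind=8), allocatable :: x(:), xtmp(:), xp(:)
  integer :: ip(np), i, p

  allocate(x(0:ng+1), xtmp(0:ng+1), xp(np))
  do i = 0, ng + 1
    x(i) = (i - 0.5d0) / ng       ! same grid layout as the reproducer
  end do
  xp = [0.05d0, 0.30d0, 0.62d0, 0.99d0]

  ! private(xtmp) gives each thread its own scratch copy
  !$acc parallel loop copyin(x, xp) copyout(ip) private(xtmp)
  do p = 1, np
    call nearest_index(x, ng, xp(p), xtmp, ip(p))
  end do

  print *, ip   ! nearest grid indices: 1 3 5 8
end program
```

Because xtmp is allocated once on the host and privatized by the directive, the kernel no longer depends on the device heap size.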

-Mat

b.hardy · February 25, 2025, 9:05am

It works great with the private xtmp strategy and myminloc, thank you very much!
