I read that minloc and maxloc are not supported in CUDA Fortran. Is there an update on this? Are there any plans to include these intrinsics?
Thanks, Jan
Hi Jand,
No, sorry, there has been no progress on TPR#17664. I pinged our engineers again. In the meantime, you can add a device routine:
module mtests
contains
attributes(device) function mymaxloc(x)
integer x(:)
integer :: mymaxloc, n, i, imax, imaxval
n = size(x)
imax = 1
imaxval = x(1)
do i = 2, n
if (x(i) .gt. imaxval) then
imaxval = x(i)
imax = i
end if
end do
mymaxloc = imax
end function
attributes(global) subroutine testany( a )
integer, device :: a(*)
integer ax(4)
ax(1) = 4
ax(2) = 6
ax(3) = 3
ax(4) = 2
i = threadidx%x
a(i) = mymaxloc(ax)
end subroutine
end module mtests
program t
use mtests
integer, allocatable, device :: n(:)
integer m(5),k(5)
allocate(n(5))
n = 0
call testany<<<1,5>>> (n)
m = n
do i = 1, 5
print *,i,m(i)
end do
end program t
Any news on maxloc? I’m trying to translate some of our CUDA Fortran code into easier-to-manage OpenACC code, but one chunk of it has a maxloc call.
With CUDA Fortran I was able to hack in an ugly, ifdef gpu function to do this, but I don’t think you can call CUDA Fortran from inside an accelerator region, right?
Hi Matt,
We have an engineer that’s working on it right now. He expects a few more days of work and then testing. Hoping for 13.5 if all goes well.
Hi @MatColgrove
This discussion is quite old, has there been any progress regarding the support of minloc and maxloc on the device?
I still get this error with nvfortran 25.1-0:
Call to NVHPC runtime function not supported - pghpf_minloc_i8
Hi Baptiste,
We added support for these back in the 13.5 release, so I suspect the issue is contextual. We need to inline these into the device code so the runtime routine isn’t called. There’s likely something preventing that.
Do you have a reproducing example so I can take a look?
Hi @MatColgrove,
I apologize for the delayed reply. I am actually facing two different issues.
The first one is related to the use of minloc. Here is a small program reproducing the error I mentioned above:
module mod_tracki
implicit none
public :: track_particle
contains
subroutine track_particle(x,uf,ng,xp,np)
implicit none
real(kind=8), intent(in), dimension(0:) :: x, uf, xp
integer, intent(in) :: ng, np
integer :: p
real(kind=8), allocatable, dimension(:) :: ufp
print *, 'Tracking particles ...'
allocate(ufp(np))
!$acc enter data create(ufp)
!$acc parallel loop default(present)
do p=1,np
call interp_vel(x,uf,ng,xp(p),ufp(p))
! use ufp(p) to compute other stuff
end do
!$acc exit data copyout(ufp)
print *, 'xp(1) = ', xp(1), ' ufp(1) = ', ufp(1)
end subroutine
subroutine interp_vel(x,vel,ng,xp,intp_vel)
!$acc routine seq
implicit none
real(kind=8), dimension(0:), intent(in) :: x,vel
integer, intent(in) :: ng
real(kind=8), intent(in) :: xp
real(kind=8), intent(out) :: intp_vel
integer :: ip, i0, i1
real(kind=8) :: del_x
ip = minloc(abs(x(1:ng)-xp),dim=1)
!ip = myminloc(abs(x(1:ng)-xp))
if(xp > x(ip)) then
i0 = ip
i1 = ip+1
else
i0 = ip-1
i1 = ip
end if
del_x = (xp-x(i0))/(x(i1)-x(i0))
intp_vel = (1-del_x)*vel(i0) + del_x*vel(i1)
! in the actual code, do some long stuff to calculate intp_vel
! --> not convenient to inline in track_particle
end subroutine
function myminloc(x)
!$acc routine seq
implicit none
real(kind=8), intent(in) :: x(:)
integer :: i, imin, n, myminloc
real(kind=8) :: xmin
n = size(x)
imin = 1
xmin = x(1)
do i = 2, n
if (x(i) < xmin) then
xmin = x(i)
imin = i
end if
end do
myminloc = imin
end function
end module
program test
use mod_tracki
integer :: np, ng
real(kind=8), allocatable, dimension(:) :: x,uf
real(kind=8), allocatable, dimension(:) :: xp, rn
real(kind=8) :: dx
integer(4), allocatable, dimension(:) :: seed
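integer :: i, nseed
np = 1000
ng = 128
! Bounds are inferred from how each array is indexed below
call random_seed(size=nseed)
allocate(seed(nseed))
allocate(x(0:ng+1), uf(0:ng+1), xp(0:np), rn(0:np))
seed(:) = 165489632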
dx = 1./ng
x(0) = -.5*dx
x(ng+1) = 1+.5*dx
do i=1,ng
x(i) =x(i-1)+dx
uf(i)=4*x(i)*(1-x(i)) ! some parabolic profile
end do
call random_seed(put=seed)
call random_number(rn)
xp(:) = rn(:)
!$acc enter data copyin(ng,np,xp(:),x(:),uf(:))
call track_particle(x,uf,ng,xp,np)
print *, '... done !'
end program
When I call my own myminloc function instead, the code runs correctly.
Here comes my second problem: as I increase np (let’s say to 100000), I get an “Illegal address during kernel execution” error. I presume this is an out-of-memory error on the device, which I don’t understand, because all my arrays should be shared and no private copies should be made for the different threads.
Thank you very much for your help,
Thanks Baptiste,
The problem is that you’re calling minloc from within a device subroutine, so the compiler isn’t able to inline it. If you move the call into the outer parallel region, it will work as expected.
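A minimal sketch of that restructuring, assuming the nearest-point index is computed in the parallel loop and handed to the device routine. The routine name interp_vel_at, the small ng and np values, and the evenly spaced xp are hypothetical and only serve to keep the example self-contained. The sketch does not show whether the temporary for abs(...) is avoided.

```fortran
module interp_sketch
  implicit none
contains
  ! Hypothetical variant of interp_vel: the index ip is passed in,
  ! so the device routine no longer calls minloc itself.
  subroutine interp_vel_at(x, vel, ip, xp, intp_vel)
    !$acc routine seq
    real(kind=8), dimension(0:), intent(in) :: x, vel
    integer, intent(in) :: ip
    real(kind=8), intent(in) :: xp
    real(kind=8), intent(out) :: intp_vel
    integer :: i0, i1
    real(kind=8) :: del_x
    if (xp > x(ip)) then
      i0 = ip
      i1 = ip + 1
    else
      i0 = ip - 1
      i1 = ip
    end if
    del_x = (xp - x(i0)) / (x(i1) - x(i0))
    intp_vel = (1 - del_x)*vel(i0) + del_x*vel(i1)
  end subroutine
end module

program minloc_in_loop
  use interp_sketch
  implicit none
  integer, parameter :: ng = 8, np = 4   ! small sizes, for illustration only
  real(kind=8) :: x(0:ng+1), uf(0:ng+1), xp(np), ufp(np)
  real(kind=8) :: dx
  integer :: i, p, ip
  dx = 1.0d0/ng
  do i = 0, ng + 1
    x(i) = (i - 0.5d0)*dx            ! cell centres plus one ghost point per side
    uf(i) = 4*x(i)*(1 - x(i))        ! parabolic profile, as in the reproducer
  end do
  do p = 1, np
    xp(p) = (p - 0.5d0)/np           ! evenly spaced particles (assumption)
  end do
  !$acc parallel loop private(ip) copyin(x, uf, xp) copyout(ufp)
  do p = 1, np
    ! minloc now sits directly in the parallel region
    ip = minloc(abs(x(1:ng) - xp(p)), dim=1)
    call interp_vel_at(x, uf, ip, xp(p), ufp(p))
  end do
  do p = 1, np
    print *, xp(p), ufp(p)
  end do
end program
```

Built without OpenACC enabled, the directives are treated as comments and the loop runs on the host, which gives a quick way to check the interpolation logic.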
“I get an Illegal address during kernel execution”
It’s a heap overflow.
Passing “abs(x(1:ng)-xp)” requires the compiler to allocate a temp array to store the results. However, the default heap size on the device is rather small, so you exceed it as np gets large.
The workaround is to set the environment variable “NV_ACC_CUDA_HEAPSIZE” to a larger value. I set it to 256MB and it passes. Below, the first run uses the default heap size and the second uses the larger one:
% a.out
Tracking particles ...
Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
File: /local/home/mcolgrove/test.F90
Function: track_particle:6
Line: 15
% a.out
Tracking particles ...
xp(1) = 0.4558030910257003 ufp(1) = 0.9921541864783876
... done !
Note that in general it’s best to avoid device-side allocation. It works, but besides the small default heap size, it can hurt performance. Hence another option is to create another array on the host with the same size as “x”, add it to a “private” clause on the parallel loop so each thread gets a copy, and then pass it into the subroutine. Then in the subroutine, set “xtmp = abs(x(1:ng)-xp)” and finally pass xtmp into your myminloc.
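The private scratch-array approach could look roughly like the sketch below. It assumes xtmp is sized ng (only the first ng entries are used), that interp_vel gains an extra xtmp argument, and that the small ng and np values and evenly spaced xp stand in for the real data. These are illustrative choices, not the original code.

```fortran
module xtmp_sketch
  implicit none
contains
  ! interp_vel with a caller-provided scratch array (assumed signature)
  subroutine interp_vel(x, vel, ng, xp, xtmp, intp_vel)
    !$acc routine seq
    real(kind=8), dimension(0:), intent(in) :: x, vel
    integer, intent(in) :: ng
    real(kind=8), intent(in) :: xp
    real(kind=8), dimension(:), intent(inout) :: xtmp
    real(kind=8), intent(out) :: intp_vel
    integer :: ip, i0, i1
    real(kind=8) :: del_x
    ! Store the distances in the scratch array rather than in a
    ! compiler-generated temporary on the device heap
    xtmp(1:ng) = abs(x(1:ng) - xp)
    ip = myminloc(xtmp(1:ng))
    if (xp > x(ip)) then
      i0 = ip
      i1 = ip + 1
    else
      i0 = ip - 1
      i1 = ip
    end if
    del_x = (xp - x(i0)) / (x(i1) - x(i0))
    intp_vel = (1 - del_x)*vel(i0) + del_x*vel(i1)
  end subroutine

  ! Hand-written minloc replacement from the thread
  function myminloc(x)
    !$acc routine seq
    real(kind=8), intent(in) :: x(:)
    integer :: i, imin, myminloc
    real(kind=8) :: xmin
    imin = 1
    xmin = x(1)
    do i = 2, size(x)
      if (x(i) < xmin) then
        xmin = x(i)
        imin = i
      end if
    end do
    myminloc = imin
  end function
end module

program private_xtmp
  use xtmp_sketch
  implicit none
  integer, parameter :: ng = 8, np = 4   ! small sizes, for illustration only
  real(kind=8) :: x(0:ng+1), uf(0:ng+1), xp(np), ufp(np), xtmp(ng)
  real(kind=8) :: dx
  integer :: i, p
  dx = 1.0d0/ng
  do i = 0, ng + 1
    x(i) = (i - 0.5d0)*dx
    uf(i) = 4*x(i)*(1 - x(i))
  end do
  do p = 1, np
    xp(p) = (p - 0.5d0)/np
  end do
  ! private(xtmp) is intended to give each thread its own scratch copy
  !$acc parallel loop private(xtmp) copyin(x, uf, xp) copyout(ufp)
  do p = 1, np
    call interp_vel(x, uf, ng, xp(p), xtmp, ufp(p))
  end do
  do p = 1, np
    print *, xp(p), ufp(p)
  end do
end program
```

Because the scratch space comes from the private clause, its size is fixed when the kernel is launched and does not depend on the device heap size.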
It works great with the private xtmp strategy and myminloc, thank you very much!