Hi,
I read that minloc and maxloc are not supported in CUDA Fortran. Is there an update on this? Are there any plans to include these intrinsics?
Thanks, Jan
Hi Jand,
No, sorry, there has been no progress on TPR#17664. I pinged our engineers again. In the meantime, you can add a device routine:
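module mtests
  implicit none
contains
  ! Device-side stand-in for maxloc on a rank-1 integer array:
  ! returns the index of the first occurrence of the largest element.
  attributes(device) function mymaxloc(x)
    integer :: x(:)
    integer :: mymaxloc
    integer :: i, n, imax, imaxval
    n = size(x)
    imax = 1
    imaxval = x(1)
    do i = 2, n
      if (x(i) .gt. imaxval) then
        imaxval = x(i)
        imax = i
      end if
    end do
    mymaxloc = imax
    return
  end function
  ! Each thread fills a small local array and stores the location of its maximum in a(i).
  attributes(global) subroutine testany( a )
    integer, device :: a(*)
    integer :: ax(4)
    integer :: i
    ax(1) = 4
    ax(2) = 6
    ax(3) = 3
    ax(4) = 2
    i = threadidx%x
    a(i) = mymaxloc(ax)
    return
  end subroutine
end module mtests
program t
  use mtests
  implicit none
  integer, allocatable, device :: n(:)
  integer :: m(5)
  integer :: i
  allocate(n(5))
  n = 0
  ! Launch one block of five threads, then copy the results back to the host.
  call testany<<<1,5>>> (n)
  m = n
  do i = 1, 5
    print *,i,m(i)
  end do
end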
Mat,
Any news on maxloc? I’m trying to translate some of our CUDA Fortran code into easier-to-manage OpenACC code, but one chunk of it has a maxloc call.
With CUDA Fortran I was able to hack in an ugly, ifdef gpu function to do this, but I don’t think you can call CUDA Fortran from inside an accelerator region, right?
Matt
Hi Matt,
We have an engineer who’s working on it right now. He expects a few more days of work and then testing. We’re hoping for 13.5 if all goes well.
Hi @MatColgrove
This discussion is quite old. Has there been any progress regarding the support of minloc and maxloc on the device?
I still get this error with nvfortran 25.1-0:
Call to NVHPC runtime function not supported - pghpf_minloc_i8
Thanks,
Baptiste
Hi Baptiste,
We added support for these back in the 13.5 release, so I suspect the issue is contextual. We need to inline these into the device code so the runtime routine isn’t called. There’s likely something preventing that.
Do you have a reproducing example so I can take a look?
Thanks,
Mat
Hi @MatColgrove,
I apologize for the delayed reply. I am actually facing two different issues.
The first one is related to the use of minloc. Here is a small program reproducing the error I mention above:
module mod_tracki
  implicit none
  public track_particle
contains
  subroutine track_particle(x,uf,ng,xp,np)
    implicit none
    real(kind=8), intent(in), dimension(0:) :: x, uf, xp
    integer, intent(in) :: ng, np
    integer :: p
    real(kind=8), allocatable, dimension(:) :: ufp
    allocate(ufp(np))
    print *, 'Tracking particles ...'
    !$acc enter data create(ufp)
    !$acc parallel loop default(present)
    do p=1,np
      call interp_vel(x,uf,ng,xp(p),ufp(p))
      ! use ufp(p) to compute other stuff
    end do
    !$acc exit data copyout(ufp)
    print *, 'xp(1) = ', xp(1), ' ufp(1) = ', ufp(1)
  end subroutine
  !
  subroutine interp_vel(x,vel,ng,xp,intp_vel)
    !$acc routine seq
    implicit none
    real(kind=8), dimension(0:), intent(in) :: x,vel
    integer, intent(in) :: ng
    real(kind=8), intent(in) :: xp
    real(kind=8), intent(out) :: intp_vel
    integer :: ip, i0, i1
    real(kind=8) :: del_x
    ip = minloc(abs(x(1:ng)-xp),dim=1)
    !ip = myminloc(abs(x(1:ng)-xp))
    if(xp > x(ip)) then
      i0 = ip
      i1 = ip+1
    else
      i0 = ip-1
      i1 = ip
    end if
    del_x = (xp-x(i0))/(x(i1)-x(i0))
    intp_vel = (1-del_x)*vel(i0) + del_x*vel(i1)
    ! in the actual code, do some long stuff to calculate intp_vel
    ! --> not convenient to inline in track_particle
  end subroutine
  !
  function myminloc(x)
    !$acc routine seq
    implicit none
    real(kind=8), intent(in) :: x(:)
    integer :: i, imin, n, myminloc
    real(kind=8) :: xmin
    n = size(x)
    imin = 1
    xmin = x(1)
    do i = 2, n
      if (x(i) < xmin) then
        xmin = x(i)
        imin = i
      end if
    end do
    myminloc = imin
  end function
end module
program test
  use mod_tracki
  integer :: np, ng
  real(kind=8), allocatable, dimension(:) :: x,uf
  real(kind=8), allocatable, dimension(:) :: xp, rn
  real(kind=8) :: dx
  integer(4), allocatable, dimension(:) :: seed
  integer :: i
  np = 1000
  ng = 128
  allocate(x(0:ng+1),uf(0:ng+1))
  allocate(xp(np),rn(np))
  allocate(seed(128))
  seed(:) = 165489632
  !
  dx = 1./ng
  x(0) = -.5*dx
  x(ng+1) = 1+.5*dx
  do i=1,ng
    x(i) =x(i-1)+dx
    uf(i)=4*x(i)*(1-x(i)) ! some parabolic profile
  end do
  call random_seed(put=seed)
  call random_number(rn)
  xp(:) = rn(:)
  !$acc enter data copyin(ng,np,xp(:),x(:),uf(:))
  call track_particle(x,uf,ng,xp,np)
  print *, '... done !'
end program
When I call my own myminloc function instead, the code runs correctly.
Here comes my second problem: as I increase np (let’s say to 100000), I get an Illegal address during kernel execution error. I presume this is an out-of-memory error on the device, which I don’t understand, because all my arrays should be shared and no private copies should be made for the different threads.
Thank you very much for your help,
Baptiste
Thanks Baptiste,
The problem is that you’re calling minloc from within a device subroutine, so the compiler isn’t able to inline it. If you move the call to the outer parallel region, it will work as expected.
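To make that suggestion concrete, here is a minimal sketch in which minloc is evaluated directly in the body of the parallel loop and only the resulting index is passed down to the "routine seq" subroutine. The names mod_tracki_sketch and interp_vel_at, the explicit-shape dummy arguments, the extra ufp argument, and the data clauses on the loop are assumptions made for this example; they are not taken from the reproducer above.

```fortran
! Sketch only: module and routine names are hypothetical.
module mod_tracki_sketch
  implicit none
contains
  subroutine track_particle(x, uf, ng, xp, np, ufp)
    integer, intent(in) :: ng, np
    real(kind=8), intent(in)  :: x(0:ng+1), uf(0:ng+1), xp(np)
    real(kind=8), intent(out) :: ufp(np)
    integer :: p, ip
    !$acc parallel loop private(ip) copyin(x, uf, xp) copyout(ufp)
    do p = 1, np
      ! minloc appears in the compute region itself, not inside a device routine.
      ! x(1:ng) starts at index 1, so the returned position is also an index into x.
      ip = minloc(abs(x(1:ng) - xp(p)), dim=1)
      call interp_vel_at(x, uf, ng, ip, xp(p), ufp(p))
    end do
  end subroutine

  ! Linear interpolation between the two grid points that bracket xp,
  ! given the index ip of the nearest grid point.
  subroutine interp_vel_at(x, vel, ng, ip, xp, intp_vel)
    !$acc routine seq
    integer, intent(in) :: ng, ip
    real(kind=8), intent(in)  :: x(0:ng+1), vel(0:ng+1), xp
    real(kind=8), intent(out) :: intp_vel
    integer :: i0, i1
    real(kind=8) :: del_x
    if (xp > x(ip)) then
      i0 = ip
      i1 = ip + 1
    else
      i0 = ip - 1
      i1 = ip
    end if
    del_x = (xp - x(i0)) / (x(i1) - x(i0))
    intp_vel = (1 - del_x)*vel(i0) + del_x*vel(i1)
  end subroutine
end module

program demo
  use mod_tracki_sketch
  implicit none
  integer, parameter :: ng = 4, np = 2
  real(kind=8) :: x(0:ng+1), uf(0:ng+1), xp(np), ufp(np)
  integer :: i
  do i = 0, ng + 1
    x(i) = (i - 0.5d0) / ng      ! cell centres plus one ghost point on each side
  end do
  uf = 2*x                       ! linear profile, so interpolation is exact
  xp = [0.25d0, 0.5d0]
  call track_particle(x, uf, ng, xp, np, ufp)
  print *, ufp
end program
```

Because the profile in the driver is linear, each ufp(p) should equal 2*xp(p), which gives a quick sanity check. The array expression passed to minloc may still need a temporary, so the heap-size issue discussed next can still apply.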
"I get an Illegal address during kernel execution error"
It’s a heap overflow.
Passing “abs(x(1:ng)-xp)” requires the compiler to allocate a temp array to store the results. However, the default heap size on the device is rather small, so you’re exceeding it as np gets large.
The workaround is to set the environment variable “NV_ACC_CUDA_HEAPSIZE” to a larger value. I set it to 256MB and it passes.
% a.out
Tracking particles ...
Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700 (CUDA_ERROR_ILLEGAL_ADDRESS): Illegal address during kernel execution
File: /local/home/mcolgrove/test.F90
Function: track_particle:6
Line: 15
% setenv NV_ACC_CUDA_HEAPSIZE 256MB
% a.out
Tracking particles ...
xp(1) = 0.4558030910257003 ufp(1) = 0.9921541864783876
... done !
Note that in general it’s best to avoid device-side allocation. It works, but besides the small default heap size, it can hurt performance. Hence, another option is to create another array on the host with the same size as “x”, add it to a “private” clause on the parallel loop so each thread gets its own copy, and then pass it into the subroutine. Then, in the subroutine, set “xtmp = abs(x(1:ng)-xp)” and finally pass xtmp into your myminloc.
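The private scratch-array approach described above could look roughly like the sketch below. It assumes a scratch array xtmp of length ng (the part of x that is actually searched), explicit-shape dummy arguments, an extra ufp argument, and data clauses on the loop; the module name mod_private_tmp_sketch is hypothetical. The scratch array is filled with an explicit loop, which is equivalent to the array assignment and leaves no room for a compiler-generated temporary.

```fortran
! Sketch only: module name, argument lists, and data clauses are assumptions.
module mod_private_tmp_sketch
  implicit none
contains
  subroutine track_particle(x, uf, ng, xp, np, ufp)
    integer, intent(in) :: ng, np
    real(kind=8), intent(in)  :: x(0:ng+1), uf(0:ng+1), xp(np)
    real(kind=8), intent(out) :: ufp(np)
    real(kind=8) :: xtmp(ng)     ! scratch space declared on the host
    integer :: p
    ! private(xtmp) gives every thread its own copy, so nothing is
    ! allocated from the device heap inside the kernel.
    !$acc parallel loop private(xtmp) copyin(x, uf, xp) copyout(ufp)
    do p = 1, np
      call interp_vel(x, uf, ng, xp(p), ufp(p), xtmp)
    end do
  end subroutine

  subroutine interp_vel(x, vel, ng, xp, intp_vel, xtmp)
    !$acc routine seq
    integer, intent(in) :: ng
    real(kind=8), intent(in)    :: x(0:ng+1), vel(0:ng+1), xp
    real(kind=8), intent(out)   :: intp_vel
    real(kind=8), intent(inout) :: xtmp(ng)
    integer :: i, ip, i0, i1
    real(kind=8) :: del_x
    ! Same values as xtmp = abs(x(1:ng) - xp), written as an explicit loop.
    do i = 1, ng
      xtmp(i) = abs(x(i) - xp)
    end do
    ip = myminloc(xtmp)
    if (xp > x(ip)) then
      i0 = ip
      i1 = ip + 1
    else
      i0 = ip - 1
      i1 = ip
    end if
    del_x = (xp - x(i0)) / (x(i1) - x(i0))
    intp_vel = (1 - del_x)*vel(i0) + del_x*vel(i1)
  end subroutine

  ! Index of the first smallest element, as in the myminloc shown earlier.
  function myminloc(v) result(imin)
    !$acc routine seq
    real(kind=8), intent(in) :: v(:)
    integer :: imin, i
    imin = 1
    do i = 2, size(v)
      if (v(i) < v(imin)) imin = i
    end do
  end function
end module

program demo
  use mod_private_tmp_sketch
  implicit none
  integer, parameter :: ng = 4, np = 2
  real(kind=8) :: x(0:ng+1), uf(0:ng+1), xp(np), ufp(np)
  integer :: i
  do i = 0, ng + 1
    x(i) = (i - 0.5d0) / ng      ! cell centres plus one ghost point on each side
  end do
  uf = 2*x                       ! linear profile, so interpolation is exact
  xp = [0.25d0, 0.5d0]
  call track_particle(x, uf, ng, xp, np, ufp)
  print *, ufp
end program
```

With this layout the per-thread memory cost is one array of ng values, fixed at launch, rather than a heap allocation on every call.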
-Mat
It works great with the private xtmp strategy and myminloc, thank you very much!