Hi,
I have the following code written in Fortran (I use the PGI 18.3 compiler) and the output from the compiler:
c$acc parallel present(unkno,vicop,deltp)
c
c -----loop over the points, computing the allowable timestep
c
c$acc loop reduction(min:dtmin) collapse(2)
do 1200 ipoiz=izmin,izmax
do 1210 ipoiy=iymin,iymax
ipo0z=nxny*(ipoiz-1)
ipo0y= nx*(ipoiy-1)+ipo0z
c
ipii0=ixmin+ipo0y
ipii1=ixmax+ipo0y
c
c$acc loop reduction(min:dtmin)
do 1220 ipoin=ipii0,ipii1
dtadv=coudx/max(epsil,csoun
& +cvelo*sqrt(unkno(2,ipoin)*unkno(2,ipoin)
& +unkno(3,ipoin)*unkno(3,ipoin)
& +unkno(4,ipoin)*unkno(4,ipoin)))
dtvis=couvi/max(epsil,vicop(1,ipoin))
dtcon=couco/max(epsil,vicop(2,ipoin))
dtpoi=min(dtadv,dtvis,dtcon)
deltp(ipoin)=dtpoi
dtmin=min(dtmin,dtpoi)
1220 continue
1210 continue
1200 continue
c$acc end parallel
c
dtmi1=1.001d+00*dtmin
c
c
c$acc kernels present(deltp)
c$acc loop reduction(max:ipmin) collapse(2)
do 1400 ipoiz=izmin,izmax
do 1410 ipoiy=iymin,iymax
ipo0z=nxny*(ipoiz-1)
ipo0y= nx*(ipoiy-1)+ipo0z
c
ipii0=ixmin+ipo0y
ipii1=ixmax+ipo0y
c
c!$gpu parallel do,reduction(max:ipmin)
c
c$acc loop reduction(max:ipmin)
do 1420 ipoin=ipii0,ipii1
if(deltp(ipoin).lt.dtmi1) then
ipmin=max(ipmin,ipoin)
endif
1420 continue
1410 continue
1400 continue
c
c$acc end kernels
27100, Generating implicit copy(dtmin)
Generating present(unkno(:,:),vicop(:,:))
Accelerator kernel generated
Generating Tesla code
27105, !$acc loop gang collapse(2) ! blockidx%x
27106, ! blockidx%x collapsed
Generating reduction(min:dtmin)
27114, !$acc loop vector(128) ! threadidx%x
Generating reduction(min:dtmin)
27100, Generating present(deltp(:))
27114, Loop is parallelizable
27132, Generating implicit copy(ipmin)
Generating present(deltp(:))
27134, Loop is parallelizable
27135, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
27134, !$acc loop gang collapse(2) ! blockidx%x
27135, ! blockidx%x collapsed
Generating reduction(max:ipmin)
27145, !$acc loop vector(128) ! threadidx%x
Generating reduction(max:ipmin)
27145, Loop is parallelizable
This code works just fine, however I would like to create a big kernel with both nested loops included. This loops are called thousands of times sequentially and if both loops are in the same kernel it would be possible to reduce the communication overhead created between cpu and gpu (I saw it using Pgprof).
My first approach is:
c$acc kernels present(unkno,vicop,deltp)
c
c -----loop over the points, computing the allowable timestep
c
c$acc loop reduction(min:dtmin) collapse(2)
do 1200 ipoiz=izmin,izmax
do 1210 ipoiy=iymin,iymax
ipo0z=nxny*(ipoiz-1)
ipo0y= nx*(ipoiy-1)+ipo0z
c
ipii0=ixmin+ipo0y
ipii1=ixmax+ipo0y
c
c$acc loop reduction(min:dtmin)
do 1220 ipoin=ipii0,ipii1
dtadv=coudx/max(epsil,csoun
& +cvelo*sqrt(unkno(2,ipoin)*unkno(2,ipoin)
& +unkno(3,ipoin)*unkno(3,ipoin)
& +unkno(4,ipoin)*unkno(4,ipoin)))
dtvis=couvi/max(epsil,vicop(1,ipoin))
dtcon=couco/max(epsil,vicop(2,ipoin))
dtpoi=min(dtadv,dtvis,dtcon)
deltp(ipoin)=dtpoi
dtmin=min(dtmin,dtpoi)
1220 continue
1210 continue
1200 continue
c
dtmi1=1.001d+00*dtmin
c
c
c$acc loop reduction(max:ipmin) collapse(2)
do 1400 ipoiz=izmin,izmax
do 1410 ipoiy=iymin,iymax
ipo0z=nxny*(ipoiz-1)
ipo0y= nx*(ipoiy-1)+ipo0z
c
ipii0=ixmin+ipo0y
ipii1=ixmax+ipo0y
c
c!$gpu parallel do,reduction(max:ipmin)
c
c$acc loop reduction(max:ipmin)
do 1420 ipoin=ipii0,ipii1
if(deltp(ipoin).lt.dtmi1) then
ipmin=max(ipmin,ipoin)
endif
1420 continue
1410 continue
1400 continue
c
c$acc end kernels
0, Accelerator kernel generated
Generating Tesla code
27100, Generating implicit copy(ipmin)
Generating present(unkno(:,:),vicop(:,:),deltp(:))
Generating implicit copy(dtmin)
27105, Parallelization requires privatization of deltp as well as last value
27106, Parallelization requires privatization of deltp as well as last value
Accelerator kernel generated
Generating Tesla code
27105, !$acc loop seq collapse(2)
27106, collapsed
27114, !$acc loop seq
27114, Parallelization requires privatization of deltp as well as last value
27126, Accelerator serial kernel generated
27132, Loop is parallelizable
27133, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
27132, !$acc loop gang collapse(2) ! blockidx%x
27133, ! blockidx%x collapsed
Generating reduction(max:ipmin)
27143, !$acc loop vector(128) ! threadidx%x
Generating reduction(max:ipmin)
27143, Loop is parallelizable
As it is seen in the compiler report the first loop is of the type “loop seq” and the performance obtained is really poor. When trying to solve this problem I added deltp as a private variable in the first loop:
c$acc kernels present(unkno,vicop,deltp)
c
c -----loop over the points, computing the allowable timestep
c
c$acc loop private(deltp) reduction(min:dtmin) collapse(2)
do 1200 ipoiz=izmin,izmax
do 1210 ipoiy=iymin,iymax
ipo0z=nxny*(ipoiz-1)
ipo0y= nx*(ipoiy-1)+ipo0z
c
ipii0=ixmin+ipo0y
ipii1=ixmax+ipo0y
c
c$acc loop private(deltp) reduction(min:dtmin)
do 1220 ipoin=ipii0,ipii1
dtadv=coudx/max(epsil,csoun
& +cvelo*sqrt(unkno(2,ipoin)*unkno(2,ipoin)
& +unkno(3,ipoin)*unkno(3,ipoin)
& +unkno(4,ipoin)*unkno(4,ipoin)))
dtvis=couvi/max(epsil,vicop(1,ipoin))
dtcon=couco/max(epsil,vicop(2,ipoin))
dtpoi=min(dtadv,dtvis,dtcon)
deltp(ipoin)=dtpoi
dtmin=min(dtmin,dtpoi)
1220 continue
1210 continue
1200 continue
c
... ... ...
c
c$acc end kernels
But I get the following error:
PGF90-S-0155-A data clause for a variable appears within another region with a data clause for the same variable deltp (fd_genpur.f: 27104)
PGF90-S-0155-A data clause for a variable appears within another region with a data clause for the same variable deltp (fd_genpur.f: 27113)
0 inform, 0 warnings, 2 severes, 0 fatal for fd_predidtacompd1_acc
I tried adding the private directive in several ways but I always get a similar type of error. Is there any way to create a big kernel with this two loops inside?
Thank you,
Alejandro