ACC loop with a Fortran subroutine call (acc routine worker) is not parallelized!

All the compute-intensive execution is in this subroutine, which is compiled as an acc routine worker.
The two collapsed loops that call it must be parallel at the gang or worker level.
The OMP directives are not active; they are for the CPU version.
We are in an acc kernels region, and we use a deep copy of the domain(:) derived type, with pointers to the partition/subpartition data segments.

!
! *** Main compute routine inside 2 nested OMP loops
! *** *****************************************************************
!
!$OMP PARALLEL DO DEFAULT(PRIVATE)
!$OMP1 SCHEDULE(STATIC,1)
!$OMP1 SHARED(domain,nparts,nsub,
!$OMP2 istep,isweep,istage,icomput)
!$acc loop collapse(2)
do ipart=1,nparts
!
!$OMP PARALLEL DO DEFAULT(PRIVATE)
!$OMP1 SCHEDULE(STATIC,1)
!$OMP1 SHARED(domain,ipart,nparts,nsub,
!$OMP2 istep,isweep,istage,icomput)
do isub=1,nsub
call Nxo_compute(istep,isweep,istage,icomput,
1 ipart,isub,nparts,nsub,
2 domain,domain(ipart)%subdomain(isub),
3 domain(ipart)%subdomain(isub)%problem)
enddo
!$OMP END PARALLEL DO
enddo
!$OMP END PARALLEL DO
!

Here is the declaration of the Nxo_compute subroutine:

  interface
  subroutine Nxo_compute(istep,isweep,istage,icomput,ipart,isub,
 1                         nparts,nsub,domain,subdomain,problem)

!$acc routine(Nxo_compute) worker
use Nxo_commun
#ifdef _OPENACC
use openacc
use cudafor
#endif
!
integer istep,isweep,istage,icomput,ipart,isub,nparts,nsub
type(domain_type) domain(*)
type(subdomain_type) subdomain
type(problem_type) problem
!
end subroutine
end interface

Hi jm.legouez,

Unfortunately there’s not enough information here to help, so I’ll need to ask some follow-up questions.

When you say that the subroutine call is not parallelized, how is this being determined?
Are you getting compiler feedback messages saying it’s not being parallelized? If so, what’s the message?

Also, you note that you are using “kernels”, but this is not shown in the code snippet. What is shown is:

A “loop” directive outside of a kernels or parallel region will be ignored. Hence this should be “!$acc parallel loop collapse(2) gang”.

Also, the snippet doesn’t show where you’ve put the loop directive within the routine itself. If it’s missing, then this could be the issue. The loop directive defines where the work is distributed across the workers/vectors within the routine. Without it, the work is performed redundantly by all workers/vectors.
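For illustration, here is a minimal sketch of that structure (the routine name, the array a, and the bounds below are hypothetical, not taken from your code):

      module demo_mod
      contains
      subroutine work_on(n, a)
!$acc routine worker
      integer, value :: n
      real :: a(n)
      integer :: i
! Without this loop directive the body runs redundantly on every
! worker/vector of the gang that made the call.
!$acc loop worker
      do i = 1, n
         a(i) = 2.0*a(i)
      end do
      end subroutine
      end module

      program demo
      use demo_mod
      integer, parameter :: n = 64, nparts = 4, nsub = 8
      real :: a(n, nsub, nparts)
      integer :: ipart, isub
      a = 1.0
! "parallel loop" (not a bare "loop") so the directive is honored:
! gang parallelism over the collapsed loops, worker parallelism
! inside the routine.
!$acc parallel loop gang collapse(2) copy(a)
      do ipart = 1, nparts
         do isub = 1, nsub
            call work_on(n, a(:, isub, ipart))
         end do
      end do
      print *, sum(a)
      end program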

If you can provide a minimal reproducing example, that would help quite a bit in determining the issue.

Thanks,
Mat

Hi Mat,
Thank you for your help.
Well, the executable is now constructed; I can’t really explain why, but it seems there were writes in a parallel loop.
However, it runs for a very long time on the machine, and the results are different from those on the CPU.
The profiler says:
[slurm 114605] spiro-g201-clu > pgprof -i pgprof.dat
======== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.87% 697.100s 3 232.367s 3.79550s 685.047s nextflow_1003_gpu
0.13% 878.60ms 2048 429.01us 159.49us 990.11us cudafor_lib_internals_partialmnvshflshfli4_
0.00% 34.628ms 2048 16.907us 14.591us 24.736us cudafor_lib_internals_partialmxvshflshfli4_
0.00% 5.9450ms 4114 1.4450us 1.1830us 3.1360us [CUDA memcpy DtoH]
0.00% 39.168us 36 1.0880us 1.0240us 1.7280us [CUDA memset]
0.00% 2.0800us 1 2.0800us 2.0800us 2.0800us [CUDA memcpy HtoD]

OK, since 1003 is the line number where the compute kernel is called:
!
!$acc kernels present(domain)
!
! *** Main compute routine inside 2 nested OMP loops
! *** *****************************************************************
!
!$OMP PARALLEL DO DEFAULT(PRIVATE)
!$OMP1 SCHEDULE(STATIC,1)
!$OMP1 SHARED(domain,nparts,nsub,
!$OMP2 istep,isweep,istage,icomput)
!$acc loop collapse(2)
do ipart=1,nparts
!
!$OMP PARALLEL DO DEFAULT(PRIVATE)
!$OMP1 SCHEDULE(STATIC,1)
!$OMP1 SHARED(domain,ipart,nparts,nsub,
!$OMP2 istep,isweep,istage,icomput)
do isub=1,nsub
call Nxo_compute(istep,isweep,istage,icomput,
1 ipart,isub,nparts,nsub,
2 domain,domain(ipart)%subdomain(isub),
3 domain(ipart)%subdomain(isub)%problem)
enddo
!$OMP END PARALLEL DO
enddo
!$OMP END PARALLEL DO
!$acc end kernels
!

The code is rather complex; here are some examples of the loops inside the Nxo_compute routine:

c Characteristic upwind scheme *************************************************************
if(icomput.eq.1) then
!$acc loop worker
do iblob_sf = 1,nblob_gon
c
icoel_strt = subdomain%nelint(iblob_sf-1)
nbvf_ste = subdomain%nelint(iblob_sf) - icoel_strt
c
ivf_lft = subdomain%intcel(icoel_strt+1)
ivf_rgh = subdomain%intcel(icoel_strt+2)
if(ivf_rgh.le.0) ivf_rgh = nblob_vf+2
c
nelint = min(nbvf_ste,nvect)
nel = 4*( (nelint-1)/4+1)
cph3g(1:nel) = 0.
cph1g(1:nel) = 0.
cph3d(1:nel) = 0.
cph1d(1:nel) = 0.
cgpx(1:nel) = 0.
cgpy(1:nel) = 0.
cgpz(1:nel) = 0.
cddd(1:nel) = 0.
rtbl(1:nel) = 0.
utbl(1:nel) = 0.
vtbl(1:nel) = 0.
wtbl(1:nel) = 0.
etbl(1:nel) = 0.
sabl(1:nel) = 0.
c
c *** Gather the coefficients of the spatial scheme for each cell in the stencil
c
#ifdef _OPENACC
!$acc loop vector
do iste=1,nelint
icoel = icoel_strt +iste
cph3g(iste) = subdomain%coe_gpu(icoel,1)
cph1g(iste) = subdomain%coe_gpu(icoel,2)
cph3d(iste) = subdomain%coe_gpu(icoel,3)
cph1d(iste) = subdomain%coe_gpu(icoel,4)
cgpx(iste) = subdomain%coe_gpu(icoel,5)
cgpy(iste) = subdomain%coe_gpu(icoel,6)
cgpz(iste) = subdomain%coe_gpu(icoel,7)
cddd(iste) = subdomain%coe_gpu(icoel,8)
enddo
#else
do iste=1,nelint
icoel = icoel_strt +iste
coeffs = subdomain%coeint(icoel)
cph3g(iste) = coeffs%cph3g
cph1g(iste) = coeffs%cph1g
cph3d(iste) = coeffs%cph3d
cph1d(iste) = coeffs%cph1d
cgpx(iste) = coeffs%cgpx
cgpy(iste) = coeffs%cgpy
cgpz(iste) = coeffs%cgpz
cddd(iste) = coeffs%cddd
enddo
#endif
c
c *** Gather the conservative variable fields in the stencil FVs
c
!$acc loop vector
do iste=1,nelint
icoel = icoel_strt +iste
c
iblob = subdomain%intcel(icoel)
if(iblob.gt.0 .and. iblob.le.nblob_vf) then
blob_vf_rcv = subdomain%blob_vf(iblob)
else if(iblob.lt.0) then
index_snd = -iblob
iprt_snd = mod(index_snd,nparts)
index_snd = (index_snd-iprt_snd) / nparts
isub_snd = mod(index_snd,nsub)
nod_snd = (index_snd-isub_snd) / nsub
iprt_snd = iprt_snd+1
isub_snd = isub_snd+1
blob_vf_rcv = domain(iprt_snd)%subdomain(isub_snd)
1 %blob_vf(nod_snd)
endif
rtbl(iste) = blob_vf_rcv%ro
utbl(iste) = blob_vf_rcv%u
vtbl(iste) = blob_vf_rcv%v
wtbl(iste) = blob_vf_rcv%w
etbl(iste) = blob_vf_rcv%eto
sabl(iste) = blob_vf_rcv%samu
if(ipart.eq.5.and.isub.eq.7.and.iblob_sf.eq.25) print*,
1 ' coefs',cph1g(iste),cph3g(iste),rtbl(iste),
2 etbl(iste)
if(iste.eq.1) then
vol_lft = blob_vf_rcv%vol
disw_lft = blob_vf_rcv%disw
ichunk_lft = subdomain%mtlev(ivf_lft)
nbgh_lft = mod(ichunk_lft,100)
endif
if(iste.eq.2) then
iblob_rgh = iblob
vol_rgh = blob_vf_rcv%vol
disw_rgh = blob_vf_rcv%disw
if(iblob.gt.0) then
ichunk_rgh = subdomain%mtlev(ivf_rgh)
else
ichunk_rgh = domain(iprt_snd)%subdomain(isub_snd)
1 %mtlev(nod_snd)
endif
nbgh_rgh = mod(ichunk_rgh,100)
endif
c
enddo
ekbl(1:nel) = 0.5*(utbl(1:nel)*utbl(1:nel) +
1 vtbl(1:nel)*vtbl(1:nel) +
2 wtbl(1:nel)*wtbl(1:nel))
eibl(1:nel) = etbl(1:nel) - ekbl(1:nel)
htbl(1:nel) = eibl(1:nel)*gammgp + ekbl(1:nel)
prbl(1:nel) = rtbl(1:nel)*gammg1*eibl(1:nel)
c

The worker loop is over the surfaces iblob_sf. Each has a stencil of variable size nelint, and we do vector processing on the elements of these stencils, either explicitly with !$acc loop vector or implicitly with array syntax a(1:nel) = b(1:nel)*c(1:nel).

Then there are reductions over the stencils:

        roo3g = sum(cph3g(1:nel) * rtbl(1:nel))
        rou3g = sum(cph3g(1:nel) * rtbl(1:nel)*utbl(1:nel))
        rov3g = sum(cph3g(1:nel) * rtbl(1:nel)*vtbl(1:nel))
        row3g = sum(cph3g(1:nel) * rtbl(1:nel)*wtbl(1:nel))
        roe3g = sum(cph3g(1:nel) * rtbl(1:nel)*etbl(1:nel))

and a lot of non-vector computations using these reduced values at the face; the results are then scattered to the left and right cells on either side of the face:


subdomain%blob_rhs(ivf_lft)%ro =
1 subdomain%blob_rhs(ivf_lft)%ro + flfr
subdomain%blob_rhs(ivf_rgh)%ro =
1 subdomain%blob_rhs(ivf_rgh)%ro - flfr

It seems that these last computations are repeated by all the vector threads (threadIdx.x), although the vector loops and the stencil reductions have terminated.
c

Thanks
Jean-Marie

Hi Jean-Marie,

This is just a guess, but it looks like you have several temp arrays, like:

cph3g(1:nel) = 0.
cph1g(1:nel) = 0.
cph3d(1:nel) = 0.

Are these automatics? Automatics are implicitly allocated upon entry into the subroutine. While we support this, the allocation is serialized and so can cause performance issues. Also, the default heap size on the device is quite small. While it can be increased by calling cudaDeviceSetLimit, and it seems that you’re not currently running out of heap (you’d get a runtime error if you did), it could become a problem at larger sizes.
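For reference, here is a hedged sketch of raising the device heap limit from CUDA Fortran; it must run before the first kernel launch, and the 512 MB value is only an example:

      program set_heap
      use cudafor
      integer :: istat
      integer(kind=cuda_count_kind) :: heapsize
! Example value only; pick a size that covers your automatics.
      heapsize = 512_8 * 1024 * 1024
      istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize)
      if (istat /= cudaSuccess) print *, 'cudaDeviceSetLimit:', istat
      end program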

Secondly, it doesn’t look like these arrays are private to the workers, since you don’t have a “private” clause on the worker loop. Hence, these arrays are shared by all workers, which could cause a race condition and is likely the cause of the incorrect answers.

What I’d recommend trying is creating a copy of the Nxo_compute routine and then putting the outer “isub” loop within this new subroutine (adjusting the arguments accordingly). This will allow you to hoist the automatics off the device. Then add these arrays to a private clause on the worker loop so each worker has its own copy.
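Roughly, and with hypothetical names (compute_part, the scratch array, and the simple loop body are placeholders, not your code), that restructuring looks like this:

      subroutine compute_part(ipart, nparts, nsub, nvect, a)
!$acc routine worker
      integer, value :: ipart, nparts, nsub, nvect
      real :: a(nvect, nsub, nparts)
! Fixed-size scratch (assuming nvect <= 128) instead of an automatic;
! made private per worker by the clause on the loop below.
      real :: scratch(128)
      integer :: isub, i
!$acc loop worker private(scratch)
      do isub = 1, nsub
!$acc loop vector
         do i = 1, nvect
            scratch(i) = 2.0*a(i, isub, ipart)
         end do
!$acc loop vector
         do i = 1, nvect
            a(i, isub, ipart) = scratch(i) + 1.0
         end do
      end do
      end subroutine

! ... and in the caller (nparts, nsub, nvect, a from the enclosing
! scope), only the outer loop remains:
!$acc parallel loop gang copy(a)
      do ipart = 1, nparts
         call compute_part(ipart, nparts, nsub, nvect, a)
      end do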

-Mat

Thanks Mat,
Is the idea of placing the routine call only in the outermost loop over domains (index ipart) to launch fewer instances in parallel and give access to more resources for the private arrays?
There is a need for 21 private vectors of length at most nvect=128 (the average is around 60, but it varies a lot among subdomains).
They are declared in the main program:

 parameter(nvect=128)
  real  rtbl(nvect),utbl(nvect),vtbl(nvect),wtbl(nvect)
  real  etbl(nvect),sabl(nvect),riem(nvect)
  real  eibl(nvect),ekbl(nvect),htbl(nvect),prbl(nvect)
  real  flur(nvect)

  real*4  cph3g(nvect),cph1g(nvect),cph3d(nvect),cph1d(nvect)
  real*4  cgpx(nvect),cgpy(nvect),cgpz(nvect),cddd(nvect)
  real*4  cphgdp(nvect)

Then the routine is called with:
!
!$acc kernels present(domain)
!
! *** Main compute routine inside 1 OMP loop
! *** *****************************************************************
!
!$OMP PARALLEL DO DEFAULT(PRIVATE)
!$OMP1 SCHEDULE(STATIC,1)
!$OMP1 SHARED(domain,nparts,istep,isweep,istage,icomput)

!$acc loop gang private(rtbl,utbl,vtbl,wtbl,etbl,sabl,riem,eibl,
!$acc1 ekbl,htbl,prbl,flur,cph3g,cph1g,cph3d,cph1d,
!$acc2 cgpx,cgpy,cgpz,cddd,cphgdp)

              do ipart=1,nparts
                 call Nxo_compute(istep,isweep,istage,icomput,
 1                                               ipart,nparts,
 2                                       domain,domain(ipart))
              enddo

!$OMP END PARALLEL DO
!
!$acc end kernels
!
Is this what you call hoisting the automatics off the device?

Jean-Marie

Well, the execution proceeds, but the results are different.
Are all the local variables in the acc routine private by default?
Also, how can we declare that the private arrays (vectors) in this routine belong to the innermost vector loops and should be treated as such (each element in a register of the threadIdx.x thread)?
Thanks

Hi Jean-Marie,

Is the idea of placing the routine call only in the outermost loop over domains (index ipart) to launch fewer instances in parallel and give access to more resources for the private arrays?

Sorry, but I’m not quite understanding the question. My suggestion above was mainly to avoid device-side allocation. With the additional code, it looks like these are fixed size, not automatics, which is better, but they will need to be private to each worker, not to the gang. However, since the number of workers won’t be known until runtime, the private arrays will still need to be allocated.

My suggestion is basically to not use a call here. Instead, move the “ipart” loop into Nxo_compute (alternatively, inline Nxo_compute into the ipart loop, but this would take more work). This way the private arrays can be allocated prior to launching the kernel instead of during the kernel’s execution.

Of course, this is just a suggestion, since I only have a very limited view of your program. I’m focused on the wrong answers, which are often due to race conditions (likely the temp arrays not being private to the workers).

-Mat

Hi, Mat,
I am back on this project.
If I declare the arrays as private before entering the collapsed loops with the routine call, and declare them as arguments to the subroutine, the code runs, but only some of these private arrays are used correctly; the others contain garbled data.

Here is the main program (nvect=128 is a parameter in the module):

!
!$acc kernels present(domain)
!
c *** Main compute routine inside 2 nested OMP loops
c *** *****************************************************************
C
!$OMP PARALLEL DO DEFAULT(PRIVATE)
!$OMP1 SCHEDULE(STATIC,1)
!$OMP1 SHARED(domain,nparts,nsub,
!$OMP2 istep,isweep,istage,icomput)

!$acc loop gang collapse(2)
!$acc1 private(rtbl(nvect),utbl(nvect),vtbl(nvect),wtbl(nvect),
!$acc2 etbl(nvect),sabl(nvect),eibl(nvect),ekbl(nvect),
!$acc3 htbl(nvect),prbl(nvect),
!$acc4 cph3g(nvect),cph1g(nvect),cph3d(nvect),cph1d(nvect),
!$acc5 cgpx(nvect), cgpy(nvect), cgpz(nvect), cddd(nvect) )

              do ipart=1,nparts

!$OMP PARALLEL DO DEFAULT(PRIVATE)
!$OMP1 SCHEDULE(STATIC,1)
!$OMP1 SHARED(domain,ipart,nparts,nsub,
!$OMP2 istep,isweep,istage,icomput)

                 do isub=1,nsub

                    call Nxo_compute(istep,isweep,istage,icomput,
 1                                        ipart,isub,nparts,nsub,
 2                          domain,domain(ipart)%subdomain(isub),
 3                         domain(ipart)%subdomain(isub)%problem,
 4             rtbl,utbl,vtbl,wtbl,etbl,sabl,eibl,ekbl,htbl,prbl,
 5                  cph3g,cph1g,cph3d,cph1d,cgpx,cgpy,cgpz,cddd )

                 enddo

!$OMP END PARALLEL DO
enddo
!$OMP END PARALLEL DO
c
!$acc end kernels

The subroutine header is:
c ******************************************************************
subroutine Nxo_compute(istep,isweep,istage,icomput,
1 ipart,isub,nparts,nsub,
2 domain,subdomain,problem,
4 rtbl,utbl,vtbl,wtbl,etbl,sabl,eibl,ekbl,htbl,prbl,
5 cph3g,cph1g,cph3d,cph1d,cgpx,cgpy,cgpz,cddd )

c
!$acc routine vector
c
USE Nxo_commun
USE omp_lib
#ifdef _OPENACC
use openacc
use cudafor
c use wmma
#endif
use ieee_arithmetic
c
integer istep,isweep,istage,icomput,ipart,nparts,nsub
type(domain_type) domain(nparts)
type(subdomain_type) subdomain
type(problem_type) problem
c
type(vf_type) blob_vf_rcv
type(coeffs_type) coeffs

  real  rtbl(nvect),utbl(nvect),vtbl(nvect),wtbl(nvect)
  real  etbl(nvect),sabl(nvect)
  real  eibl(nvect),ekbl(nvect),htbl(nvect),prbl(nvect)

c
real*4 cph3g(nvect),cph1g(nvect),cph3d(nvect),cph1d(nvect)
real*4 cgpx(nvect), cgpy(nvect), cgpz(nvect), cddd(nvect)
c
c *** Sub-domain Solver
c
c *** data for the sequential loop (not vectorized, how to place it in shared memory ?)
c
sa_cb1 = 0.1355
sa_sigma = 2./3.

… Here is how the private arrays are used:

c *** Gather the coefficients of the spatial scheme
c
#ifdef _OPENACC

c *** Coalesced data access on GPU

!$acc loop vector
do iste=1,nelint
icoel = icoel_strt +iste
cph3g(iste) = subdomain%coe_gpu(icoel,1)
cph1g(iste) = subdomain%coe_gpu(icoel,2)
cph3d(iste) = subdomain%coe_gpu(icoel,3)
cph1d(iste) = subdomain%coe_gpu(icoel,4)
cgpx(iste) = subdomain%coe_gpu(icoel,5)
cgpy(iste) = subdomain%coe_gpu(icoel,6)
cgpz(iste) = subdomain%coe_gpu(icoel,7)
cddd(iste) = subdomain%coe_gpu(icoel,8)
enddo
#else
do iste=1,nelint
icoel = icoel_strt +iste
coeffs = subdomain%coeint(icoel)
cph3g(iste) = coeffs%cph3g
cph1g(iste) = coeffs%cph1g
cph3d(iste) = coeffs%cph3d
cph1d(iste) = coeffs%cph1d
cgpx(iste) = coeffs%cgpx
cgpy(iste) = coeffs%cgpy
cgpz(iste) = coeffs%cgpz
cddd(iste) = coeffs%cddd
enddo
#endif
c
c *** Gather the conservative variable fields in the stencil FVs
c
!$acc loop vector
do iste=1,nelint
icoel = icoel_strt +iste
c
iblob = subdomain%intcel(icoel)
if(iblob.gt.0 .and. iblob.le.nblob_vf) then
blob_vf_rcv = subdomain%blob_vf(iblob)
else if(iblob.lt.0) then
index_snd = -iblob
iprt_snd = mod(index_snd,nparts)
index_snd = (index_snd-iprt_snd) / nparts
isub_snd = mod(index_snd,nsub)
nod_snd = (index_snd-isub_snd) / nsub
iprt_snd = iprt_snd+1
isub_snd = isub_snd+1
blob_vf_rcv = domain(iprt_snd)%subdomain(isub_snd)
1 %blob_vf(nod_snd)
endif
rtbl(iste) = blob_vf_rcv%ro
utbl(iste) = blob_vf_rcv%u
vtbl(iste) = blob_vf_rcv%v
wtbl(iste) = blob_vf_rcv%w
etbl(iste) = blob_vf_rcv%eto
sabl(iste) = blob_vf_rcv%samu
if(iste.eq.1) then
vol_lft = blob_vf_rcv%vol
disw_lft = blob_vf_rcv%disw
ichunk_lft = subdomain%mtlev(ivf_lft)
nbgh_lft = mod(ichunk_lft,100)
endif
if(iste.eq.2) then
vol_rgh = blob_vf_rcv%vol
disw_rgh = blob_vf_rcv%disw
if(iblob.gt.0) then
ichunk_rgh = subdomain%mtlev(ivf_rgh)
else
ichunk_rgh = domain(iprt_snd)%subdomain(isub_snd)
1 %mtlev(nod_snd)
endif
nbgh_rgh = mod(ichunk_rgh,100)
endif
c
enddo
!

These private arrays (vectors) should be vectorized for the coalesced data transfer (first phase, gathering the coefficients of the spatial scheme).
The second phase cannot use coalesced data transfer, since there is an indirection (unstructured grid in this subdomain and/or access to data in other subdomains).
Is adding the private vectors to the argument list correct? Some prints in the parallel routine show that the first phase is correct.
For the second one, the first 32 values in vector rtbl, for example, are all the same, then values 33 to 64, and so on.
Thanks for your help,

Jean-Marie

Hi Jean-Marie,

I do see one potential issue. While scalars are private by default, aggregate types are not. Hence “blob_vf_rcv” will be shared amongst the vectors. You’ll want to privatize it.

!$acc loop vector private(blob_vf_rcv)
do iste=1,nelint
icoel = icoel_strt +iste

Also, I don’t see where “iblob”, “icoel”, or the “*_snd”, “*_rgh”, and “*_lft” variables are declared. Are these “Nxo_commun” module variables, implicitly declared, or just omitted from the snippet? If they are in a module, then they are global variables and hence shared by all threads. They too would then need to be privatized on the vector loop or declared locally in the routine.

-Mat

Hi, Mat,
The aggregate type not being private to the vector loop could be a first explanation; I will check it. The other variables that you mention are local variables of the routine, so they should be private to it, and if they appear in the vector loops, are they private to the loop iterations by default?

Is there a way to have the private vectors “vectorized”? The compiler says they will be placed in shared memory: shared by the inner thread block? Or warp? I am not sure of the wording.
Thanks,
Jean-Marie

Scalars are private by default. The exceptions are when the scalar is a global variable (i.e., in a module) or is passed by address (the default in Fortran) to a subroutine.

Is there a way to have the private vectors “vectorized”? The compiler says they will be placed in shared memory: shared by the inner thread block?

The compiler will attempt to place gang-private arrays in shared memory, assuming the size is known at compile time and they fit. However, for this program, since there are so many gang-private arrays, it’s unlikely that the compiler will be able to put them all in shared memory. If you add the “-Minfo=accel” flag, the compiler feedback messages will print “CUDA shared memory used for …” if it was able to use shared memory.

There is the OpenACC “cache” directive, where you can specify which arrays to place in shared memory. It’s been problematic to implement, but you can try this as well for a few key private arrays, presuming that they fit. Though keep in mind, using a lot of shared memory can reduce the kernel’s occupancy and therefore impact performance.
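For what it’s worth, here is a minimal sketch of the cache directive; the names (tmp, a, m) are placeholders, and the directive goes at the top of the loop whose iterations reuse the array:

      real :: tmp(128)
!$acc parallel loop gang private(tmp) present(a)
      do j = 1, m
! Ask for the gang-private tmp to be staged in shared memory.
!$acc cache(tmp(1:128))
!$acc loop vector
         do i = 1, 128
            tmp(i) = 2.0*a(i, j)
         end do
!$acc loop vector
         do i = 1, 128
            a(i, j) = tmp(i) + 1.0
         end do
      end do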

-Mat

Mat,

The compiler accepts that the aggregate type blob_vf_rcv be private,
but there is now an error message:

call to cuStreamSynchronize returned error 719: Launch failed (often invalid pointer dereference)
call to cuMemFreeHost returned error 719: Launch failed (often invalid pointer dereference)

c
c *** Gather the conservative variable fields in the stencil FVs
c
!$acc loop vector private(blob_vf_rcv)
!!!$acc loop seq
do iste=1,nelint
icoel = icoel_strt +iste
c
iblob = subdomain%intcel(icoel)
if(iblob.gt.0 .and. iblob.le.nblob_vf) then
blob_vf_rcv = subdomain%blob_vf(iblob)
else if(iblob.lt.0) then
index_snd = -iblob
iprt_snd = mod(index_snd,nparts)
index_snd = (index_snd-iprt_snd) / nparts
isub_snd = mod(index_snd,nsub)
nod_snd = (index_snd-isub_snd) / nsub
iprt_snd = iprt_snd+1
isub_snd = isub_snd+1
blob_vf_rcv = domain(iprt_snd)%subdomain(isub_snd)
1 %blob_vf(nod_snd)
endif
rtbl(iste) = blob_vf_rcv%ro
utbl(iste) = blob_vf_rcv%u
vtbl(iste) = blob_vf_rcv%v
wtbl(iste) = blob_vf_rcv%w
etbl(iste) = blob_vf_rcv%eto
sabl(iste) = blob_vf_rcv%samu
if(iste.eq.1) then
vol_lft = blob_vf_rcv%vol
disw_lft = blob_vf_rcv%disw
ichunk_lft = subdomain%mtlev(ivf_lft)
nbgh_lft = mod(ichunk_lft,100)
endif
if(iste.eq.2) then
vol_rgh = blob_vf_rcv%vol
disw_rgh = blob_vf_rcv%disw
if(iblob.gt.0) then
ichunk_rgh = subdomain%mtlev(ivf_rgh)
else
ichunk_rgh = domain(iprt_snd)%subdomain(isub_snd)
1 %mtlev(nod_snd)
endif
nbgh_rgh = mod(ichunk_rgh,100)
endif
c
enddo

The code runs fine if this loop is declared sequential, which means that the private vectors rtbl, utbl,… are private to the routine, but it is still slow.
This loop is crucial for performance; it is where most of the stencil data is accessed. But is it worth having a vector treatment of this loop, with the indirection subdomain%intcel(icoel), the branching depending on the sign of iblob, and the special treatment for indices 1 and 2?

Placing the vectors in cache with the acc cache directive did not work. I coded it this way:

  real  rtbl(nvect),utbl(nvect),vtbl(nvect),wtbl(nvect)
  real  etbl(nvect),sabl(nvect)
  real  eibl(nvect),ekbl(nvect),htbl(nvect),prbl(nvect)

c
real*4 cph3g(nvect),cph1g(nvect),cph3d(nvect),cph1d(nvect)
real*4 cgpx(nvect), cgpy(nvect), cgpz(nvect), cddd(nvect)

!!!$acc cache(rtbl,utbl,vtbl,wtbl,
!!!$acc2 etbl,sabl,eibl,ekbl,
!!!$acc3 htbl,prbl,
!!!$acc4 cph3g,cph1g,cph3d,cph1d,
!!!$acc5 cgpx, cgpy, cgpz, cddd )
c

after the declarations at the entry to the routine, and the compiler reports:

PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol - cddd (Nxo_compute_pgi.f: 1)
nxo_compute:
2, Generating Tesla code
2, Cached references to size [128] block of cgpy,cgpx,cddd,cgpz,cph1g,cph1d,cph3d,htbl,etbl,eibl,cph3g,rtbl,ekbl,wtbl,vtbl,utbl,sabl,prbl

The total size of these 18 vectors is 14336 bytes for the maximum stencil size of 128 (most applications require only 64). I infer that it is important for performance that they are vectorized (one vector index in a register of each thread of the thread block).

Thanks
Jean-Marie

Hi Jean-Marie,

In the past we have had issues with the direct copy of aggregate types. While we have fixed the instances I’m aware of, it may be possible that the same issue is occurring here. I have no idea if this is indeed the case, but you might try assigning the individual data members to see if that works around the error. Otherwise, I’d need a reproducible example to understand where the issue may be occurring.
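Concretely, for the first branch of your gather loop that could look like the following (the else branch would be rewritten the same way); this is only a sketch of the workaround, reusing the member names from your snippet:

c     instead of  blob_vf_rcv = subdomain%blob_vf(iblob)
c     copy the members straight into the private scratch arrays:
      rtbl(iste) = subdomain%blob_vf(iblob)%ro
      utbl(iste) = subdomain%blob_vf(iblob)%u
      vtbl(iste) = subdomain%blob_vf(iblob)%v
      wtbl(iste) = subdomain%blob_vf(iblob)%w
      etbl(iste) = subdomain%blob_vf(iblob)%eto
      sabl(iste) = subdomain%blob_vf(iblob)%samu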

PGF90-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol - cddd (Nxo_compute_pgi.f: 1)

Again, the “cache” directive has been problematic to implement, but I believe the correct placement for this would be in the main routine, just before the call. No idea if this will work around the error, but worth a try.

The available shared memory size is 49152 bytes, so they should fit; however, it will lower your theoretical occupancy to 33%. It may be better for performance not to use shared memory (software-managed cache) and instead rely on hardware L2 caching. Early NVIDIA devices relied heavily on software-managed cache, but as hardware caching improved, it’s no longer as important. That’s not to say it can’t help, it can, just that you may not see gains given the amount used, since it causes fewer blocks to be able to run simultaneously on an SM.

-Mat

Hi,
I have switched to the latest HPC SDK; I was using the pgf90 Fortran compiler from PGI version 18 until now.
With nvfortran, I get other error messages, such as: passing arguments by reference prevents parallelization of the loops containing the call to the compute-intensive acc routine.
Here is the interface:
c *** ***************************************************
interface
subroutine Nxo_compute(ipart,isub,nparts,nsub,
1 istep,isweep,istage,icomput,
2 domain,subdomain,problem,
4 rtbl,utbl,vtbl,wtbl,etbl,sabl,
5 eibl,ekbl,htbl,prbl,
6 cph3g,cph1g,cph3d,cph1d,cgpx,cgpy,cgpz,cddd)

c
!$acc routine vector
!!!$acc1 private(rtbl(nvect),utbl(nvect),vtbl(nvect),wtbl(nvect),
!!!$acc2 etbl(nvect),sabl(nvect),eibl(nvect),ekbl(nvect),
!!!$acc3 htbl(nvect),prbl(nvect),
!!!$acc4 cph3g(nvect),cph1g(nvect),cph3d(nvect),cph1d(nvect),
!!!$acc5 cgpx(nvect), cgpy(nvect), cgpz(nvect), cddd(nvect) )
c
USE Nxo_commun
USE omp_lib
#ifdef _OPENACC
use openacc
use cudafor
c use wmma
#endif
use ieee_arithmetic
c
real rtbl(nvect),utbl(nvect),vtbl(nvect),wtbl(nvect)
real etbl(nvect),sabl(nvect)
real eibl(nvect),ekbl(nvect),htbl(nvect),prbl(nvect)
c
real*4 cph3g(nvect),cph1g(nvect),cph3d(nvect),cph1d(nvect)
real*4 cgpx(nvect), cgpy(nvect), cgpz(nvect), cddd(nvect)
c
integer,value :: nparts,nsub
integer,value :: istep,isweep,istage,icomput
type( domain_type) domain(nparts)
type(subdomain_type) subdomain
type( problem_type) problem
c
end subroutine
end interface
c *** ***************************************************

I declared these integers with the ‘value’ attribute and obtained an executable.
integer,value :: nparts,nsub
integer,value :: istep,isweep,istage,icomput

Now the two loop indices ipart and isub of the collapsed loop are also in the argument list; should they also be declared as ‘value’?

The code runs, but the private vectors rtbl(nvect),… are overwritten; they don’t seem to be private any more.
Here is the beginning of the source code for the routine. It was giving correct results with the old PGI 18, although very slowly; the routine execution might not have been parallel.
c ******************************************************************
subroutine Nxo_compute(ipart,isub,nparts,nsub,
1 istep,isweep,istage,icomput,
2 domain,subdomain,problem,
4 rtbl,utbl,vtbl,wtbl,etbl,sabl,
5 eibl,ekbl,htbl,prbl,
6 cph3g,cph1g,cph3d,cph1d,cgpx,cgpy,cgpz,cddd)

c
!$acc routine vector
c
USE Nxo_commun
USE omp_lib
#ifdef _OPENACC
use openacc
use cudafor
c use wmma
#endif
use ieee_arithmetic
c
integer ipart,isub,nparts,nsub
integer istep,isweep,istage,icomput
type(domain_type) domain(nparts)
type(subdomain_type) subdomain
type(problem_type) problem
c
type(vf_type) blob_vf_rcv
type(coeffs_type) coeffs
c
c *** Bnd treatment tags : 1 = no slip wall, 2 = far_field, 3 = symmetry
c
CCCDIR$ ATTRIBUTES ALIGN:64 :: rtbl,utbl,vtbl,wtbl,etbl,sabl
CCCDIR$ ATTRIBUTES ALIGN:64 :: eibl,ekbl,htbl,prbl
CCCDIR$ ATTRIBUTES ALIGN:64 :: cph3g,cph1g,cph3d,cph1d
CCCDIR$ ATTRIBUTES ALIGN:64 :: cgpx,cgpy,cgpz,cddd
CCCDIR$ ATTRIBUTES ALIGN:64 :: blob_vf_rcv, coeffs
c
real rtbl(nvect),utbl(nvect),vtbl(nvect),wtbl(nvect)
real etbl(nvect),sabl(nvect)
real eibl(nvect),ekbl(nvect),htbl(nvect),prbl(nvect)
c
real*4 cph3g(nvect),cph1g(nvect),cph3d(nvect),cph1d(nvect)
real*4 cgpx(nvect), cgpy(nvect), cgpz(nvect), cddd(nvect)

!!!$acc cache(rtbl,utbl,vtbl,wtbl,etbl,sabl,eibl,ekbl,htbl,prbl,
!!!$acc1 cph3g,cph1g,cph3d,cph1d,cgpx,cgpy,cgpz,cddd)
c
c *** Sub-domain Solver
c
c *** data for the sequential loop (not vectorized, shared memory ?)
c
sa_cb1 = 0.1355
sa_sigma = 2./3.
aprtsa = 1./sa_sigma
sa_cb2 = 0.622

Thanks,

Jean-Marie

Makes sense. If the loop index variables are passed by reference, then they could be changed. Using the “value” attribute is the correct solution.

Though, don’t you also need to add the value attribute to the subroutine’s definition as well, not just the interface? Otherwise the caller will pass it by value, but the callee will be expecting it by reference.

Now the 2 loop indices ipart and isub of the collapsed loop are also in the argument list, should they also be declared as ‘value’.

In general, I’ll use “value” on all read-only scalars being passed to a device routine. Besides preventing parallelization when passing in loop indices, passing by reference can inhibit the compiler’s ability to implicitly privatize the scalar, since the compiler must assume a global reference is taken. In most cases it won’t matter correctness-wise, but it may slightly impact performance, as the variable needs to be fetched from global memory rather than passed on the stack.
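As a small sketch (a hypothetical routine, not yours), the value declarations have to match on both sides:

! in the caller's interface block:
      interface
         subroutine scale_it(n, fac, a)
!$acc routine vector
         integer, value :: n
         real, value :: fac
         real :: a(n)
         end subroutine
      end interface

! and identically in the file that defines the routine:
      subroutine scale_it(n, fac, a)
!$acc routine vector
      integer, value :: n
      real, value :: fac
      real :: a(n)
      integer :: i
!$acc loop vector
      do i = 1, n
         a(i) = fac*a(i)
      end do
      end subroutine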

The code runs, but the private vectors rtbl(nvect),… are overwritten, they don’t seem to be private any more.

How are you determining that the arrays aren’t getting privatized?

I assume you still have the private clause with these arrays on the outer “gang” loop and are not using “cache”? If so, it’s very likely that these arrays are still gang private and any issues are due to some other factor. Granted, I don’t have access to the code, so I can’t be sure, nor know what would be causing it.

One favor to ask is that you encapsulate code snippets within a preformatted text region (i.e. the “</>” symbol). Code in free text can be somewhat challenging to read.

-Mat

Hi, Thanks for your answers,

I have verified the coding against the older version that worked OK with the PGI version 18 compiler, and found the explanation.
In the former version, the loop in the compute routine where we gather data from the stencil through the unstructured indirection, using an aggregate type blob_vf_rcv, was declared sequential, so it worked.
To make it parallel, I added an array of that aggregate type, of the vector size nvect, before calling the routine, like the other private vectors.
The header is now:
c ******************************************************************
subroutine Nxo_compute(istep,isweep,istage,icomput,
1 ipart,isub,nparts,nsub,
2 domain,subdomain,problem,
3 rtbl,utbl,vtbl,wtbl,etbl,sabl,eibl,ekbl,htbl,prbl,
4 cph3g,cph1g,cph3d,cph1d,cgpx,cgpy,cgpz,cddd,blob_rcv)

c
!$acc routine vector
c
USE Nxo_commun
USE omp_lib
#ifdef _OPENACC
use openacc
use cudafor
c use wmma
#endif
use ieee_arithmetic
c
integer,value :: istep,isweep,istage,icomput
integer,value :: isub,ipart,nsub,nparts
type(domain_type) domain(nparts)
type(subdomain_type) subdomain
type(problem_type) problem
c
type(vf_type) blob_rcv(nvect)
type(coeffs_type) coeffs
c
real rtbl(nvect),utbl(nvect),vtbl(nvect),wtbl(nvect)
real etbl(nvect),sabl(nvect)
real eibl(nvect),ekbl(nvect),htbl(nvect),prbl(nvect)
c
real*4 cph3g(nvect),cph1g(nvect),cph3d(nvect),cph1d(nvect)
real*4 cgpx(nvect), cgpy(nvect), cgpz(nvect), cddd(nvect)

All these vectors are declared private and the code runs OK, but it is still slow.

An issue is that this compute routine has an initial part that is vectorized, with stencil computations and a final reduction of stencil volume data towards the interface in the middle of it. This is then followed by a number of computations on the reduced data.
The outer loop in this routine is sequential over these interfaces; it cannot be made a worker loop, as this would require too much private data. Is there a way to switch the context within this same routine, so that the inner vector loop then runs over the interfaces, with arrays of the reduced data, and without too large a footprint in shared memory?
This is complex and might require a GPU hackathon, but since I am developing this code on my own…

Thanks,
Jean-Marie

Hi Jean-Marie,

Would it be possible for you to share a complete example such as a miniApp with the main compute routine extracted with driver code? My guess is the code is proprietary so couldn’t be posted publicly, but we can arrange an alternate method for you to send it to me.

A hackathon would be beneficial since we could then assign a mentor who is an expert in CFD codes. I only have limited experience in CFD, so I may not be able to offer the best advice on how to refactor the code. Though with a miniApp, I could take a first pass and then bring in others who do have this experience.

-Mat

Hi, Mat,
Thanks for your suggestion. I have prepared a miniApp with all the representative parts of the code, both a CPU and an ACC version, on a single node with a single accelerator. In fact, I already transmitted a former version to François Courteille of NVIDIA France, with whom I have been in contact for many years. I could transmit it to you; this miniApp is an extract of a research code, so there is no protection associated with it.

Regards,
Jean-Marie
