Hi Ilkhom,
In our MPI+OMP hybrid code we do such tricks: module variables are global within a node with shared memory, and communication between nodes is achieved via MPI.
No worries. I just wanted to make sure there wasn’t something in our documentation that needs to be clarified.
Note that I generally recommend using MPI+OpenACC for multi-GPU programming instead of OpenMP+OpenACC. It’s more straightforward since there’s a one-to-one association between an MPI rank and a GPU, as opposed to the one-to-many association with OpenMP; trying to manage data across multiple GPUs from one host process with many threads is tricky. Plus, some MPI implementations, such as Open MPI, support GPUDirect, so communication can happen directly between the GPUs rather than having to bring the data back to the host.
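If it helps, here’s a minimal sketch of that one-to-one rank-to-GPU binding in Fortran. This is just my illustration (the program and variable names are made up, and it assumes an MPI-3 library and at least one GPU per node): each rank finds its node-local rank via MPI_Comm_split_type and then claims its own device through the OpenACC runtime API.

! Sketch only: names are illustrative; assumes MPI-3 and >= 1 GPU per node
program rank_per_gpu
  use mpi
  use openacc
  implicit none
  integer :: ierr, myid, local_comm, local_rank, ngpus

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)

  ! Ranks on the same node get consecutive node-local ranks
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, local_comm, ierr)
  call MPI_Comm_rank(local_comm, local_rank, ierr)

  ! One-to-one mapping: node-local rank N drives GPU N
  ngpus = acc_get_num_devices(acc_device_nvidia)
  call acc_set_device_num(mod(local_rank, ngpus), acc_device_nvidia)

  ! ... OpenACC data and compute regions for this rank go here ...

  call MPI_Finalize(ierr)
end program rank_per_gpu

Once each rank owns exactly one device, the data and compute regions in the rest of the code need no special handling for multiple GPUs.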
I wrote this article a while ago on using MPI+OpenACC in Fortran, but it may still be useful: PGI Documentation Archive for Versions Prior to 17.7
There’s also this course: https://developer.nvidia.com/openacc-advanced-course. The examples use C, but the information applies to Fortran as well. It also covers CUDA-aware MPI / GPUDirect.
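To give a rough idea of what CUDA-aware MPI / GPUDirect looks like in Fortran (again, just my sketch, not code from the course; the program name and buffer are made up): the OpenACC host_data directive passes the device address of the buffer to MPI, so with a CUDA-aware MPI build the transfer can go GPU-to-GPU without updating the host copy first.

! Sketch only: run with at least two ranks and a CUDA-aware MPI
program gpudirect_sketch
  use mpi
  implicit none
  integer, parameter :: n = 1024
  integer :: ierr, myid, stat(MPI_STATUS_SIZE)
  double precision :: buf(n)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, myid, ierr)
  buf = dble(myid)

  !$acc data copy(buf)
  ! host_data exposes the device copy of buf to the MPI calls; with a
  ! CUDA-aware MPI (e.g. Open MPI built with GPUDirect) the transfer
  ! happens directly between the GPUs
  !$acc host_data use_device(buf)
  if (myid == 0) then
     call MPI_Send(buf, n, MPI_DOUBLE_PRECISION, 1, 0, MPI_COMM_WORLD, ierr)
  else if (myid == 1) then
     call MPI_Recv(buf, n, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, stat, ierr)
  end if
  !$acc end host_data
  !$acc end data

  call MPI_Finalize(ierr)
end program gpudirect_sketch

Without a CUDA-aware MPI you would instead drop the host_data region and update the host copy before and after the MPI calls.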
As for this error, it looks like a bug in the interaction of OpenACC with our older OpenMP runtime when using more than 2 threads. I’ve submitted an issue report (TPR#25976) and sent it to our engineers for investigation.
The good news is that the example works with our newer LLVM-based OpenMP runtime. Which version of the compilers are you using? With 18.4, the LLVM compilers are co-installed, so you can either set your PATH to “$PGI/linux86-64-llvm/18.4/bin” or add the “-Mllvm” flag.
% pgf90 -ta=tesla:cc70 -mp test.2.F90 -Minfo=accel -Vdev -Mllvm; a.out
main:
     31, Generating enter data create(c(:,:),a(:,:))
     32, Generating update device(c(:,:),a(:,:))
     36, Generating exit data delete(c(:,:),a(:,:))
suma:
      0, Accelerator kernel generated
         Generating Tesla code
     49, Generating present(a(:,:),c(:,:))
     50, Loop is parallelizable
         Accelerator serial kernel generated
         Accelerator kernel generated
         Generating Tesla code
     50, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
         !$acc loop gang, vector(32) ! blockidx%x threadidx%x
         Generating implicit reduction(other:a$r)
myid= 3 B= (200.0000000000000,200.0000000000000)
myid= 1 B= (200.0000000000000,200.0000000000000)
myid= 2 B= (200.0000000000000,200.0000000000000)
myid= 0 B= (200.0000000000000,200.0000000000000)
% a.out
myid= 2 B= (200.0000000000000,200.0000000000000)
myid= 3 B= (200.0000000000000,200.0000000000000)
myid= 0 B= (200.0000000000000,200.0000000000000)
myid= 1 B= (200.0000000000000,200.0000000000000)
% a.out
myid= 3 B= (200.0000000000000,200.0000000000000)
myid= 0 B= (200.0000000000000,200.0000000000000)
myid= 2 B= (200.0000000000000,200.0000000000000)
myid= 1 B= (200.0000000000000,200.0000000000000)
Hope this helps,
Mat