Questions on MPI direct memory access with OpenACC

Hello,

I am learning to use multiple GPUs with OpenACC and MPI and have some questions about direct memory access.

In the attached mini app I have two ranks. Rank 0 does some “computation”, pickles the data into a binary stream on one GPU, and then sends the data to rank 1. Rank 1 runs on a second GPU; it receives the data, unpickles it, and displays it on the host.

For rank 1 it seems that the data in “rbuf” is correctly received, but somehow I still need to update the host copy of “rbuf”. My question is: does the direct memory access receive the data and update it on both the device and the host, or only on the device, so that I have to do a manual update of the data on the host?

I am using NVHPC SDK 22.3 with two Quadro P400 cards.

Thanks,
Feng

source code and compile script:

main.cpp (4.1 KB)
compile.sh (67 Bytes)

Hi Feng,

In this case, since you’re passing a device pointer to the MPI calls (via the host_data construct), CUDA Aware MPI will copy the data directly between the devices. Hence, if you need the data back on the host, you do need to explicitly copy it back via the “update self” directive.
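
To illustrate the pattern, here’s a minimal sketch (with hypothetical names “buf”, “n”, and “tag”; it assumes “buf” is already present on the device via a data directive, and that your MPI library is CUDA aware):

    // host_data exposes the device address of "buf", so MPI transfers
    // directly into device memory; the host copy is not touched.
    #pragma acc host_data use_device (buf)
    {
        MPI_Recv( buf, n, MPI_BYTE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
    }
    // Explicitly bring the received data back to the host copy.
    #pragma acc update self(buf[0:n])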

Given “rbuf” isn’t used on the device, you probably don’t need to add it to any OpenACC directives at all, nor send it via CUDA Aware MPI. Though I’m assuming you’re using this as test code, so you may need it later for your full application, in which case it’s fine.

-Mat

Hi Mat,

Thanks for your reply and confirmation!

Thanks,
Feng

Hi Mat,

I have updated my mini app a bit so that a series of messages is sent/received between two GPUs. I am still relying on direct memory access, but it seems the messages are not correctly received. The messages are now sent like this:

    for(ia=0; ia<na; ia++)
    {
        for(id=0; id<ncpu; id++)
        {
            #pragma acc host_data use_device (sbuf[ia][id])
            {
                MPI_Isend( sbuf[ia][id],ssize,MPI_BYTE,1,tag[ia][id],MPI_COMM_WORLD,&(sreq[ia][id]));
            }
        }
        cout << "rank " << rank << " send message with size " << ssize << "\n";
    }

and the messages are received and checked like this:

    for(ia=0; ia<na; ia++)
    {
        for(id=0; id<ncpu; id++)
        {
            #pragma acc host_data use_device (rbuf[ia][id])
            {
                MPI_Irecv( rbuf[ia][id],rsize,MPI_BYTE,0,tag[ia][id],MPI_COMM_WORLD,&(rreq[ia][id]));
            }

            MPI_Wait( &(rreq[ia][id]),&status );
            #pragma acc update self(rbuf[ia][id][0:rsize])
            cout << "rank " << rank << " receive message with size " << rsize << "\n";
            len=0;
            unsafeunpckle( &len, nv, nq, rdata, rbuf[ia][id] );

            for(int i=0; i<nv*nq; i++) cout << rdata[i] << " check rbuf ==== \n";
        }
    }

Have I missed anything in my code?

Many thanks in advance.
Feng

The complete code is:
main.cpp (6.5 KB)

Hi Feng,

Two issues. First, there’s the triplet notation issue with the update directive, as noted in your other post. Second, we don’t support sub-components of aggregate types in “host_data”, so you’ll want to use a temp pointer. For example:

    pickle_t sptr = sbuf[ia][id];
    #pragma acc host_data use_device (sptr)
    {
        MPI_Isend( sptr,ssize,MPI_BYTE,1,tag[ia][id],MPI_COMM_WORLD,&(sreq[ia][id]));
    }
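
The receive side needs the same change; for example, with a temp pointer “rptr”:

    // Sketch of the matching fix on the receiving rank.
    pickle_t rptr = rbuf[ia][id];
    #pragma acc host_data use_device (rptr)
    {
        MPI_Irecv( rptr,rsize,MPI_BYTE,0,tag[ia][id],MPI_COMM_WORLD,&(rreq[ia][id]));
    }

The “update self” directive can keep referencing rbuf[ia][id] directly, since the sub-component restriction only applies to “host_data”.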

Full changes in the attached file:
main.cpp (6.6 KB)

% mpicxx -acc main.cpp -Minfo=accel -cuda -V22.3
compute(int, int, double *):
     51, Generating present(sdata[:nq*nv])
         Generating NVIDIA GPU code
         51, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
main:
    101, Generating enter data copyin(sbuf[:na])
    106, Generating enter data copyin(sbuf->[:ncpu])
    110, Generating enter data copyin(sbuf->->[:ssize])
    116, Generating enter data copyin(sdata[:nq*nv])
    125, Generating update self(sbuf[ia][id][:ssize])
    166, Generating enter data copyin(rbuf[:na])
    171, Generating enter data copyin(rbuf->[:ncpu])
    175, Generating enter data copyin(rbuf->->[:rsize])
    198, Generating update self(rbuf[ia][id][:rsize])
    214, Generating exit data delete(sbuf->->[:ssize])
    216, Generating exit data delete(sbuf->[:ncpu])
    218, Generating exit data delete(sbuf[:na])
    226, Generating exit data delete(rbuf->->[:ssize])
    228, Generating exit data delete(rbuf->[:ncpu])
    230, Generating exit data delete(rbuf[:na])
void pcklegpu<double>(unsigned long *, int, int, const T1 *, char *):
     25, Generating enter data create(tmp[:nq*nv])
         Generating present(var[:nq*nv],tmp[:nq*nv])
         Generating NVIDIA GPU code
         25, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
     35, Generating exit data delete(tmp[:nq*nv])
% mpirun -np 2 ./a.out
Num GPUs: 8
rank 1 use GPU 1
Num GPUs: 8
rank 0 use GPU 0
-0 check sbuf before sending====
-10000 check sbuf before sending====
-20000 check sbuf before sending====
-30000 check sbuf before sending====
-40000 check sbuf before sending====
-0 check sbuf before sending====
-10000 check sbuf before sending====
-20000 check sbuf before sending====
-30000 check sbuf before sending====
-40000 check sbuf before sending====
-0 check sbuf before sending====
-10000 check sbuf before sending====
-20000 check sbuf before sending====
-30000 check sbuf before sending====
-40000 check sbuf before sending====
-0 check sbuf before sending====
-10000 check sbuf before sending====
-20000 check sbuf before sending====
-30000 check sbuf before sending====
-40000 check sbuf before sending====
-0 check sbuf before sending====
-10000 check sbuf before sending====
-20000 check sbuf before sending====
-30000 check sbuf before sending====
-40000 check sbuf before sending====
-0 check sbuf before sending====
-10000 check sbuf before sending====
-20000 check sbuf before sending====
-30000 check sbuf before sending====
-40000 check sbuf before sending====
rank 0 send message with size 40
rank 0 send message with size 40
rank 0 send message with size 40
----------------finish sending message-----------------------
rank 1 receive message with size 40
-0 check rbuf ====
-10000 check rbuf ====
-20000 check rbuf ====
-30000 check rbuf ====
-40000 check rbuf ====
rank 1 receive message with size 40
-0 check rbuf ====
-10000 check rbuf ====
-20000 check rbuf ====
-30000 check rbuf ====
-40000 check rbuf ====
rank 1 receive message with size 40
-0 check rbuf ====
-10000 check rbuf ====
-20000 check rbuf ====
-30000 check rbuf ====
-40000 check rbuf ====
rank 1 receive message with size 40
-0 check rbuf ====
-10000 check rbuf ====
-20000 check rbuf ====
-30000 check rbuf ====
-40000 check rbuf ====
rank 1 receive message with size 40
-0 check rbuf ====
-10000 check rbuf ====
-20000 check rbuf ====
-30000 check rbuf ====
-40000 check rbuf ====
rank 1 receive message with size 40
-0 check rbuf ====
-10000 check rbuf ====
-20000 check rbuf ====
-30000 check rbuf ====
-40000 check rbuf ====

Yes, I have tried your approach of using a temporary pointer, and it works!
Once again many thanks for your help!
