Hi Feng,
A “partially present” error means that the memory already exists on the device but with a different size than what was originally allocated. While there are many causes, it often happens when a variable is not put in an “exit data” directive before being reallocated on the host.
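To illustrate, here's a minimal sketch of that pattern (the names buf, n, and m are placeholders, and it assumes the reallocated host array overlaps the old address range):

double *buf = new double[n];
#pragma acc enter data create(buf[0:n])
// ... buf is used on the device ...
delete[] buf;                             // missing: #pragma acc exit data delete(buf)
buf = new double[m];                      // host pointer reallocated with a different size
#pragma acc enter data create(buf[0:m])   // runtime reports buf as partially present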
The error I see in your code is at line 106:
kdt->build( nx,n, tmp );
delete[] stmp; stmp=NULL;
delete[] stmp1; stmp1=NULL;
#pragma acc exit data delete(stmp,stmp1)
You’re deleting the device data after deleting the host data and setting the pointers to NULL. Passing a null pointer to “exit data” makes it a no-op, so the next time through this section, “stmp” still has an associated device pointer, but with a different size. To fix this, move the “exit data” directive so it comes before the host-side deletes.
It might be OK if “stmp” and “stmp1” weren’t set to NULL, but it’s always best practice to delete the device data before the host data.
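In other words, line 106 should look something like:

#pragma acc exit data delete(stmp,stmp1)
delete[] stmp; stmp=NULL;
delete[] stmp1; stmp1=NULL;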
Now for simplicity, assume each component will use only one GPU. I have tested the code with two components, each component using one GPU, and it is working fine.
My guess is that it only works because “stmp” isn’t reallocated, but when using more ranks, it is.
Do I need to do something special when a GPU needs to “talk” to two other GPUs simultaneously that are in a different MPI_COMM_WORLD?
The error doesn’t have to do with the number of ranks, but rather the code path taken as you add more ranks. Hence, I’ll answer this more generally.
Each MPI rank is its own process, so it has its own memory space and a unique CUDA context. You can set up a shared memory segment that’s visible to multiple ranks and GPUs on the same node, but this is complex to set up and not something I’d recommend.
The more common approach is to use CUDA-aware MPI, so that MPI communication calls with device data transfer directly between devices (either on the same node or across nodes) rather than having to copy the data back to the host.
To use CUDA-aware MPI, you need an MPI that has it enabled (such as the OpenMPI we ship with the compilers) and to pass the device pointers to the MPI calls. In OpenACC, you’d wrap the calls in a “#pragma acc host_data use_device(<variable>)” region. “host_data” says to use the device pointer within the host code.
For example:
#pragma acc host_data use_device(topSndBuf, topRxBuf)
{
    MPI_Irecv ( topRxBuf, (15+8+3)*LSIZEX, MPI_DOUBLE, mpi_peer_top, tagDOWN, MPI_COMM_WORLD, &reqRxTopHalo );
    MPI_Isend ( topSndBuf, (15+8+3)*LSIZEX, MPI_DOUBLE, mpi_peer_top, tagUP, MPI_COMM_WORLD, &reqTxTopBorder );
}
Hope this helps,
Mat