Hi Feng,
A “partially present” error means that the memory already exists on the device but with a different size than what was originally allocated. While there are many causes, it often happens when a variable is not put in an “exit data” directive before being reallocated on the host.
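To illustrate, here's a minimal sketch of that pattern (the names buf, n, and m are placeholders, and it assumes the reallocated host array overlaps the old address range):

double *buf = new double[n];
#pragma acc enter data create(buf[0:n])
// ... buf is used on the device ...
delete[] buf;                             // missing: #pragma acc exit data delete(buf)
buf = new double[m];                      // host pointer reallocated with a different size
#pragma acc enter data create(buf[0:m])   // runtime reports buf as partially present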
The error I see in your code is at line 106:
kdt->build( nx,n, tmp );
delete[] stmp; stmp=NULL;
delete[] stmp1; stmp1=NULL;
#pragma acc exit data delete(stmp,stmp1)
You’re deleting the device data after deleting the host data and setting the pointers to NULL. Passing a null pointer to “exit data” makes it a no-op, so the next time through this section, “stmp” still has an associated device pointer, but with a different size. To fix this, move the “exit data” directive so it comes before the host-side deletes.
It might be OK if “stmp” and “stmp1” weren’t set to NULL, but it’s always best practice to delete the device data before the host data.
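In other words, line 106 should look something like:

#pragma acc exit data delete(stmp,stmp1)
delete[] stmp; stmp=NULL;
delete[] stmp1; stmp1=NULL;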
Now for simplicity, assume each component will use only one GPU. I have tested the code with two components, each component using one GPU, and it is working fine.
My guess is that it only works because “stmp” isn’t reallocated, but when using more ranks, it is.
Do I need to do something special when a GPU needs to “talk” to two other GPUs simultaneously that are in a different MPI_COMM_WORLD?
The error doesn’t have to do with the number of ranks, but rather the code path taken as you add more ranks. Hence, I’ll answer this more generally.
Each MPI rank is its own process, so it has its own memory space and a unique CUDA context. You can set up a shared memory segment that’s visible to multiple ranks and GPUs on the same node, but this is complex to set up and not something I’d recommend.
The more common approach is to use CUDA-aware MPI, so that MPI communication calls with device data transfer directly between devices (either on the same node or across nodes) rather than having to copy the data back to the host.
To use CUDA-aware MPI, you need an MPI that has it enabled (such as the OpenMPI we ship with the compilers) and to pass the device pointers to the MPI calls. In OpenACC, you’d wrap the calls in a “#pragma acc host_data use_device(<variable>)” region. “host_data” says to use the device pointer within the host code.
For example:
#pragma acc host_data use_device(topSndBuf, topRxBuf)
{
    MPI_Irecv ( topRxBuf, (15+8+3)*LSIZEX, MPI_DOUBLE, mpi_peer_top, tagDOWN, MPI_COMM_WORLD, &reqRxTopHalo );
    MPI_Isend ( topSndBuf, (15+8+3)*LSIZEX, MPI_DOUBLE, mpi_peer_top, tagUP, MPI_COMM_WORLD, &reqTxTopBorder );
}
Hope this helps,
Mat