MPI + OpenACC, process rank 0 consumes a lot of memory

Hi All,

I have been testing multi-GPU computation using MPI + OpenACC. My strategy is to assign each MPI process (CPU rank) a unique GPU device, and then transfer the arrays from each MPI process to its own GPU using DATA COPYIN. The memory usage should be equal across the 4 GPUs since I have done an even domain decomposition, but nvidia-smi shows that process 0 consumes about 4 times more memory, and that process 0 somehow appears 4 times.
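
Roughly, the binding and copy pattern looks like this (a minimal sketch with placeholder array names and sizes, not my actual code):

```fortran
program mpi_acc_bind
   use mpi
   use openacc
   implicit none
   integer :: ierr, rank, nranks, ngpus, dev, i
   integer, parameter :: n = 1000000
   real(8), allocatable :: a(:)
   real(8) :: s

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nranks, ierr)

   ! One GPU per rank (assumes no more ranks per node than GPUs per node)
   ngpus = acc_get_num_devices(acc_device_nvidia)
   dev   = mod(rank, ngpus)
   call acc_set_device_num(dev, acc_device_nvidia)

   allocate(a(n))
   a = real(rank, 8)

   ! Each rank copies only its own sub-domain to its own GPU
   s = 0.0d0
   !$acc data copyin(a)
   !$acc parallel loop reduction(+:s)
   do i = 1, n
      s = s + a(i)
   end do
   !$acc end data

   print *, 'rank', rank, 'device', dev, 'sum', s
   deallocate(a)
   call MPI_Finalize(ierr)
end program mpi_acc_bind
```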

Is this normal, or did I miss something? The whole program runs fine without bugs; I just want to lower the memory consumption of process 0.

Thanks!

Hi HydroHLLCV,

When the MPI processes start up, each one creates a context on the default device (device 0). Then, after setting a different device, another context is created, but the first one remains. Since every rank initially touches device 0, that is why nvidia-smi shows four processes and the extra memory on that GPU.

In my experience, a context takes about 450MB, but here the initialization seems to be around 1203MB. So you might have something else being allocated on the device before the device setting is done, though I’m not sure.

The way to solve the extra context problem is to use a shell script that wraps the application in the mpirun command and sets the environment variable “CUDA_VISIBLE_DEVICES” to the device id for each rank.
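
For example, something along these lines (an untested sketch assuming Open MPI, which exports OMPI_COMM_WORLD_LOCAL_RANK; other MPI implementations use a different local-rank variable):

```bash
#!/bin/bash
# Each rank only sees the GPU matching its node-local rank, so both the
# start-up context and the compute context land on that single device.
# OMPI_COMM_WORLD_LOCAL_RANK is Open MPI specific.
export CUDA_VISIBLE_DEVICES=${OMPI_COMM_WORLD_LOCAL_RANK}
exec "$@"
```

Then launch with something like “mpirun -np 4 ./wrapper.sh ./myapp”. Since each rank then sees only one device, it uses that device by default and the explicit device selection in the code is no longer needed.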

-Mat