Hi All,
I have been testing multi-GPU computation using MPI + OpenACC. My strategy is to assign each MPI process (CPU rank) a unique GPU device, and then transfer the arrays from each MPI process to its own GPU using `data copyin`. Since I did an even domain decomposition, memory usage should be equal across the 4 GPUs, but nvidia-smi shows that GPU 0 consumes about 4 times more memory, and that process 0 somehow appears 4 times in the listing.
Is this normal, or did I miss something? The whole program runs correctly without bugs; I just want to lower the memory consumption on GPU 0.
Thanks!