I have been having problems with code that generates rogue processes and produces the wrong answer when I use Unified Memory. I have included an example below; the first version is similar to my current code.
psi_multigpu_test_code_roughprocess.f (31.7 KB)
When I compile this for the GPU with Unified Memory and run it on 4 MPI ranks, it generates rogue processes on rank zero.
If I move my “!$acc set device” earlier in the code, the rogue processes are not generated. To my knowledge, no code runs on the GPU between MPI_Init and the “!$acc set device” in my first version. Below is my fixed code, which simply moves the “!$acc set device” earlier.
psi_multigpu_test_code_fixed.f (31.7 KB)
When I compile this version for the GPU with Unified Memory and run it on 4 MPI ranks, no rogue processes are generated on rank zero.
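For reference, this is the placement pattern I mean: a minimal sketch (not the attached code; the rank-to-device mapping via acc_set_device_num and mod(rank, ndev) is just an illustrative assumption), where the device is selected immediately after MPI_Init and before any other OpenACC activity:

```fortran
      program set_device_early
      use mpi
      use openacc
      implicit none
      integer :: ierr, rank, ndev, dev

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

!     Bind this rank to its GPU before any other device activity,
!     so no rank can implicitly create a context on device 0.
!     (Round-robin mapping here is an assumption for illustration.)
      ndev = acc_get_num_devices(acc_device_nvidia)
      dev  = mod(rank, ndev)
      call acc_set_device_num(dev, acc_device_nvidia)

!     ... rest of the program: OpenACC data and compute regions ...

      call MPI_Finalize(ierr)
      end program
```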
My question is: why does the first version of the code lead to rogue processes?
Thanks,
- Miko