Thanks for the feedback on my talk. Regarding your question: yes, for most implementations that is outdated information; it is no longer necessary to set the device before calling MPI_Init. However, the device should still be selected before calling any MPI routine that requires a GPU context, and it must not be changed afterwards. Exactly which MPI routines require a GPU context depends on your MPI implementation. A mental model that works for me: every MPI communication routine that accepts a pointer or reference to the data being communicated needs a GPU context. Creating a new communicator, e.g. with MPI_Comm_split_type, or querying the size or rank of a communicator does not require a GPU context and can therefore be done before selecting a device.
However, I think the FAQ entry you are referring to is “10. What are some guidelines for using CUDA and Open MPI with Omni-Path?”, which is specific to Omni-Path. So if you are using the GPUDirect RDMA support of Omni-Path, it is probably still required to select a device before calling MPI_Init. I will try to get clarification on that.
> If not, how do we go about doing that when we determine the GPU using the local rank?
You can use getenv to query an environment variable that the launcher sets to the process's local rank on the node. All launchers I am aware of set such a variable: e.g. the Open MPI launcher sets OMPI_COMM_WORLD_LOCAL_RANK and the MVAPICH2 launcher sets MV2_COMM_WORLD_LOCAL_RANK.
I hope this helps.