How to perform inter-GPU communication using NCCL2 across different hosts without MPI?

Hi all.

After reading the documentation multiple times, it is still not clear to me how inter-GPU communication with NCCL2 across different hosts works.

In the MPI-based example, MPI is used to broadcast the NCCL unique ID generated by rank 0 to all other processes (ranks).

However, how does the communication really happen?
For example, suppose I am using the socket transport: how is the connection between processes on different hosts created, and how are the IP addresses discovered?

Is it really possible to do inter-GPU communication across different hosts using NCCL2 without MPI?

Kind regards!

I found the same question asked here: https://devtalk.nvidia.com/default/topic/1032266. I believe the answer is yes. Please see my response there.