When I use nvshmem4py with the UID bootstrap on a machine with 8 H800 GPUs, nvshmem.core.init takes about 18 s to run:
nvshmem.core.init(device=Device(torch.cuda.current_device()),
                  uid=broadcast_objects[0],
                  rank=rank_id,
                  nranks=num_ranks,
                  initializer_method="uid")
Why does it take so long?
Thank you for your inquiry. The short answer to your question is “it depends on a lot of system-specific configuration”.
Are you using the NVSHMEM PMI-X bootstrap transport? What environment variables are you setting, if any?
Could you get a trace with something like Nsight Systems and see which part of the NVSHMEM init is slow? Sometimes the GPU ends up getting initialized during the init process, which can take a long time. With a function-level breakdown it’s easier to see what’s going on.
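Before reaching for a full profiler trace, a coarse breakdown can already show whether the time goes into GPU/CUDA context creation or into the NVSHMEM init call itself. The following is only a sketch: the `timed` helper is a hypothetical name introduced here, and the commented-out usage assumes the same variables as the call in the question.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink):
    """Record the wall-clock duration of the enclosed block in sink[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[label] = time.perf_counter() - start


timings = {}

# Hypothetical usage, mirroring the call from the question:
#
# with timed("cuda_context", timings):
#     torch.zeros(1, device="cuda")   # forces CUDA context creation
#     torch.cuda.synchronize()
# with timed("nvshmem_init", timings):
#     nvshmem.core.init(device=Device(torch.cuda.current_device()),
#                       uid=broadcast_objects[0], rank=rank_id,
#                       nranks=num_ranks, initializer_method="uid")
# print(timings)
```

If `cuda_context` dominates, the delay is not specific to NVSHMEM; if `nvshmem_init` dominates, a profiler trace of that call is the next step.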
I’ve observed that NVSHMEM init takes longer than the init of other communication libraries (e.g. MPI). Nsys profiling of my use case shows that NVSHMEM internally calls NCCL init, which is itself very costly, whereas MPI does not. That alone is enough to explain why NVSHMEM init is slower than both MPI init and NCCL init. There is an NVSHMEM environment variable that disables NCCL use in NVSHMEM, for cases where it is not needed (e.g. you’re not doing collectives).
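As a sketch of how that could look from Python: the post above does not name the variable, so this assumes it is `NVSHMEM_DISABLE_NCCL` as listed in the NVSHMEM environment variable documentation; please verify the name and accepted values against the docs for your NVSHMEM version. The variable has to be set in every rank's process before NVSHMEM is initialized.

```python
import os

# Assumption: NVSHMEM_DISABLE_NCCL is the variable referred to above.
# Set it before NVSHMEM initializes (safest: before importing the bindings).
os.environ["NVSHMEM_DISABLE_NCCL"] = "1"

# Only afterwards import and initialize, e.g.:
#
# import nvshmem.core
# nvshmem.core.init(device=Device(torch.cuda.current_device()),
#                   uid=broadcast_objects[0], rank=rank_id,
#                   nranks=num_ranks, initializer_method="uid")
```

Exporting the variable in the launch script works as well and avoids any ordering concerns inside the Python process.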
When comparing NVSHMEM init time to the init time of other libraries, one must also look at the first data movement calls (e.g. the first call to MPI_Send), because some costly initialization may be deferred until first use. Nsys profiles of NVSHMEM, NCCL, and MPI revealed that both NCCL and MPI do costly initialization in the first call, while NVSHMEM did not.
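One way to make deferred initialization visible without a profiler is to time the first call of an operation separately from later calls. The sketch below is illustrative only: `first_vs_steady` and `LazyOp` are hypothetical names, and `LazyOp` merely simulates a library that does its expensive setup on first use.

```python
import statistics
import time


def first_vs_steady(fn, repeats=5):
    """Return (first_call_seconds, median_seconds_of_later_calls)."""
    start = time.perf_counter()
    fn()
    first = time.perf_counter() - start

    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        later.append(time.perf_counter() - start)
    return first, statistics.median(later)


class LazyOp:
    """Stand-in for a call that defers costly setup to its first use."""

    def __init__(self):
        self._ready = False

    def __call__(self):
        if not self._ready:
            time.sleep(0.05)  # simulated one-time initialization
            self._ready = True


if __name__ == "__main__":
    first, steady = first_vs_steady(LazyOp())
    print(f"first call: {first:.4f} s, later calls (median): {steady:.6f} s")
```

For a fair comparison across libraries, add the first-call time to the init time. When the timed operation launches GPU work, `fn` should also synchronize the device before returning, otherwise only the launch is measured.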
Note that my use case is point-to-point only. Your use case may be different, so it would be worth doing your own nsys profiles.
Thanks for your insight.
I switched NVSHMEM from the released prebuilt version to a self-compiled version, and the bootstrap now takes only 1-2 seconds.
I compiled NVSHMEM without NCCL, IBGDA, or MPI support.