Why nvshmem init takes so long

houqi1993 · July 16, 2025, 7:32am

I'm using nvshmem4py with the UID bootstrap on a machine with 8 H800 GPUs, and nvshmem.core.init takes about 18 s to run:

    nvshmem.core.init(device=Device(torch.cuda.current_device()), 
                      uid=broadcast_objects[0], 
                      rank=rank_id,
                      nranks=num_ranks, 
                      initializer_method="uid")

Why does it take so long?

benjaming1 · July 16, 2025, 5:47pm

Thank you for your inquiry. The short answer to your question is “it depends on a lot of system-specific configuration”.

Are you using the NVSHMEM PMI-X bootstrap transport? What environment variables are you setting, if any?

houqi1993 · July 16, 2025, 11:59pm

Are you using the NVSHMEM PMI-X bootstrap transport?

No, I’m using UID.

What environment variables are you setting, if any?

NVSHMEM_SYMMETRIC_SIZE=1000000000
NVSHMEM_DISABLE_CUDA_VMM=1
NVSHMEM_BOOTSTRAP=UID
NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
NVSHMEM_BOOTSTRAP_UID_SOCK_FAMILY

benjaming1 · July 17, 2025, 7:03pm

Could you get a trace with something like Nsight Systems and see which part of the nvshmem init is slow? Sometimes the GPU ends up being initialized during the init process, which can take a long time. With a function-level breakdown it's easier to see what's going on.
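As a coarse first pass before a full trace, the two phases can be timed separately from Python. This is only a sketch: `phase`, `timings`, and `timed_init` are hypothetical names, the `cuda.core.experimental` import path for `Device` is an assumption, and the `nvshmem.core.init` call simply mirrors the one in the original post.

```python
import time
from contextlib import contextmanager

timings = {}  # phase name -> elapsed wall-clock seconds


@contextmanager
def phase(name, sink=timings):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[name] = time.perf_counter() - start


def timed_init(uid, rank_id, num_ranks):
    """Time CUDA context creation and NVSHMEM init as two separate phases."""
    # Local imports so the timer above can be exercised without a GPU.
    import torch
    import nvshmem.core
    from cuda.core.experimental import Device  # assumed import path

    with phase("cuda_context"):
        torch.cuda.init()
        torch.cuda.synchronize()  # make sure the context exists before NVSHMEM runs

    with phase("nvshmem_init"):
        nvshmem.core.init(device=Device(torch.cuda.current_device()),
                          uid=uid,
                          rank=rank_id,
                          nranks=num_ranks,
                          initializer_method="uid")
    return dict(timings)
```

If most of the 18 s lands in `cuda_context`, the cost is GPU initialization rather than NVSHMEM itself; if it lands in `nvshmem_init`, a function-level Nsight Systems trace is the next step.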

bloring1 · November 21, 2025, 5:44pm

I’ve observed that nvshmem init takes longer than the init of other communication libraries (e.g. MPI). Nsys profiling of my use case shows that nvshmem internally calls nccl init, which is itself very costly, whereas MPI does not. That alone is enough to explain why nvshmem init is slower than both MPI init and nccl init. There is an nvshmem environment variable that disables nccl use in nvshmem if it is not needed (e.g. you’re not doing collectives).
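A minimal sketch of setting that variable from Python, assuming the variable meant here is `NVSHMEM_DISABLE_NCCL` (check the environment-variable list for your NVSHMEM version); `disable_nccl` is a hypothetical helper name. The variable has to be set before nvshmem.core.init runs.

```python
import os


def disable_nccl(env=os.environ):
    """Ask NVSHMEM not to use NCCL; must run before nvshmem init."""
    # Assumption: NVSHMEM_DISABLE_NCCL is the variable referred to above.
    env["NVSHMEM_DISABLE_NCCL"] = "1"
    return env
```

Calling `disable_nccl()` at the top of the script, before any nvshmem import that triggers initialization, is the intended usage. Whether it shortens init on a given system is something only a profile can confirm.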

When comparing nvshmem init time to that of other libraries, one must also look at the first data movement call (e.g. the first call to MPI_Send), because some costly initialization may be deferred until first use. Nsys profiles of nvshmem, nccl, and mpi revealed that both nccl and mpi do costly initialization in the first call, while nvshmem did not.

Note that my use case is point-to-point only. Your use case may be different, so it would be worth doing your own nsys profiles.
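One way to make the comparison fair is to report init time and first-call time together for each library. The helper below is a library-agnostic sketch; `startup_cost`, `init_fn`, and `first_op_fn` are hypothetical names, and the callables you pass in would wrap the actual nvshmem, nccl, or MPI calls.

```python
import time


def startup_cost(init_fn, first_op_fn):
    """Time library init and the first data-movement call separately.

    init_fn() returns whatever handle the library needs; first_op_fn(handle)
    performs one transfer. For GPU work, first_op_fn should synchronize
    before returning so the measured time covers the whole operation.
    """
    t0 = time.perf_counter()
    handle = init_fn()
    t1 = time.perf_counter()
    first_op_fn(handle)
    t2 = time.perf_counter()
    return {"init": t1 - t0, "first_op": t2 - t1, "total": t2 - t0}
```

Comparing the `total` field across libraries avoids crediting a library for work it merely defers to the first call.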

houqi1993 · November 23, 2025, 11:18pm

Thanks for your insight.

I switched from the released prebuilt nvshmem to a self-compiled build, and now it takes only 1-2 seconds to bootstrap.

I compiled nvshmem without NCCL and IBGDA and MPI.
