Nsys profile with horovod leading to GPU stalling for multiple GPUs (A100)

I am using the nsys profiler to profile a script that uses Horovod for multi-GPU scaling. The script runs fine without nsys, but under the profiler it repeatedly emits the warning below and the GPUs stall. Here is a link to a post where someone hit a similar issue: multiple_communicators branch gets deadlock on Alltoall - githubmemory.

Details about the system:
Tensorflow: 2.4.1
PyTorch: 1.9.0
Horovod: 0.23.0
CUDA: 11.0
GPU: A100-SXM4-40GB

[2021-11-18 21:35:46.256559: W /tmp/pip-install-2ki5kaqo/horovod_610732e1ed6541bc8eb95d6e0fc227fd/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [allreduce.noname.1]
2: [allreduce.noname.1]
(The same warning, with the same missing ranks, repeats every 60 seconds until the job is killed.)

The issue was resolved by launching with horovodrun -np <n_gpus> instead of mpirun -n <n_gpus> --bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH.
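For reference, a minimal sketch of the working invocation. The GPU count, output name, and script name (train.py) are placeholders, not taken from the original post; adjust them for your setup.

```shell
# Wrap the Horovod launcher with Nsight Systems instead of using mpirun.
# 4 is a placeholder GPU count; train.py is a placeholder script name.
nsys profile -o horovod_report horovodrun -np 4 python train.py
```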