Nsys profile with horovod leading to GPU stalling for multiple GPUs (A100)

I am using the nsys profiler to profile a script that uses Horovod for multi-GPU scaling. The script runs fine without nsys, but under the profiler it repeatedly emits the warning below and the GPUs stall. Here is a link to a post where someone hit a similar issue: multiple_communicators branch gets deadlock on Alltoall - githubmemory.

Details about the system:
Tensorflow: 2.4.1
PyTorch: 1.9.0
Horovod: 0.23.0
CUDA: 11.0
GPU: A100-SXM4-40GB

[2021-11-18 21:35:46.256559: W /tmp/pip-install-2ki5kaqo/horovod_610732e1ed6541bc8eb95d6e0fc227fd/horovod/common/stall_inspector.cc:107] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
Missing ranks:
0: [allreduce.noname.1]
2: [allreduce.noname.1]
(The same warning, with the same missing ranks, repeats every 60 seconds until the job is killed.)

The issue was resolved by launching with horovodrun -np <n_gpus> instead of mpirun -n <n_gpus> --bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH.
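For reference, a minimal sketch of the working invocation. The GPU count, output name, and script name (train.py) are placeholders, not taken from the original post; adjust them for your setup.

```shell
# Wrap the Horovod launcher with Nsight Systems instead of using mpirun.
# 4 is a placeholder GPU count; train.py is a placeholder script name.
nsys profile -o horovod_report horovodrun -np 4 python train.py
```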