Sendrecv_perf nccl-tests - The process needs to be terminated manually - Volatile GPU-Util: 100%

I’m trying to test my network performance with sendrecv_perf from nccl-tests, but the run never completes: no error is reported, and the process has to be terminated manually. Output below:

server@server150:~$ mpirun -np 2 --host 10.20.1.18:1,10.20.0.2:1 -x NCCL_DEBUG=INFO -x NCCL_ALGO=Ring -x NCCL_PROTO=Simple -x NCCL_NET=Socket -x NCCL_OOB_NET_ENABLE=1 -x NCCL_OOB_NET_IFNAME=ens6f0np0,ens3f0np0 -x NCCL_SOCKET_IFNAME=ens6f0np0,ens3f0np0 /root/nccl-tests-master/build/sendrecv_perf -b 1M -e 2M -i 1M
# nThread 1 nGpus 1 minBytes 1048576 maxBytes 2097152 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 2367754 on server150 device  0 [0x31] NVIDIA A100-PCIE-40GB
#  Rank  1 Group  0 Pid 335074 on server137 device  0 [0x31] NVIDIA A100 80GB PCIe
server150:2367754:2367754 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens6f0np0,ens3f0np0
server150:2367754:2367754 [0] NCCL INFO Bootstrap : Using ens6f0np0:10.20.1.18<0>
server150:2367754:2367754 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libsharp_coll.so.5: cannot open shared object file: No such file or directory
server150:2367754:2367754 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
server150:2367754:2367754 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.18.3+cuda12.1
server150:2367754:2367798 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens6f0np0,ens3f0np0
server150:2367754:2367798 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [RO]; OOB ens6f0np0:10.20.1.18<0>
server150:2367754:2367798 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens6f0np0,ens3f0np0
server150:2367754:2367798 [0] NCCL INFO NET/Socket : Using [0]ens6f0np0:10.20.1.18<0>
server150:2367754:2367798 [0] NCCL INFO Using network Socket
server137:335074:335074 [0] NCCL INFO cudaDriverVersion 12010
server137:335074:335074 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens6f0np0,ens3f0np0
server137:335074:335074 [0] NCCL INFO Bootstrap : Using ens3f0np0:10.20.0.2<0>
server137:335074:335074 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
server137:335074:335074 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
server137:335074:335081 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens6f0np0,ens3f0np0
server137:335074:335081 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [RO]; OOB ens3f0np0:10.20.0.2<0>
server137:335074:335081 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens6f0np0,ens3f0np0
server137:335074:335081 [0] NCCL INFO NET/Socket : Using [0]ens3f0np0:10.20.0.2<0>
server137:335074:335081 [0] NCCL INFO Using network Socket
server150:2367754:2367798 [0] NCCL INFO comm 0x56517ac89260 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 31000 commId 0xcb86d884eb454020 - Init START
server137:335074:335081 [0] NCCL INFO comm 0x55d252cf5fc0 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 31000 commId 0xcb86d884eb454020 - Init START
server137:335074:335081 [0] NCCL INFO Setting affinity for GPU 0 to 100000,00000001
server150:2367754:2367798 [0] NCCL INFO Setting affinity for GPU 0 to 100000,00000001
server150:2367754:2367798 [0] NCCL INFO Channel 00/02 :    0   1
server150:2367754:2367798 [0] NCCL INFO Channel 01/02 :    0   1
server150:2367754:2367798 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
server150:2367754:2367798 [0] NCCL INFO P2P Chunksize set to 131072
server137:335074:335081 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
server137:335074:335081 [0] NCCL INFO P2P Chunksize set to 131072
server150:2367754:2367798 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
server137:335074:335081 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/Socket/0
server150:2367754:2367798 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
server150:2367754:2367798 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Socket/0
server137:335074:335081 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/Socket/0
server150:2367754:2367798 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/Socket/0
server137:335074:335081 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/Socket/0
server137:335074:335081 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/Socket/0
server137:335074:335081 [0] NCCL INFO Connected all rings
server137:335074:335081 [0] NCCL INFO Connected all trees
server137:335074:335081 [0] NCCL INFO NCCL_PROTO set by environment to Simple
server137:335074:335081 [0] NCCL INFO NCCL_ALGO set by environment to Ring
server137:335074:335081 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
server137:335074:335081 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
server150:2367754:2367798 [0] NCCL INFO Connected all rings
server150:2367754:2367798 [0] NCCL INFO Connected all trees
server150:2367754:2367798 [0] NCCL INFO NCCL_PROTO set by environment to Simple
server150:2367754:2367798 [0] NCCL INFO NCCL_ALGO set by environment to Ring
server150:2367754:2367798 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
server150:2367754:2367798 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
server137:335074:335081 [0] NCCL INFO comm 0x55d252cf5fc0 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 31000 commId 0xcb86d884eb454020 - Init COMPLETE
server150:2367754:2367798 [0] NCCL INFO comm 0x56517ac89260 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 31000 commId 0xcb86d884eb454020 - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
server150:2367754:2367805 [0] NCCL INFO Channel 00/1 : 1[0] -> 0[0] [receive] via NET/Socket/0/Shared
server137:335074:335086 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[0] [receive] via NET/Socket/0/Shared
server150:2367754:2367805 [0] NCCL INFO Channel 01/1 : 1[0] -> 0[0] [receive] via NET/Socket/0/Shared
server137:335074:335086 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[0] [receive] via NET/Socket/0/Shared
server137:335074:335086 [0] NCCL INFO Channel 00/1 : 1[0] -> 0[0] [send] via NET/Socket/0/Shared
server150:2367754:2367805 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[0] [send] via NET/Socket/0/Shared
server137:335074:335086 [0] NCCL INFO Channel 01/1 : 1[0] -> 0[0] [send] via NET/Socket/0/Shared
server150:2367754:2367805 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[0] [send] via NET/Socket/0/Shared
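
To make the interface selection in the log above easier to compare, here is a minimal Python sketch that extracts the "NET/Socket : Using" lines and reports which interface and IP each host picked. The helper name socket_endpoints and the SAMPLE_LOG excerpt are my own for illustration and are not part of NCCL or nccl-tests; the regular expression only assumes the line layout seen in this particular log.

```python
import re

# Matches lines such as:
#   server150:2367754:2367798 [0] NCCL INFO NET/Socket : Using [0]ens6f0np0:10.20.1.18<0>
# The layout is assumed from the log in this post, not from any NCCL specification.
SOCKET_LINE = re.compile(
    r"^(?P<host>[^:\s]+):\d+:\d+ \[\d+\] NCCL INFO NET/Socket : Using "
    r"\[\d+\](?P<iface>[^:\s]+):(?P<ip>\d+\.\d+\.\d+\.\d+)<\d+>"
)


def socket_endpoints(log_text):
    """Return {host: (interface, ip)} for every NET/Socket line in the log text."""
    endpoints = {}
    for line in log_text.splitlines():
        match = SOCKET_LINE.match(line.strip())
        if match:
            endpoints[match.group("host")] = (match.group("iface"), match.group("ip"))
    return endpoints


# Three lines copied from the log above, used as a small self-contained input.
SAMPLE_LOG = """\
server150:2367754:2367798 [0] NCCL INFO NET/Socket : Using [0]ens6f0np0:10.20.1.18<0>
server150:2367754:2367798 [0] NCCL INFO Using network Socket
server137:335074:335081 [0] NCCL INFO NET/Socket : Using [0]ens3f0np0:10.20.0.2<0>
"""

if __name__ == "__main__":
    for host, (iface, ip) in socket_endpoints(SAMPLE_LOG).items():
        print(f"{host}: {iface} {ip}")
```

On the excerpt, this shows server150 using ens6f0np0 (10.20.1.18) and server137 using ens3f0np0 (10.20.0.2), so the two ranks are on different interface names and different addresses. Whether those addresses can actually reach each other is not something the log alone tells me.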

nvidia-smi output from both servers while sendrecv_perf is still running:

server@server137:~$ nvidia-smi
Thu Feb  6 02:20:42 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe           Off| 00000000:31:00.0 Off |                    0 |
| N/A   61C    P0               82W / 300W|  77692MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe           Off| 00000000:98:00.0 Off |                    0 |
| N/A   74C    P0              109W / 300W|  77106MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    324064      C                                             77104MiB |
|    0   N/A  N/A    335074      C   ...cl-tests-master/build/sendrecv_perf      578MiB |
|    1   N/A  N/A    329839      C                                             77096MiB |
+---------------------------------------------------------------------------------------+

server@server150:~$ nvidia-smi
Thu Feb  6 10:10:03 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off |   00000000:31:00.0 Off |                    0 |
| N/A   56C    P0             74W /  250W |     585MiB /  40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off |   00000000:98:00.0 Off |                    0 |
| N/A   47C    P0             39W /  250W |   34193MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2367754      C   ...cl-tests-master/build/sendrecv_perf        576MiB |
|    1   N/A  N/A     65518      C                                               34170MiB |
+-----------------------------------------------------------------------------------------+

Configuration Information

~/.openmpi/mca-params.conf
	# btl_tcp_if_include=ens6f0np0
	btl_tcp_if_include=ens3f0np0
	plm_rsh_agent=/usr/bin/ssh
	pml=ob1
	btl=tcp,self

export PATH=/usr/local/openmpi/bin:/usr/local/cuda-12.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi/lib/:/usr/lib/x86_64-linux-gnu:/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH

System info

 Static hostname: serverXX
       Icon name: computer-server
         Chassis: server
Operating System: Ubuntu 22.04 LTS                
          Kernel: Linux 5.15.0-25-generic
    Architecture: x86-64

Is there anything I am missing?