I'm trying to test my network performance with sendrecv_perf from nccl-tests, but the process is killed without any apparent error. The output is below:
server@server150:~$ mpirun -np 2 --host 10.20.1.18:1,10.20.0.2:1 -x NCCL_DEBUG=INFO -x NCCL_ALGO=Ring -x NCCL_PROTO=Simple -x NCCL_NET=Socket -x NCCL_OOB_NET_ENABLE=1 -x NCCL_OOB_NET_IFNAME=ens6f0np0,ens3f0np0 -x NCCL_SOCKET_IFNAME=ens6f0np0,ens3f0np0 /root/nccl-tests-master/build/sendrecv_perf -b 1M -e 2M -i 1M
# nThread 1 nGpus 1 minBytes 1048576 maxBytes 2097152 step: 1048576(bytes) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 2367754 on server150 device 0 [0x31] NVIDIA A100-PCIE-40GB
# Rank 1 Group 0 Pid 335074 on server137 device 0 [0x31] NVIDIA A100 80GB PCIe
server150:2367754:2367754 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens6f0np0,ens3f0np0
server150:2367754:2367754 [0] NCCL INFO Bootstrap : Using ens6f0np0:10.20.1.18<0>
server150:2367754:2367754 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libsharp_coll.so.5: cannot open shared object file: No such file or directory
server150:2367754:2367754 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
server150:2367754:2367754 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.18.3+cuda12.1
server150:2367754:2367798 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens6f0np0,ens3f0np0
server150:2367754:2367798 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [1]mlx5_2:1/RoCE [RO]; OOB ens6f0np0:10.20.1.18<0>
server150:2367754:2367798 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens6f0np0,ens3f0np0
server150:2367754:2367798 [0] NCCL INFO NET/Socket : Using [0]ens6f0np0:10.20.1.18<0>
server150:2367754:2367798 [0] NCCL INFO Using network Socket
server137:335074:335074 [0] NCCL INFO cudaDriverVersion 12010
server137:335074:335074 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens6f0np0,ens3f0np0
server137:335074:335074 [0] NCCL INFO Bootstrap : Using ens3f0np0:10.20.0.2<0>
server137:335074:335074 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
server137:335074:335074 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
server137:335074:335081 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens6f0np0,ens3f0np0
server137:335074:335081 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [RO]; OOB ens3f0np0:10.20.0.2<0>
server137:335074:335081 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens6f0np0,ens3f0np0
server137:335074:335081 [0] NCCL INFO NET/Socket : Using [0]ens3f0np0:10.20.0.2<0>
server137:335074:335081 [0] NCCL INFO Using network Socket
server150:2367754:2367798 [0] NCCL INFO comm 0x56517ac89260 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 31000 commId 0xcb86d884eb454020 - Init START
server137:335074:335081 [0] NCCL INFO comm 0x55d252cf5fc0 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 31000 commId 0xcb86d884eb454020 - Init START
server137:335074:335081 [0] NCCL INFO Setting affinity for GPU 0 to 100000,00000001
server150:2367754:2367798 [0] NCCL INFO Setting affinity for GPU 0 to 100000,00000001
server150:2367754:2367798 [0] NCCL INFO Channel 00/02 : 0 1
server150:2367754:2367798 [0] NCCL INFO Channel 01/02 : 0 1
server150:2367754:2367798 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
server150:2367754:2367798 [0] NCCL INFO P2P Chunksize set to 131072
server137:335074:335081 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
server137:335074:335081 [0] NCCL INFO P2P Chunksize set to 131072
server150:2367754:2367798 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
server137:335074:335081 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [receive] via NET/Socket/0
server150:2367754:2367798 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [receive] via NET/Socket/0
server150:2367754:2367798 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Socket/0
server137:335074:335081 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [receive] via NET/Socket/0
server150:2367754:2367798 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/Socket/0
server137:335074:335081 [0] NCCL INFO Channel 00/0 : 1[0] -> 0[0] [send] via NET/Socket/0
server137:335074:335081 [0] NCCL INFO Channel 01/0 : 1[0] -> 0[0] [send] via NET/Socket/0
server137:335074:335081 [0] NCCL INFO Connected all rings
server137:335074:335081 [0] NCCL INFO Connected all trees
server137:335074:335081 [0] NCCL INFO NCCL_PROTO set by environment to Simple
server137:335074:335081 [0] NCCL INFO NCCL_ALGO set by environment to Ring
server137:335074:335081 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
server137:335074:335081 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
server150:2367754:2367798 [0] NCCL INFO Connected all rings
server150:2367754:2367798 [0] NCCL INFO Connected all trees
server150:2367754:2367798 [0] NCCL INFO NCCL_PROTO set by environment to Simple
server150:2367754:2367798 [0] NCCL INFO NCCL_ALGO set by environment to Ring
server150:2367754:2367798 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
server150:2367754:2367798 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
server137:335074:335081 [0] NCCL INFO comm 0x55d252cf5fc0 rank 1 nranks 2 cudaDev 0 nvmlDev 0 busId 31000 commId 0xcb86d884eb454020 - Init COMPLETE
server150:2367754:2367798 [0] NCCL INFO comm 0x56517ac89260 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 31000 commId 0xcb86d884eb454020 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
server150:2367754:2367805 [0] NCCL INFO Channel 00/1 : 1[0] -> 0[0] [receive] via NET/Socket/0/Shared
server137:335074:335086 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[0] [receive] via NET/Socket/0/Shared
server150:2367754:2367805 [0] NCCL INFO Channel 01/1 : 1[0] -> 0[0] [receive] via NET/Socket/0/Shared
server137:335074:335086 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[0] [receive] via NET/Socket/0/Shared
server137:335074:335086 [0] NCCL INFO Channel 00/1 : 1[0] -> 0[0] [send] via NET/Socket/0/Shared
server150:2367754:2367805 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[0] [send] via NET/Socket/0/Shared
server137:335074:335086 [0] NCCL INFO Channel 01/1 : 1[0] -> 0[0] [send] via NET/Socket/0/Shared
server150:2367754:2367805 [0] NCCL INFO Channel 01/1 : 0[0] -> 1[0] [send] via NET/Socket/0/Shared
GPU information from nvidia-smi on both servers is below:
server@server137:~$ nvidia-smi
Thu Feb 6 02:20:42 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe           Off| 00000000:31:00.0 Off |                    0 |
| N/A   61C    P0               82W / 300W|  77692MiB / 81920MiB |    100%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80GB PCIe           Off| 00000000:98:00.0 Off |                    0 |
| N/A   74C    P0              109W / 300W|  77106MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    324064      C                                             77104MiB |
|    0   N/A  N/A    335074      C   ...cl-tests-master/build/sendrecv_perf      578MiB |
|    1   N/A  N/A    329839      C                                             77096MiB |
+---------------------------------------------------------------------------------------+
server@server150:~$ nvidia-smi
Thu Feb 6 10:10:03 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off |   00000000:31:00.0 Off |                    0 |
| N/A   56C    P0              74W / 250W |      585MiB / 40960MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off |   00000000:98:00.0 Off |                    0 |
| N/A   47C    P0              39W / 250W |    34193MiB / 40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   2367754      C   ...cl-tests-master/build/sendrecv_perf        576MiB |
|    1   N/A  N/A     65518      C                                               34170MiB |
+-----------------------------------------------------------------------------------------+
Configuration Information
~/.openmpi/mca-params.conf
# btl_tcp_if_include=ens6f0np0
btl_tcp_if_include=ens3f0np0
plm_rsh_agent=/usr/bin/ssh
pml=ob1
btl=tcp,self
Environment variables
export PATH=/usr/local/openmpi/bin:/usr/local/cuda-12.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/openmpi/lib/:/usr/lib/x86_64-linux-gnu:/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH
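In the log above, server150 bootstraps on ens6f0np0 and server137 on ens3f0np0, so each host appears to have only one of the two interfaces listed in NCCL_SOCKET_IFNAME. A minimal sketch of how that could be confirmed on each host is below. The function name check_ifaces and the SYSFS_NET override are hypothetical names introduced for illustration only; the sketch assumes bash and that interfaces show up under /sys/class/net, as they normally do on Linux.

```shell
#!/usr/bin/env bash
# Report which interfaces from a comma-separated list exist on this host.
# check_ifaces and SYSFS_NET are hypothetical names, not part of NCCL or Open MPI.
# SYSFS_NET only exists so the lookup directory can be swapped out for testing.
check_ifaces() {
    local list="$1"
    local sysfs="${SYSFS_NET:-/sys/class/net}"
    local IFS=','   # split the list on commas, for this function only
    local name
    for name in $list; do
        if [ -e "$sysfs/$name" ]; then
            echo "$name present"
        else
            echo "$name missing"
        fi
    done
}

# Use the same list that is passed to mpirun via -x NCCL_SOCKET_IFNAME=...
check_ifaces "${NCCL_SOCKET_IFNAME:-ens6f0np0,ens3f0np0}"
```

Running this on both servers would show which interface name each host can actually use. It only checks that the interface exists, not that the two hosts can reach each other over it.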
System info
Static hostname: serverXX
Icon name: computer-server
Chassis: server
Operating System: Ubuntu 22.04 LTS
Kernel: Linux 5.15.0-25-generic
Architecture: x86-64
Is there anything I am missing?