I tried to follow the steps from these other posts:
But the test gets stuck as soon as it starts. I don't know how long it should take to finish. I have been waiting for more than 15 minutes with no progress and no results.
The GPUs are locked at 100% of their clock frequency.
I am also attaching the nvidia-smi output:
root@9ea20d6ac6f2:/workspace/nccl-tests# nvidia-smi
Fri May 26 11:20:48 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  Off  | 00000000:21:00.0 Off |                  Off |
| 32%   54C    P8    32W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000...  Off  | 00000000:22:00.0 Off |                  Off |
| 37%   58C    P8    41W / 300W |      0MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
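To check whether the GPUs really stay pinned at their maximum clock while the test appears hung, their state can be sampled from a second shell. The sketch below is only an illustration: the helper names `parse_gpu_rows` and `sample_gpus` are made up for this post, and it assumes `nvidia-smi` is on the PATH inside the container and accepts these `--query-gpu` fields.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi. The helper names below are hypothetical.
FIELDS = ["index", "pstate", "utilization.gpu", "clocks.sm", "clocks.max.sm"]


def parse_gpu_rows(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        values = [v.strip() for v in record]
        if len(values) != len(FIELDS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(FIELDS, values)))
    return rows


def sample_gpus():
    """Run nvidia-smi once and return one dict per GPU."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_rows(out)


# Example (needs a machine with the NVIDIA driver installed):
#     for gpu in sample_gpus():
#         print(gpu)
```

If `clocks.sm` equals `clocks.max.sm` and `utilization.gpu` reads 100 on both devices while no result rows are printed, that would be consistent with the kernels spinning rather than making progress.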
Here is the log:
root@9ea20d6ac6f2:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1034 on 9ea20d6ac6f2 device 0 [0x21] NVIDIA RTX 6000 Ada Generation
# Rank 1 Group 0 Pid 1034 on 9ea20d6ac6f2 device 1 [0x22] NVIDIA RTX 6000 Ada Generation
9ea20d6ac6f2:1034:1034 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
9ea20d6ac6f2:1034:1034 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
9ea20d6ac6f2:1034:1034 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
9ea20d6ac6f2:1034:1034 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
9ea20d6ac6f2:1034:1034 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
9ea20d6ac6f2:1034:1034 [1] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
9ea20d6ac6f2:1034:1043 [0] NCCL INFO P2P plugin IBext
9ea20d6ac6f2:1034:1043 [0] NCCL INFO NET/IB : No device found.
9ea20d6ac6f2:1034:1043 [0] NCCL INFO NET/IB : No device found.
9ea20d6ac6f2:1034:1043 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Using network Socket
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Using network Socket
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 00/04 : 0 1
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 01/04 : 0 1
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 02/04 : 0 1
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 03/04 : 0 1
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Channel 00/0 : 1[22000] -> 0[21000] via P2P/direct pointer
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[22000] via P2P/direct pointer
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Channel 01/0 : 1[22000] -> 0[21000] via P2P/direct pointer
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[22000] via P2P/direct pointer
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Channel 02/0 : 1[22000] -> 0[21000] via P2P/direct pointer
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[22000] via P2P/direct pointer
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Channel 03/0 : 1[22000] -> 0[21000] via P2P/direct pointer
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[22000] via P2P/direct pointer
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Connected all rings
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Connected all rings
9ea20d6ac6f2:1034:1044 [1] NCCL INFO Connected all trees
9ea20d6ac6f2:1034:1044 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
9ea20d6ac6f2:1034:1044 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
9ea20d6ac6f2:1034:1043 [0] NCCL INFO Connected all trees
9ea20d6ac6f2:1034:1043 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
9ea20d6ac6f2:1034:1043 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
9ea20d6ac6f2:1034:1044 [1] NCCL INFO comm 0x556ce5af0be0 rank 1 nranks 2 cudaDev 1 busId 22000 - Init COMPLETE
9ea20d6ac6f2:1034:1043 [0] NCCL INFO comm 0x556ce5aee150 rank 0 nranks 2 cudaDev 0 busId 21000 - Init COMPLETE
#
#                                                             out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
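Since the log stops right after the results header, one way to tell a hang from a slow run is to launch the benchmark under a timeout and summarise which transport NCCL selected. This is a minimal sketch and not a fix: `summarize_transports` and `run_with_timeout` are hypothetical helper names, and the 120-second limit is an arbitrary assumption. `NCCL_DEBUG=INFO` is what produces the `NCCL INFO` lines. Passing `NCCL_P2P_DISABLE=1` through `extra_env` is one experiment that might show whether the P2P path is involved, but that is a guess rather than a diagnosis.

```python
import os
import re
import subprocess
from collections import Counter

# Matches connection lines such as:
#   ... NCCL INFO Channel 00/0 : 1[22000] -> 0[21000] via P2P/direct pointer
VIA_RE = re.compile(r"NCCL INFO Channel \S+ : \S+ -> \S+ via (.+?)\s*$")


def summarize_transports(log_text):
    """Count how many channel connections use each transport."""
    counts = Counter()
    for line in log_text.splitlines():
        match = VIA_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)


def run_with_timeout(cmd, extra_env=None, timeout_s=120):
    """Run cmd with NCCL_DEBUG=INFO; return (finished, captured stdout)."""
    env = dict(os.environ, NCCL_DEBUG="INFO")
    env.update(extra_env or {})
    try:
        proc = subprocess.run(
            cmd, env=env, capture_output=True, text=True, timeout=timeout_s
        )
        return True, proc.stdout
    except subprocess.TimeoutExpired as exc:
        # Partial output may arrive as bytes or None depending on the version.
        out = exc.stdout or ""
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return False, out


# Example (assumes the nccl-tests binary from the log above is built):
#     cmd = ["./build/all_reduce_perf", "-b", "8", "-e", "128M", "-f", "2", "-g", "2"]
#     finished, out = run_with_timeout(cmd, extra_env={"NCCL_P2P_DISABLE": "1"})
#     print(finished, summarize_transports(out))
```

Comparing the result with and without the extra environment variable would show whether the run only completes when the `P2P/direct pointer` transport is not used.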