Please provide the following information when requesting support.
• Hardware: 2x RTX 6000 Ada Generation, 1x RTX 4090
• Network Type (Detectnet_v2)
• TLT Version (TAO5-fix)
Continuing from this issue:
Once the system was working, we decided to reuse another GPU that had become free from a different project and installed it in the workstation.
The setup is now 2x RTX 6000 Ada with 48 GB of memory each, plus an extra RTX 4090 with 24 GB.
I'm using the fix so that --use-amp and the visualizer work.
I also ran the NCCL tests (nccl-tests) inside the debug pod.
The results are attached:
LOG
tkeic@azken:~/TAO/Installation$ kubectl apply -f tao-toolkit-debug.yaml
pod/debug created
tkeic@azken:~/TAO/Installation$ kubectl exec -it debug -- /bin/bash
root@debug:/workspace# nvidia-smi
Tue Oct 31 07:58:16 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  On   | 00000000:21:00.0 Off |                  Off |
| 30%   38C    P8    40W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000...  On   | 00000000:22:00.0 Off |                  Off |
| 30%   43C    P8    29W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:43:00.0 Off |                    0 |
|  0%   33C    P8    43W / 450W |      1MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@debug:/workspace# git clone https://github.com/NVIDIA/nccl-tests.git
Cloning into 'nccl-tests'...
remote: Enumerating objects: 333, done.
remote: Counting objects: 100% (211/211), done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 333 (delta 181), reused 137 (delta 131), pack-reused 122
Receiving objects: 100% (333/333), 125.18 KiB | 791.00 KiB/s, done.
Resolving deltas: 100% (220/220), done.
root@debug:/workspace# cd nccl-tests/
root@debug:/workspace/nccl-tests# make -j64
make -C src build BUILDDIR=/workspace/nccl-tests/build
make[1]: Entering directory '/workspace/nccl-tests/src'
Compiling timer.cc > /workspace/nccl-tests/build/timer.o
Compiling /workspace/nccl-tests/build/verifiable/verifiable.o
Compiling all_reduce.cu > /workspace/nccl-tests/build/all_reduce.o
Compiling common.cu > /workspace/nccl-tests/build/common.o
Compiling all_gather.cu > /workspace/nccl-tests/build/all_gather.o
Compiling broadcast.cu > /workspace/nccl-tests/build/broadcast.o
Compiling reduce_scatter.cu > /workspace/nccl-tests/build/reduce_scatter.o
Compiling reduce.cu > /workspace/nccl-tests/build/reduce.o
Compiling alltoall.cu > /workspace/nccl-tests/build/alltoall.o
Compiling scatter.cu > /workspace/nccl-tests/build/scatter.o
Compiling gather.cu > /workspace/nccl-tests/build/gather.o
Compiling sendrecv.cu > /workspace/nccl-tests/build/sendrecv.o
Compiling hypercube.cu > /workspace/nccl-tests/build/hypercube.o
Linking /workspace/nccl-tests/build/all_reduce.o > /workspace/nccl-tests/build/all_reduce_perf
Linking /workspace/nccl-tests/build/all_gather.o > /workspace/nccl-tests/build/all_gather_perf
Linking /workspace/nccl-tests/build/broadcast.o > /workspace/nccl-tests/build/broadcast_perf
Linking /workspace/nccl-tests/build/reduce_scatter.o > /workspace/nccl-tests/build/reduce_scatter_perf
Linking /workspace/nccl-tests/build/reduce.o > /workspace/nccl-tests/build/reduce_perf
Linking /workspace/nccl-tests/build/alltoall.o > /workspace/nccl-tests/build/alltoall_perf
Linking /workspace/nccl-tests/build/scatter.o > /workspace/nccl-tests/build/scatter_perf
Linking /workspace/nccl-tests/build/gather.o > /workspace/nccl-tests/build/gather_perf
Linking /workspace/nccl-tests/build/sendrecv.o > /workspace/nccl-tests/build/sendrecv_perf
Linking /workspace/nccl-tests/build/hypercube.o > /workspace/nccl-tests/build/hypercube_perf
make[1]: Leaving directory '/workspace/nccl-tests/src'
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1154 on debug device 0 [0x21] NVIDIA RTX 6000 Ada Generation
# Rank 1 Group 0 Pid 1154 on debug device 1 [0x22] NVIDIA RTX 6000 Ada Generation
# Rank 2 Group 0 Pid 1154 on debug device 2 [0x43] NVIDIA GeForce RTX 4090
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
^C^[[A^C^C
^C
root@debug:/workspace/nccl-tests#
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1174 on debug device 0 [0x21] NVIDIA RTX 6000 Ada Generation
# Rank 1 Group 0 Pid 1174 on debug device 1 [0x22] NVIDIA RTX 6000 Ada Generation
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 6.23 0.00 0.00 0 6.37 0.00 0.00 0
16 4 float sum -1 6.44 0.00 0.00 0 6.28 0.00 0.00 0
32 8 float sum -1 6.49 0.00 0.00 0 6.42 0.00 0.00 0
64 16 float sum -1 6.43 0.01 0.01 0 6.29 0.01 0.01 0
128 32 float sum -1 6.40 0.02 0.02 0 6.30 0.02 0.02 0
256 64 float sum -1 6.39 0.04 0.04 0 6.25 0.04 0.04 0
512 128 float sum -1 6.29 0.08 0.08 0 6.34 0.08 0.08 0
1024 256 float sum -1 6.40 0.16 0.16 0 6.29 0.16 0.16 0
2048 512 float sum -1 6.37 0.32 0.32 0 6.23 0.33 0.33 0
4096 1024 float sum -1 6.37 0.64 0.64 0 6.41 0.64 0.64 0
8192 2048 float sum -1 6.77 1.21 1.21 0 6.66 1.23 1.23 0
16384 4096 float sum -1 7.48 2.19 2.19 0 7.34 2.23 2.23 0
32768 8192 float sum -1 8.81 3.72 3.72 0 10.21 3.21 3.21 0
65536 16384 float sum -1 12.09 5.42 5.42 0 11.96 5.48 5.48 0
131072 32768 float sum -1 25.09 5.22 5.22 0 24.84 5.28 5.28 0
262144 65536 float sum -1 38.15 6.87 6.87 0 37.98 6.90 6.90 0
524288 131072 float sum -1 55.89 9.38 9.38 0 56.25 9.32 9.32 0
1048576 262144 float sum -1 74.72 14.03 14.03 0 65.60 15.98 15.98 0
2097152 524288 float sum -1 108.9 19.25 19.25 0 108.0 19.42 19.42 0
4194304 1048576 float sum -1 198.2 21.17 21.17 0 197.6 21.23 21.23 0
8388608 2097152 float sum -1 385.1 21.78 21.78 0 384.0 21.85 21.85 0
16777216 4194304 float sum -1 760.4 22.06 22.06 0 759.0 22.10 22.10 0
33554432 8388608 float sum -1 1507.3 22.26 22.26 0 1505.2 22.29 22.29 0
67108864 16777216 float sum -1 3011.2 22.29 22.29 0 3002.9 22.35 22.35 0
134217728 33554432 float sum -1 6000.8 22.37 22.37 0 5992.7 22.40 22.40 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 8.06157
#
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1189 on debug device 0 [0x21] NVIDIA RTX 6000 Ada Generation
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 3.55 0.00 0.00 0 0.17 0.05 0.00 0
16 4 float sum -1 3.48 0.00 0.00 0 0.17 0.10 0.00 0
32 8 float sum -1 3.50 0.01 0.00 0 0.16 0.20 0.00 0
64 16 float sum -1 3.51 0.02 0.00 0 0.16 0.39 0.00 0
128 32 float sum -1 3.46 0.04 0.00 0 0.16 0.80 0.00 0
256 64 float sum -1 3.55 0.07 0.00 0 0.16 1.55 0.00 0
512 128 float sum -1 3.47 0.15 0.00 0 0.17 3.07 0.00 0
1024 256 float sum -1 3.50 0.29 0.00 0 0.16 6.29 0.00 0
2048 512 float sum -1 3.43 0.60 0.00 0 0.16 12.62 0.00 0
4096 1024 float sum -1 3.45 1.19 0.00 0 0.16 25.07 0.00 0
8192 2048 float sum -1 3.40 2.41 0.00 0 0.17 47.95 0.00 0
16384 4096 float sum -1 2.62 6.24 0.00 0 0.10 159.53 0.00 0
32768 8192 float sum -1 2.55 12.87 0.00 0 0.10 322.20 0.00 0
65536 16384 float sum -1 2.62 24.99 0.00 0 0.10 635.04 0.00 0
131072 32768 float sum -1 2.61 50.31 0.00 0 0.11 1240.04 0.00 0
262144 65536 float sum -1 2.63 99.85 0.00 0 0.11 2491.86 0.00 0
524288 131072 float sum -1 2.71 193.16 0.00 0 0.10 5235.03 0.00 0
1048576 262144 float sum -1 3.38 310.51 0.00 0 0.10 10623.87 0.00 0
2097152 524288 float sum -1 4.36 481.34 0.00 0 0.10 21034.62 0.00 0
4194304 1048576 float sum -1 7.13 588.45 0.00 0 0.10 42495.48 0.00 0
8388608 2097152 float sum -1 19.28 435.16 0.00 0 0.10 86346.97 0.00 0
16777216 4194304 float sum -1 38.28 438.27 0.00 0 0.11 151555.70 0.00 0
33554432 8388608 float sum -1 80.34 417.65 0.00 0 0.11 311554.61 0.00 0
67108864 16777216 float sum -1 168.3 398.77 0.00 0 0.10 650279.69 0.00 0
134217728 33554432 float sum -1 337.4 397.83 0.00 0 0.11 1269798.75 0.00 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0
#
root@debug:/workspace/nccl-tests# export NCCL_P2P_LEVEL=NV
root@debug:/workspace/nccl-tests# export NCCL_DEBUG=TRACE
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1254 on debug device 0 [0x21] NVIDIA RTX 6000 Ada Generation
# Rank 1 Group 0 Pid 1254 on debug device 1 [0x22] NVIDIA RTX 6000 Ada Generation
# Rank 2 Group 0 Pid 1254 on debug device 2 [0x43] NVIDIA GeForce RTX 4090
debug:1254:1254 [0] NCCL INFO Bootstrap : Using eth0:192.168.35.107<0>
debug:1254:1254 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
debug:1254:1254 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
debug:1254:1254 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
debug:1254:1254 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
debug:1254:1254 [2] NCCL INFO cudaDriverVersion 12000
NCCL version 2.16.5+cuda12.0
debug:1254:1265 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
debug:1254:1265 [0] NCCL INFO P2P plugin IBext
debug:1254:1265 [0] NCCL INFO NET/IB : No device found.
debug:1254:1265 [0] NCCL INFO NET/IB : No device found.
debug:1254:1265 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.35.107<0>
debug:1254:1265 [0] NCCL INFO Using network Socket
debug:1254:1267 [2] NCCL INFO Using network Socket
debug:1254:1266 [1] NCCL INFO Using network Socket
debug:1254:1265 [0] NCCL INFO Channel 00/04 : 0 1 2
debug:1254:1265 [0] NCCL INFO Channel 01/04 : 0 1 2
debug:1254:1267 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1
debug:1254:1267 [2] NCCL INFO P2P Chunksize set to 131072
debug:1254:1265 [0] NCCL INFO Channel 02/04 : 0 1 2
debug:1254:1265 [0] NCCL INFO Channel 03/04 : 0 1 2
debug:1254:1266 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
debug:1254:1266 [1] NCCL INFO P2P Chunksize set to 131072
debug:1254:1265 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
debug:1254:1265 [0] NCCL INFO P2P Chunksize set to 131072
debug:1254:1267 [2] NCCL INFO Channel 00/0 : 2[43000] -> 0[21000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 01/0 : 2[43000] -> 0[21000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 00/0 : 1[22000] -> 2[43000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 02/0 : 2[43000] -> 0[21000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 01/0 : 1[22000] -> 2[43000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 03/0 : 2[43000] -> 0[21000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 02/0 : 1[22000] -> 2[43000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 03/0 : 1[22000] -> 2[43000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Connected all rings
debug:1254:1266 [1] NCCL INFO Connected all rings
debug:1254:1267 [2] NCCL INFO Channel 00/0 : 2[43000] -> 1[22000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Connected all rings
debug:1254:1267 [2] NCCL INFO Channel 01/0 : 2[43000] -> 1[22000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 02/0 : 2[43000] -> 1[22000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 03/0 : 2[43000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 00/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 01/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 02/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 03/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Connected all trees
debug:1254:1267 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:1254:1267 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:1254:1265 [0] NCCL INFO Connected all trees
debug:1254:1266 [1] NCCL INFO Connected all trees
debug:1254:1266 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:1254:1266 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:1254:1265 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:1254:1265 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:1254:1266 [1] NCCL INFO comm 0x5603badfc5f0 rank 1 nranks 3 cudaDev 1 busId 22000 commId 0xb5554b354dbb8590 - Init COMPLETE
debug:1254:1265 [0] NCCL INFO comm 0x5603badf7280 rank 0 nranks 3 cudaDev 0 busId 21000 commId 0xb5554b354dbb8590 - Init COMPLETE
debug:1254:1267 [2] NCCL INFO comm 0x5603bae003b0 rank 2 nranks 3 cudaDev 2 busId 43000 commId 0xb5554b354dbb8590 - Init COMPLETE
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
^C
I can run the test with 2 GPUs, but with 3 GPUs it gets stuck.
I also reviewed the ACSCtl configuration:
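A possible way to narrow this down is to run the same 2-GPU test on every pair of devices, once with peer-to-peer enabled and once with it disabled, to see whether only the pairs that include the RTX 4090 hang. The script below is only a sketch of that idea: `NCCL_TEST_BIN`, `TIMEOUT_SECS` and `RUN` are hypothetical variables introduced here, while `CUDA_VISIBLE_DEVICES` and `NCCL_P2P_DISABLE` are the standard CUDA and NCCL environment variables. It prints the commands by default and runs them only when `RUN=1` is set.

```shell
#!/bin/sh
# Sketch: print (or run) a 2-GPU all_reduce_perf test for every GPU pair.
# Assumption: the nccl-tests binary was built at ./build/all_reduce_perf.

NCCL_TEST_BIN="${NCCL_TEST_BIN:-./build/all_reduce_perf}"  # hypothetical override
TIMEOUT_SECS="${TIMEOUT_SECS:-60}"                         # hypothetical hang limit

# Emit every unordered pair "i,j" of GPU indices 0..n-1, one per line.
gpu_pairs() {
  n="$1"
  i=0
  while [ "$i" -lt "$n" ]; do
    j=$((i + 1))
    while [ "$j" -lt "$n" ]; do
      echo "$i,$j"
      j=$((j + 1))
    done
    i=$((i + 1))
  done
}

# Build the command line for one pair.
# NCCL_P2P_DISABLE=1 asks NCCL not to use direct GPU-to-GPU transfers.
pair_command() {
  echo "CUDA_VISIBLE_DEVICES=$1 NCCL_P2P_DISABLE=$2 timeout $TIMEOUT_SECS $NCCL_TEST_BIN -b 8 -e 128M -f 2 -g 2"
}

# Dry run by default; set RUN=1 to execute each command.
run_pairs() {
  for pair in $(gpu_pairs "$1"); do
    for p2p in 0 1; do
      cmd="$(pair_command "$pair" "$p2p")"
      echo "$cmd"
      if [ "${RUN:-0}" = "1" ]; then
        # timeout exits with status 124 when the test had to be killed.
        sh -c "$cmd" || echo "# exit status $? for pair $pair (NCCL_P2P_DISABLE=$p2p)"
      fi
    done
  done
}

run_pairs 3
```

If only the pairs containing device 2 time out, and only with `NCCL_P2P_DISABLE=0`, that would point at peer-to-peer access involving the RTX 4090 rather than at NCCL in general.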
LOG
tkeic@azken:~$ sudo lspci -vvv | grep ACSCtl
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
tkeic@azken:~$ dmesg | grep IOMMU
tkeic@azken:~$
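In the lspci output above every ACS flag ends in `-`, which means the feature is disabled on all listed ports. To avoid checking that by eye, a small helper like the following could count the ACSCtl lines with at least one enabled flag. It is only a sketch; `count_acs_enabled` is a name made up for this post.

```shell
#!/bin/sh
# Sketch: count ACSCtl lines where at least one ACS flag is enabled.
# lspci marks an enabled flag with "+" and a disabled one with "-".
# Reads `lspci -vvv` output on stdin and prints the number of such lines.
count_acs_enabled() {
  # Keep only ACSCtl lines, then count those containing a literal "+".
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  grep 'ACSCtl:' | grep -cF -- '+' || true
}
```

Usage: `sudo lspci -vvv | count_acs_enabled`. A result of 0 matches what the log above shows.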
Is it not possible to work with different GPU models together?
Any suggestions on how to make use of this extra GPU?