TAO5 - Detectnet_v2 - MultiGPU TAO API Stuck - EXTRA GPU

Please provide the following information when requesting support.

• Hardware: 2x RTX 6000 Ada, 1x RTX 4090
• Network Type (Detectnet_v2)
• TLT Version (TAO5-fix)

Continuing from the previous issue:

After getting the system working, we decided to reuse another GPU that became free from another project and install it in the workstation.

So the setup has now grown to 2x RTX 6000 Ada with 48 GB RAM each, plus an extra RTX 4090 with 24 GB RAM.

I’m using the fix to work with --use-amp and the visualizer.

I also ran internal tests with nccl-tests.

Results attached:

LOG
tkeic@azken:~/TAO/Installation$ kubectl apply -f tao-toolkit-debug.yaml 
pod/debug created
tkeic@azken:~/TAO/Installation$ kubectl exec -it debug -- /bin/bash
root@debug:/workspace# nvidia-smi
Tue Oct 31 07:58:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  On   | 00000000:21:00.0 Off |                  Off |
| 30%   38C    P8    40W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000...  On   | 00000000:22:00.0 Off |                  Off |
| 30%   43C    P8    29W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:43:00.0 Off |                    0 |
|  0%   33C    P8    43W / 450W |      1MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@debug:/workspace# git clone https://github.com/NVIDIA/nccl-tests.git
Cloning into 'nccl-tests'...
remote: Enumerating objects: 333, done.
remote: Counting objects: 100% (211/211), done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 333 (delta 181), reused 137 (delta 131), pack-reused 122
Receiving objects: 100% (333/333), 125.18 KiB | 791.00 KiB/s, done.
Resolving deltas: 100% (220/220), done.
root@debug:/workspace# cd nccl-tests/
root@debug:/workspace/nccl-tests# make -j64
make -C src build BUILDDIR=/workspace/nccl-tests/build
make[1]: Entering directory '/workspace/nccl-tests/src'
Compiling  timer.cc                            > /workspace/nccl-tests/build/timer.o
Compiling /workspace/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /workspace/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /workspace/nccl-tests/build/common.o
Compiling  all_gather.cu                       > /workspace/nccl-tests/build/all_gather.o
Compiling  broadcast.cu                        > /workspace/nccl-tests/build/broadcast.o
Compiling  reduce_scatter.cu                   > /workspace/nccl-tests/build/reduce_scatter.o
Compiling  reduce.cu                           > /workspace/nccl-tests/build/reduce.o
Compiling  alltoall.cu                         > /workspace/nccl-tests/build/alltoall.o
Compiling  scatter.cu                          > /workspace/nccl-tests/build/scatter.o
Compiling  gather.cu                           > /workspace/nccl-tests/build/gather.o
Compiling  sendrecv.cu                         > /workspace/nccl-tests/build/sendrecv.o
Compiling  hypercube.cu                        > /workspace/nccl-tests/build/hypercube.o
Linking  /workspace/nccl-tests/build/all_reduce.o > /workspace/nccl-tests/build/all_reduce_perf
Linking  /workspace/nccl-tests/build/all_gather.o > /workspace/nccl-tests/build/all_gather_perf
Linking  /workspace/nccl-tests/build/broadcast.o > /workspace/nccl-tests/build/broadcast_perf
Linking  /workspace/nccl-tests/build/reduce_scatter.o > /workspace/nccl-tests/build/reduce_scatter_perf
Linking  /workspace/nccl-tests/build/reduce.o > /workspace/nccl-tests/build/reduce_perf
Linking  /workspace/nccl-tests/build/alltoall.o > /workspace/nccl-tests/build/alltoall_perf
Linking  /workspace/nccl-tests/build/scatter.o > /workspace/nccl-tests/build/scatter_perf
Linking  /workspace/nccl-tests/build/gather.o > /workspace/nccl-tests/build/gather_perf
Linking  /workspace/nccl-tests/build/sendrecv.o > /workspace/nccl-tests/build/sendrecv_perf
Linking  /workspace/nccl-tests/build/hypercube.o > /workspace/nccl-tests/build/hypercube_perf
make[1]: Leaving directory '/workspace/nccl-tests/src'
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1154 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1154 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
#  Rank  2 Group  0 Pid   1154 on      debug device  2 [0x43] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
^C^C^C

^C
root@debug:/workspace/nccl-tests# 
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1174 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1174 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     6.23    0.00    0.00      0     6.37    0.00    0.00      0
          16             4     float     sum      -1     6.44    0.00    0.00      0     6.28    0.00    0.00      0
          32             8     float     sum      -1     6.49    0.00    0.00      0     6.42    0.00    0.00      0
          64            16     float     sum      -1     6.43    0.01    0.01      0     6.29    0.01    0.01      0
         128            32     float     sum      -1     6.40    0.02    0.02      0     6.30    0.02    0.02      0
         256            64     float     sum      -1     6.39    0.04    0.04      0     6.25    0.04    0.04      0
         512           128     float     sum      -1     6.29    0.08    0.08      0     6.34    0.08    0.08      0
        1024           256     float     sum      -1     6.40    0.16    0.16      0     6.29    0.16    0.16      0
        2048           512     float     sum      -1     6.37    0.32    0.32      0     6.23    0.33    0.33      0
        4096          1024     float     sum      -1     6.37    0.64    0.64      0     6.41    0.64    0.64      0
        8192          2048     float     sum      -1     6.77    1.21    1.21      0     6.66    1.23    1.23      0
       16384          4096     float     sum      -1     7.48    2.19    2.19      0     7.34    2.23    2.23      0
       32768          8192     float     sum      -1     8.81    3.72    3.72      0    10.21    3.21    3.21      0
       65536         16384     float     sum      -1    12.09    5.42    5.42      0    11.96    5.48    5.48      0
      131072         32768     float     sum      -1    25.09    5.22    5.22      0    24.84    5.28    5.28      0
      262144         65536     float     sum      -1    38.15    6.87    6.87      0    37.98    6.90    6.90      0
      524288        131072     float     sum      -1    55.89    9.38    9.38      0    56.25    9.32    9.32      0
     1048576        262144     float     sum      -1    74.72   14.03   14.03      0    65.60   15.98   15.98      0
     2097152        524288     float     sum      -1    108.9   19.25   19.25      0    108.0   19.42   19.42      0
     4194304       1048576     float     sum      -1    198.2   21.17   21.17      0    197.6   21.23   21.23      0
     8388608       2097152     float     sum      -1    385.1   21.78   21.78      0    384.0   21.85   21.85      0
    16777216       4194304     float     sum      -1    760.4   22.06   22.06      0    759.0   22.10   22.10      0
    33554432       8388608     float     sum      -1   1507.3   22.26   22.26      0   1505.2   22.29   22.29      0
    67108864      16777216     float     sum      -1   3011.2   22.29   22.29      0   3002.9   22.35   22.35      0
   134217728      33554432     float     sum      -1   6000.8   22.37   22.37      0   5992.7   22.40   22.40      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 8.06157 
#

root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1189 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     3.55    0.00    0.00      0     0.17    0.05    0.00      0
          16             4     float     sum      -1     3.48    0.00    0.00      0     0.17    0.10    0.00      0
          32             8     float     sum      -1     3.50    0.01    0.00      0     0.16    0.20    0.00      0
          64            16     float     sum      -1     3.51    0.02    0.00      0     0.16    0.39    0.00      0
         128            32     float     sum      -1     3.46    0.04    0.00      0     0.16    0.80    0.00      0
         256            64     float     sum      -1     3.55    0.07    0.00      0     0.16    1.55    0.00      0
         512           128     float     sum      -1     3.47    0.15    0.00      0     0.17    3.07    0.00      0
        1024           256     float     sum      -1     3.50    0.29    0.00      0     0.16    6.29    0.00      0
        2048           512     float     sum      -1     3.43    0.60    0.00      0     0.16   12.62    0.00      0
        4096          1024     float     sum      -1     3.45    1.19    0.00      0     0.16   25.07    0.00      0
        8192          2048     float     sum      -1     3.40    2.41    0.00      0     0.17   47.95    0.00      0
       16384          4096     float     sum      -1     2.62    6.24    0.00      0     0.10  159.53    0.00      0
       32768          8192     float     sum      -1     2.55   12.87    0.00      0     0.10  322.20    0.00      0
       65536         16384     float     sum      -1     2.62   24.99    0.00      0     0.10  635.04    0.00      0
      131072         32768     float     sum      -1     2.61   50.31    0.00      0     0.11  1240.04    0.00      0
      262144         65536     float     sum      -1     2.63   99.85    0.00      0     0.11  2491.86    0.00      0
      524288        131072     float     sum      -1     2.71  193.16    0.00      0     0.10  5235.03    0.00      0
     1048576        262144     float     sum      -1     3.38  310.51    0.00      0     0.10  10623.87    0.00      0
     2097152        524288     float     sum      -1     4.36  481.34    0.00      0     0.10  21034.62    0.00      0
     4194304       1048576     float     sum      -1     7.13  588.45    0.00      0     0.10  42495.48    0.00      0
     8388608       2097152     float     sum      -1    19.28  435.16    0.00      0     0.10  86346.97    0.00      0
    16777216       4194304     float     sum      -1    38.28  438.27    0.00      0     0.11  151555.70    0.00      0
    33554432       8388608     float     sum      -1    80.34  417.65    0.00      0     0.11  311554.61    0.00      0
    67108864      16777216     float     sum      -1    168.3  398.77    0.00      0     0.10  650279.69    0.00      0
   134217728      33554432     float     sum      -1    337.4  397.83    0.00      0     0.11  1269798.75    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#
root@debug:/workspace/nccl-tests# export NCCL_P2P_LEVEL=NV
root@debug:/workspace/nccl-tests# export NCCL_DEBUG=TRACE
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1254 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1254 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
#  Rank  2 Group  0 Pid   1254 on      debug device  2 [0x43] NVIDIA GeForce RTX 4090
debug:1254:1254 [0] NCCL INFO Bootstrap : Using eth0:192.168.35.107<0>
debug:1254:1254 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
debug:1254:1254 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
debug:1254:1254 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
debug:1254:1254 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
debug:1254:1254 [2] NCCL INFO cudaDriverVersion 12000
NCCL version 2.16.5+cuda12.0
debug:1254:1265 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
debug:1254:1265 [0] NCCL INFO P2P plugin IBext
debug:1254:1265 [0] NCCL INFO NET/IB : No device found.
debug:1254:1265 [0] NCCL INFO NET/IB : No device found.
debug:1254:1265 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.35.107<0>
debug:1254:1265 [0] NCCL INFO Using network Socket
debug:1254:1267 [2] NCCL INFO Using network Socket
debug:1254:1266 [1] NCCL INFO Using network Socket
debug:1254:1265 [0] NCCL INFO Channel 00/04 :    0   1   2
debug:1254:1265 [0] NCCL INFO Channel 01/04 :    0   1   2
debug:1254:1267 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1
debug:1254:1267 [2] NCCL INFO P2P Chunksize set to 131072
debug:1254:1265 [0] NCCL INFO Channel 02/04 :    0   1   2
debug:1254:1265 [0] NCCL INFO Channel 03/04 :    0   1   2
debug:1254:1266 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
debug:1254:1266 [1] NCCL INFO P2P Chunksize set to 131072
debug:1254:1265 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
debug:1254:1265 [0] NCCL INFO P2P Chunksize set to 131072
debug:1254:1267 [2] NCCL INFO Channel 00/0 : 2[43000] -> 0[21000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 01/0 : 2[43000] -> 0[21000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 00/0 : 1[22000] -> 2[43000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 02/0 : 2[43000] -> 0[21000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 01/0 : 1[22000] -> 2[43000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 03/0 : 2[43000] -> 0[21000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 02/0 : 1[22000] -> 2[43000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 03/0 : 1[22000] -> 2[43000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Connected all rings
debug:1254:1266 [1] NCCL INFO Connected all rings
debug:1254:1267 [2] NCCL INFO Channel 00/0 : 2[43000] -> 1[22000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Connected all rings
debug:1254:1267 [2] NCCL INFO Channel 01/0 : 2[43000] -> 1[22000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 02/0 : 2[43000] -> 1[22000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 03/0 : 2[43000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 00/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 01/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 02/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 03/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Connected all trees
debug:1254:1267 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:1254:1267 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:1254:1265 [0] NCCL INFO Connected all trees
debug:1254:1266 [1] NCCL INFO Connected all trees
debug:1254:1266 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:1254:1266 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:1254:1265 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:1254:1265 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:1254:1266 [1] NCCL INFO comm 0x5603badfc5f0 rank 1 nranks 3 cudaDev 1 busId 22000 commId 0xb5554b354dbb8590 - Init COMPLETE
debug:1254:1265 [0] NCCL INFO comm 0x5603badf7280 rank 0 nranks 3 cudaDev 0 busId 21000 commId 0xb5554b354dbb8590 - Init COMPLETE
debug:1254:1267 [2] NCCL INFO comm 0x5603bae003b0 rank 2 nranks 3 cudaDev 2 busId 43000 commId 0xb5554b354dbb8590 - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
^C


I can run the test with 2 GPUs, but with 3 it gets stuck.

I also reviewed the ACSCtl configuration:

LOG
tkeic@azken:~$ sudo lspci -vvv | grep ACSCtl
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
tkeic@azken:~$ dmesg | grep IOMMU
tkeic@azken:~$

Is it not possible to work with different GPU models?

Any suggestions so I can use the power of this GPU?

To narrow this down, could you please check if the above nccl-test command runs successfully without the debug pod?

How am I supposed to do that?
The TAO installer removes all traces of the NVIDIA drivers/CUDA from the host.

OK, please check in the debug pod; you can try setting other containers in the yaml file.
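For example, a minimal sketch of such a debug pod (the image tag and command are illustrative placeholders; swap in whichever container you want to test, e.g. a TensorRT image):

apiVersion: v1
kind: Pod
metadata:
  name: debug
spec:
  restartPolicy: Never
  containers:
  - name: debug
    image: nvcr.io/nvidia/tensorrt:23.10-py3   # placeholder tag, pick the container to test
    command: ["sleep", "infinity"]             # keep the pod alive for kubectl exec
    resources:
      limits:
        nvidia.com/gpu: 3                      # request all 3 GPUs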

BTW, make sure nvidia.com/gpu: 3 is set in the yaml file.

Also, you can print more info when running the nccl test:
export NCCL_DEBUG=INFO

See more in Troubleshooting — NCCL 2.19.3 documentation.
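The same guide also documents NCCL_DEBUG_SUBSYS for narrowing the log output to specific subsystems (a sketch; INIT, P2P, GRAPH, NET and ALL are among the accepted values):

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,P2P,GRAPH
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3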

You can also try running https://github.com/NVIDIA/nvbandwidth, which is mentioned in Troubleshooting — NCCL 2.19.3 documentation.
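A rough sketch of building and running it inside the pod (assuming CMake and Boost program_options are available in the container; see the repo README for the exact prerequisites and testcase names):

apt-get update && apt-get install -y cmake libboost-program-options-dev
git clone https://github.com/NVIDIA/nvbandwidth.git
cd nvbandwidth
cmake . && make
./nvbandwidth                                        # runs all testcases
./nvbandwidth -t device_to_device_memcpy_read_ce     # or a single testcase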

Same result as the first test, using the TensorRT container.
Attaching the log with more “INFO”:

LOG
tkeic@azken:~/TAO/Installation$ kubectl exec -it debug -- /bin/bash
root@debug:/workspace# export NCCL_DEBUG=INFO
root@debug:/workspace# nvidia-smi
Mon Nov  6 17:37:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  On   | 00000000:21:00.0 Off |                  Off |
| 30%   37C    P8    23W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000...  On   | 00000000:22:00.0 Off |                  Off |
| 30%   43C    P8    29W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:42:00.0 Off |                    0 |
|  0%   33C    P8    43W / 450W |      1MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@debug:/workspace# git clone https://github.com/NVIDIA/nccl-tests.git
Cloning into 'nccl-tests'...
remote: Enumerating objects: 333, done.
remote: Counting objects: 100% (211/211), done.
remote: Compressing objects: 100% (79/79), done.
remote: Total 333 (delta 181), reused 138 (delta 132), pack-reused 122
Receiving objects: 100% (333/333), 125.21 KiB | 678.00 KiB/s, done.
Resolving deltas: 100% (220/220), done.
root@debug:/workspace# cd nccl-tests/
root@debug:/workspace/nccl-tests# make -j64
make -C src build BUILDDIR=/workspace/nccl-tests/build
make[1]: Entering directory '/workspace/nccl-tests/src'
Compiling  timer.cc                            > /workspace/nccl-tests/build/timer.o
Compiling /workspace/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /workspace/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /workspace/nccl-tests/build/common.o
Compiling  all_gather.cu                       > /workspace/nccl-tests/build/all_gather.o
Compiling  broadcast.cu                        > /workspace/nccl-tests/build/broadcast.o
Compiling  reduce_scatter.cu                   > /workspace/nccl-tests/build/reduce_scatter.o
Compiling  reduce.cu                           > /workspace/nccl-tests/build/reduce.o
Compiling  alltoall.cu                         > /workspace/nccl-tests/build/alltoall.o
Compiling  scatter.cu                          > /workspace/nccl-tests/build/scatter.o
Compiling  gather.cu                           > /workspace/nccl-tests/build/gather.o
Compiling  sendrecv.cu                         > /workspace/nccl-tests/build/sendrecv.o
Compiling  hypercube.cu                        > /workspace/nccl-tests/build/hypercube.o
Linking  /workspace/nccl-tests/build/all_reduce.o > /workspace/nccl-tests/build/all_reduce_perf
Linking  /workspace/nccl-tests/build/all_gather.o > /workspace/nccl-tests/build/all_gather_perf
Linking  /workspace/nccl-tests/build/broadcast.o > /workspace/nccl-tests/build/broadcast_perf
Linking  /workspace/nccl-tests/build/reduce_scatter.o > /workspace/nccl-tests/build/reduce_scatter_perf
Linking  /workspace/nccl-tests/build/reduce.o > /workspace/nccl-tests/build/reduce_perf
Linking  /workspace/nccl-tests/build/alltoall.o > /workspace/nccl-tests/build/alltoall_perf
Linking  /workspace/nccl-tests/build/scatter.o > /workspace/nccl-tests/build/scatter_perf
Linking  /workspace/nccl-tests/build/gather.o > /workspace/nccl-tests/build/gather_perf
Linking  /workspace/nccl-tests/build/sendrecv.o > /workspace/nccl-tests/build/sendrecv_perf
Linking  /workspace/nccl-tests/build/hypercube.o > /workspace/nccl-tests/build/hypercube_perf
make[1]: Leaving directory '/workspace/nccl-tests/src'
root@debug:/workspace/nccl-tests# export NCCL_DEBUG=INFO
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    958 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid    958 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
#  Rank  2 Group  0 Pid    958 on      debug device  2 [0x42] NVIDIA GeForce RTX 4090
debug:958:958 [0] NCCL INFO Bootstrap : Using eth0:192.168.35.109<0>
debug:958:958 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
debug:958:958 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
debug:958:958 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
debug:958:958 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
debug:958:958 [2] NCCL INFO cudaDriverVersion 12000
NCCL version 2.16.5+cuda12.0
debug:958:969 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
debug:958:969 [0] NCCL INFO P2P plugin IBext
debug:958:969 [0] NCCL INFO NET/IB : No device found.
debug:958:969 [0] NCCL INFO NET/IB : No device found.
debug:958:969 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.35.109<0>
debug:958:969 [0] NCCL INFO Using network Socket
debug:958:971 [2] NCCL INFO Using network Socket
debug:958:970 [1] NCCL INFO Using network Socket
debug:958:971 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1
debug:958:970 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
debug:958:969 [0] NCCL INFO Channel 00/04 :    0   1   2
debug:958:969 [0] NCCL INFO Channel 01/04 :    0   1   2
debug:958:969 [0] NCCL INFO Channel 02/04 :    0   1   2
debug:958:969 [0] NCCL INFO Channel 03/04 :    0   1   2
debug:958:969 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
debug:958:969 [0] NCCL INFO P2P Chunksize set to 131072
debug:958:970 [1] NCCL INFO P2P Chunksize set to 131072
debug:958:971 [2] NCCL INFO P2P Chunksize set to 131072
debug:958:971 [2] NCCL INFO Channel 00/0 : 2[42000] -> 0[21000] via P2P/direct pointer
debug:958:969 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Channel 01/0 : 2[42000] -> 0[21000] via P2P/direct pointer
debug:958:969 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 00/0 : 1[22000] -> 2[42000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Channel 02/0 : 2[42000] -> 0[21000] via P2P/direct pointer
debug:958:969 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Channel 03/0 : 2[42000] -> 0[21000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 01/0 : 1[22000] -> 2[42000] via P2P/direct pointer
debug:958:969 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 02/0 : 1[22000] -> 2[42000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 03/0 : 1[22000] -> 2[42000] via P2P/direct pointer
debug:958:969 [0] NCCL INFO Connected all rings
debug:958:971 [2] NCCL INFO Connected all rings
debug:958:970 [1] NCCL INFO Connected all rings
debug:958:971 [2] NCCL INFO Channel 00/0 : 2[42000] -> 1[22000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Channel 01/0 : 2[42000] -> 1[22000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Channel 02/0 : 2[42000] -> 1[22000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Channel 03/0 : 2[42000] -> 1[22000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 00/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 01/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 02/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 03/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Connected all trees
debug:958:971 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:958:971 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:958:969 [0] NCCL INFO Connected all trees
debug:958:970 [1] NCCL INFO Connected all trees
debug:958:969 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:958:969 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:958:970 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:958:970 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:958:971 [2] NCCL INFO comm 0x5640eae9c3b0 rank 2 nranks 3 cudaDev 2 busId 42000 commId 0xe03192219f67aa9d - Init COMPLETE
debug:958:969 [0] NCCL INFO comm 0x5640e74d9060 rank 0 nranks 3 cudaDev 0 busId 21000 commId 0xe03192219f67aa9d - Init COMPLETE
debug:958:970 [1] NCCL INFO comm 0x5640eb041220 rank 1 nranks 3 cudaDev 1 busId 22000 commId 0xe03192219f67aa9d - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
^C^C
root@debug:/workspace/nccl-tests# 

After checking, this is a known limitation: peer-to-peer is disabled on the 4090. More info can be found in

For nccl-tests, the workaround is to run as below:
$ NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
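Independently of NCCL, the driver-side P2P status can also be checked with nvidia-smi (the -p2p option is only present in recent driver versions):

nvidia-smi topo -m        # PCIe topology between the GPUs
nvidia-smi topo -p2p r    # P2P read capability matrix per GPU pair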

I tried updating the driver to the latest 525 release, as suggested in the last posts of that discussion, and now I get a very beautiful “Bus error (core dumped)”.

Trying to save time on the training process, only to lose it all to configuration issues…

root@debug:/workspace# nvidia-smi  
Tue Nov  7 10:24:32 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  On   | 00000000:21:00.0 Off |                  Off |
| 30%   37C    P8    23W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000...  On   | 00000000:22:00.0 Off |                  Off |
| 30%   43C    P8    29W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:42:00.0 Off |                    0 |
|  0%   33C    P8    45W / 450W |      1MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@debug:/workspace# export NCCL_DEBUG=DEBUG
root@debug:/workspace# git clone https://github.com/NVIDIA/nccl-tests.git
Cloning into 'nccl-tests'...
remote: Enumerating objects: 333, done.
remote: Counting objects: 100% (211/211), done.
remote: Compressing objects: 100% (79/79), done.
remote: Total 333 (delta 181), reused 138 (delta 132), pack-reused 122
Receiving objects: 100% (333/333), 125.18 KiB | 678.00 KiB/s, done.
Resolving deltas: 100% (220/220), done.
root@debug:/workspace# cd nccl-tests/
root@debug:/workspace/nccl-tests# make  
make -C src build BUILDDIR=/workspace/nccl-tests/build
make[1]: Entering directory '/workspace/nccl-tests/src'
Compiling  timer.cc                            > /workspace/nccl-tests/build/timer.o
Compiling /workspace/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /workspace/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /workspace/nccl-tests/build/common.o
Linking  /workspace/nccl-tests/build/all_reduce.o > /workspace/nccl-tests/build/all_reduce_perf
Compiling  all_gather.cu                       > /workspace/nccl-tests/build/all_gather.o
Linking  /workspace/nccl-tests/build/all_gather.o > /workspace/nccl-tests/build/all_gather_perf
Compiling  broadcast.cu                        > /workspace/nccl-tests/build/broadcast.o
Linking  /workspace/nccl-tests/build/broadcast.o > /workspace/nccl-tests/build/broadcast_perf
Compiling  reduce_scatter.cu                   > /workspace/nccl-tests/build/reduce_scatter.o
Linking  /workspace/nccl-tests/build/reduce_scatter.o > /workspace/nccl-tests/build/reduce_scatter_perf
Compiling  reduce.cu                           > /workspace/nccl-tests/build/reduce.o
Linking  /workspace/nccl-tests/build/reduce.o > /workspace/nccl-tests/build/reduce_perf
Compiling  alltoall.cu                         > /workspace/nccl-tests/build/alltoall.o
Linking  /workspace/nccl-tests/build/alltoall.o > /workspace/nccl-tests/build/alltoall_perf
Compiling  scatter.cu                          > /workspace/nccl-tests/build/scatter.o
Linking  /workspace/nccl-tests/build/scatter.o > /workspace/nccl-tests/build/scatter_perf
Compiling  gather.cu                           > /workspace/nccl-tests/build/gather.o
Linking  /workspace/nccl-tests/build/gather.o > /workspace/nccl-tests/build/gather_perf
Compiling  sendrecv.cu                         > /workspace/nccl-tests/build/sendrecv.o
Linking  /workspace/nccl-tests/build/sendrecv.o > /workspace/nccl-tests/build/sendrecv_perf
Compiling  hypercube.cu                        > /workspace/nccl-tests/build/hypercube.o
Linking  /workspace/nccl-tests/build/hypercube.o > /workspace/nccl-tests/build/hypercube_perf
make[1]: Leaving directory '/workspace/nccl-tests/src'
root@debug:/workspace/nccl-tests# NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1002 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1002 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
#  Rank  2 Group  0 Pid   1002 on      debug device  2 [0x42] NVIDIA GeForce RTX 4090
[debug:1002 :0:1015] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1019 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1019 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
#  Rank  2 Group  0 Pid   1019 on      debug device  2 [0x42] NVIDIA GeForce RTX 4090
[debug:1019 :0:1032] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1036 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1036 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
[debug:1036 :0:1048] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)

Please check ACS and IOMMU via TAO5 - Detectnet_v2 - MultiGPU TAO API Stuck - #27 by Morganh.
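For reference, a quick way to recheck both (ACS flags ending in “-” mean disabled, which is what NCCL wants for P2P):

sudo lspci -vvv | grep ACSCtl             # every flag should end in '-'
sudo dmesg | grep -i -e DMAR -e IOMMU     # case-insensitive, in case the earlier grep missed it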

More, the peer-to-peer is not supported on 4090 according to Standard nVidia CUDA tests fail with dual RTX 4090 Linux box - #30 by abchauhan.

I had already rechecked these points before that:

sudo lspci -vvv | grep ACSCtl 
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
tkeic@azken:~$ sudo dmesg | grep IOMMU
tkeic@azken:~$

Also, your colleague mentions that with p2p disabled on this driver version (525.107.17), the RTX 4090 is able to train without issues.

No, it mentions that “CUDA sample tests will report that P2P is not supported”, not that they hang.
P2P is still disabled for the 4090.

I suggest you create a new topic in Linux - NVIDIA Developer Forums to request more info about nccl-tests failing on multi-GPU setups that include a 4090.

So I understand the only solution is to remove the RTX 4090 and throw it in the trash?

Disabling p2p, e.g. NCCL_P2P_DISABLE=1, could be a workaround.
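For an actual TAO training job (rather than a manually launched nccl-test), the variable has to reach the training container, e.g. through the pod spec. A sketch; the exact placement depends on how your TAO API deployment templates its pods:

env:
- name: NCCL_P2P_DISABLE
  value: "1"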


I can’t do any more tests; using this command with the suggested drivers gives the same core dump…

I unmounted the board and connected it to the old equipment. A shame I can’t use it…
You can close the post.

There has been no update from you for a period, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

Going through NCCL all_reduce_perf test hangs with multiple RTX 4090 GPUs, works fine when I swap in 2080tis · Issue #117 · NVIDIA/nccl-tests · GitHub, the workaround above can work for nccl-tests.

Sorry for the inconvenience. Maybe you can create a topic in Issues · NVIDIA/nccl-tests · GitHub and Linux - NVIDIA Developer Forums for more info.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.