TAO5 - Detectnet_v2 - MultiGPU TAO API Stuck - EXTRA GPU

Please provide the following information when requesting support.

• Hardware: 2x RTX 6000 Ada, 1x RTX 4090
• Network Type (Detectnet_v2)
• TLT Version (TAO5-fix)

Continuing from the previous issue:

After getting the system working, we decided to reuse another GPU that became free from another project and install it in the workstation.

So the setup has now grown to 2x RTX 6000 Ada with 48 GB RAM each, plus an extra RTX 4090 with 24 GB RAM.

I’m using the fix to work with --use-amp and the visualizer.

I also ran internal tests with nccl-tests.

Results attached:

LOG
tkeic@azken:~/TAO/Installation$ kubectl apply -f tao-toolkit-debug.yaml 
pod/debug created
tkeic@azken:~/TAO/Installation$ kubectl exec -it debug -- /bin/bash
root@debug:/workspace# nvidia-smi
Tue Oct 31 07:58:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  On   | 00000000:21:00.0 Off |                  Off |
| 30%   38C    P8    40W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000...  On   | 00000000:22:00.0 Off |                  Off |
| 30%   43C    P8    29W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:43:00.0 Off |                    0 |
|  0%   33C    P8    43W / 450W |      1MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@debug:/workspace# git clone https://github.com/NVIDIA/nccl-tests.git
Cloning into 'nccl-tests'...
remote: Enumerating objects: 333, done.
remote: Counting objects: 100% (211/211), done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 333 (delta 181), reused 137 (delta 131), pack-reused 122
Receiving objects: 100% (333/333), 125.18 KiB | 791.00 KiB/s, done.
Resolving deltas: 100% (220/220), done.
root@debug:/workspace# cd nccl-tests/
root@debug:/workspace/nccl-tests# make -j64
make -C src build BUILDDIR=/workspace/nccl-tests/build
make[1]: Entering directory '/workspace/nccl-tests/src'
Compiling  timer.cc                            > /workspace/nccl-tests/build/timer.o
Compiling /workspace/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /workspace/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /workspace/nccl-tests/build/common.o
Compiling  all_gather.cu                       > /workspace/nccl-tests/build/all_gather.o
Compiling  broadcast.cu                        > /workspace/nccl-tests/build/broadcast.o
Compiling  reduce_scatter.cu                   > /workspace/nccl-tests/build/reduce_scatter.o
Compiling  reduce.cu                           > /workspace/nccl-tests/build/reduce.o
Compiling  alltoall.cu                         > /workspace/nccl-tests/build/alltoall.o
Compiling  scatter.cu                          > /workspace/nccl-tests/build/scatter.o
Compiling  gather.cu                           > /workspace/nccl-tests/build/gather.o
Compiling  sendrecv.cu                         > /workspace/nccl-tests/build/sendrecv.o
Compiling  hypercube.cu                        > /workspace/nccl-tests/build/hypercube.o
Linking  /workspace/nccl-tests/build/all_reduce.o > /workspace/nccl-tests/build/all_reduce_perf
Linking  /workspace/nccl-tests/build/all_gather.o > /workspace/nccl-tests/build/all_gather_perf
Linking  /workspace/nccl-tests/build/broadcast.o > /workspace/nccl-tests/build/broadcast_perf
Linking  /workspace/nccl-tests/build/reduce_scatter.o > /workspace/nccl-tests/build/reduce_scatter_perf
Linking  /workspace/nccl-tests/build/reduce.o > /workspace/nccl-tests/build/reduce_perf
Linking  /workspace/nccl-tests/build/alltoall.o > /workspace/nccl-tests/build/alltoall_perf
Linking  /workspace/nccl-tests/build/scatter.o > /workspace/nccl-tests/build/scatter_perf
Linking  /workspace/nccl-tests/build/gather.o > /workspace/nccl-tests/build/gather_perf
Linking  /workspace/nccl-tests/build/sendrecv.o > /workspace/nccl-tests/build/sendrecv_perf
Linking  /workspace/nccl-tests/build/hypercube.o > /workspace/nccl-tests/build/hypercube_perf
make[1]: Leaving directory '/workspace/nccl-tests/src'
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1154 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1154 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
#  Rank  2 Group  0 Pid   1154 on      debug device  2 [0x43] NVIDIA GeForce RTX 4090
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
^C^C^C

^C
root@debug:/workspace/nccl-tests# 
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1174 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1174 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     6.23    0.00    0.00      0     6.37    0.00    0.00      0
          16             4     float     sum      -1     6.44    0.00    0.00      0     6.28    0.00    0.00      0
          32             8     float     sum      -1     6.49    0.00    0.00      0     6.42    0.00    0.00      0
          64            16     float     sum      -1     6.43    0.01    0.01      0     6.29    0.01    0.01      0
         128            32     float     sum      -1     6.40    0.02    0.02      0     6.30    0.02    0.02      0
         256            64     float     sum      -1     6.39    0.04    0.04      0     6.25    0.04    0.04      0
         512           128     float     sum      -1     6.29    0.08    0.08      0     6.34    0.08    0.08      0
        1024           256     float     sum      -1     6.40    0.16    0.16      0     6.29    0.16    0.16      0
        2048           512     float     sum      -1     6.37    0.32    0.32      0     6.23    0.33    0.33      0
        4096          1024     float     sum      -1     6.37    0.64    0.64      0     6.41    0.64    0.64      0
        8192          2048     float     sum      -1     6.77    1.21    1.21      0     6.66    1.23    1.23      0
       16384          4096     float     sum      -1     7.48    2.19    2.19      0     7.34    2.23    2.23      0
       32768          8192     float     sum      -1     8.81    3.72    3.72      0    10.21    3.21    3.21      0
       65536         16384     float     sum      -1    12.09    5.42    5.42      0    11.96    5.48    5.48      0
      131072         32768     float     sum      -1    25.09    5.22    5.22      0    24.84    5.28    5.28      0
      262144         65536     float     sum      -1    38.15    6.87    6.87      0    37.98    6.90    6.90      0
      524288        131072     float     sum      -1    55.89    9.38    9.38      0    56.25    9.32    9.32      0
     1048576        262144     float     sum      -1    74.72   14.03   14.03      0    65.60   15.98   15.98      0
     2097152        524288     float     sum      -1    108.9   19.25   19.25      0    108.0   19.42   19.42      0
     4194304       1048576     float     sum      -1    198.2   21.17   21.17      0    197.6   21.23   21.23      0
     8388608       2097152     float     sum      -1    385.1   21.78   21.78      0    384.0   21.85   21.85      0
    16777216       4194304     float     sum      -1    760.4   22.06   22.06      0    759.0   22.10   22.10      0
    33554432       8388608     float     sum      -1   1507.3   22.26   22.26      0   1505.2   22.29   22.29      0
    67108864      16777216     float     sum      -1   3011.2   22.29   22.29      0   3002.9   22.35   22.35      0
   134217728      33554432     float     sum      -1   6000.8   22.37   22.37      0   5992.7   22.40   22.40      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 8.06157 
#

root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1189 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     3.55    0.00    0.00      0     0.17    0.05    0.00      0
          16             4     float     sum      -1     3.48    0.00    0.00      0     0.17    0.10    0.00      0
          32             8     float     sum      -1     3.50    0.01    0.00      0     0.16    0.20    0.00      0
          64            16     float     sum      -1     3.51    0.02    0.00      0     0.16    0.39    0.00      0
         128            32     float     sum      -1     3.46    0.04    0.00      0     0.16    0.80    0.00      0
         256            64     float     sum      -1     3.55    0.07    0.00      0     0.16    1.55    0.00      0
         512           128     float     sum      -1     3.47    0.15    0.00      0     0.17    3.07    0.00      0
        1024           256     float     sum      -1     3.50    0.29    0.00      0     0.16    6.29    0.00      0
        2048           512     float     sum      -1     3.43    0.60    0.00      0     0.16   12.62    0.00      0
        4096          1024     float     sum      -1     3.45    1.19    0.00      0     0.16   25.07    0.00      0
        8192          2048     float     sum      -1     3.40    2.41    0.00      0     0.17   47.95    0.00      0
       16384          4096     float     sum      -1     2.62    6.24    0.00      0     0.10  159.53    0.00      0
       32768          8192     float     sum      -1     2.55   12.87    0.00      0     0.10  322.20    0.00      0
       65536         16384     float     sum      -1     2.62   24.99    0.00      0     0.10  635.04    0.00      0
      131072         32768     float     sum      -1     2.61   50.31    0.00      0     0.11  1240.04    0.00      0
      262144         65536     float     sum      -1     2.63   99.85    0.00      0     0.11  2491.86    0.00      0
      524288        131072     float     sum      -1     2.71  193.16    0.00      0     0.10  5235.03    0.00      0
     1048576        262144     float     sum      -1     3.38  310.51    0.00      0     0.10  10623.87    0.00      0
     2097152        524288     float     sum      -1     4.36  481.34    0.00      0     0.10  21034.62    0.00      0
     4194304       1048576     float     sum      -1     7.13  588.45    0.00      0     0.10  42495.48    0.00      0
     8388608       2097152     float     sum      -1    19.28  435.16    0.00      0     0.10  86346.97    0.00      0
    16777216       4194304     float     sum      -1    38.28  438.27    0.00      0     0.11  151555.70    0.00      0
    33554432       8388608     float     sum      -1    80.34  417.65    0.00      0     0.11  311554.61    0.00      0
    67108864      16777216     float     sum      -1    168.3  398.77    0.00      0     0.10  650279.69    0.00      0
   134217728      33554432     float     sum      -1    337.4  397.83    0.00      0     0.11  1269798.75    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#
root@debug:/workspace/nccl-tests# export NCCL_P2P_LEVEL=NV
root@debug:/workspace/nccl-tests# export NCCL_DEBUG=TRACE
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1254 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1254 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
#  Rank  2 Group  0 Pid   1254 on      debug device  2 [0x43] NVIDIA GeForce RTX 4090
debug:1254:1254 [0] NCCL INFO Bootstrap : Using eth0:192.168.35.107<0>
debug:1254:1254 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
debug:1254:1254 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
debug:1254:1254 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
debug:1254:1254 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
debug:1254:1254 [2] NCCL INFO cudaDriverVersion 12000
NCCL version 2.16.5+cuda12.0
debug:1254:1265 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
debug:1254:1265 [0] NCCL INFO P2P plugin IBext
debug:1254:1265 [0] NCCL INFO NET/IB : No device found.
debug:1254:1265 [0] NCCL INFO NET/IB : No device found.
debug:1254:1265 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.35.107<0>
debug:1254:1265 [0] NCCL INFO Using network Socket
debug:1254:1267 [2] NCCL INFO Using network Socket
debug:1254:1266 [1] NCCL INFO Using network Socket
debug:1254:1265 [0] NCCL INFO Channel 00/04 :    0   1   2
debug:1254:1265 [0] NCCL INFO Channel 01/04 :    0   1   2
debug:1254:1267 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1
debug:1254:1267 [2] NCCL INFO P2P Chunksize set to 131072
debug:1254:1265 [0] NCCL INFO Channel 02/04 :    0   1   2
debug:1254:1265 [0] NCCL INFO Channel 03/04 :    0   1   2
debug:1254:1266 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
debug:1254:1266 [1] NCCL INFO P2P Chunksize set to 131072
debug:1254:1265 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
debug:1254:1265 [0] NCCL INFO P2P Chunksize set to 131072
debug:1254:1267 [2] NCCL INFO Channel 00/0 : 2[43000] -> 0[21000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 01/0 : 2[43000] -> 0[21000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 00/0 : 1[22000] -> 2[43000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 02/0 : 2[43000] -> 0[21000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 01/0 : 1[22000] -> 2[43000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 03/0 : 2[43000] -> 0[21000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 02/0 : 1[22000] -> 2[43000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 03/0 : 1[22000] -> 2[43000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Connected all rings
debug:1254:1266 [1] NCCL INFO Connected all rings
debug:1254:1267 [2] NCCL INFO Channel 00/0 : 2[43000] -> 1[22000] via P2P/direct pointer
debug:1254:1265 [0] NCCL INFO Connected all rings
debug:1254:1267 [2] NCCL INFO Channel 01/0 : 2[43000] -> 1[22000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 02/0 : 2[43000] -> 1[22000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Channel 03/0 : 2[43000] -> 1[22000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 00/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 01/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 02/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1254:1266 [1] NCCL INFO Channel 03/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:1254:1267 [2] NCCL INFO Connected all trees
debug:1254:1267 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:1254:1267 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:1254:1265 [0] NCCL INFO Connected all trees
debug:1254:1266 [1] NCCL INFO Connected all trees
debug:1254:1266 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:1254:1266 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:1254:1265 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:1254:1265 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:1254:1266 [1] NCCL INFO comm 0x5603badfc5f0 rank 1 nranks 3 cudaDev 1 busId 22000 commId 0xb5554b354dbb8590 - Init COMPLETE
debug:1254:1265 [0] NCCL INFO comm 0x5603badf7280 rank 0 nranks 3 cudaDev 0 busId 21000 commId 0xb5554b354dbb8590 - Init COMPLETE
debug:1254:1267 [2] NCCL INFO comm 0x5603bae003b0 rank 2 nranks 3 cudaDev 2 busId 43000 commId 0xb5554b354dbb8590 - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
^C


I can run the test with 2 GPUs, but with 3 it gets stuck.

I also reviewed the ACSCtl configuration:

LOG
tkeic@azken:~$ sudo lspci -vvv | grep ACSCtl
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
tkeic@azken:~$ dmesg | grep IOMMU
tkeic@azken:~$

Is it not possible to work with different GPU models?

Any suggestions so I can use the power of this GPU?

To narrow this down, could you please check if the above nccl-test command runs successfully without the debug pod?

How am I supposed to do that?
The TAO installer removes all traces of the NVIDIA drivers/CUDA from the host.

OK, please check in the debug pod; you can try setting other containers in the yaml file.
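For example, a minimal sketch of such a debug pod (the image tag and command are illustrative placeholders; swap in whichever container you want to test, e.g. a TensorRT image):

apiVersion: v1
kind: Pod
metadata:
  name: debug
spec:
  restartPolicy: Never
  containers:
  - name: debug
    image: nvcr.io/nvidia/tensorrt:23.10-py3   # placeholder tag, pick the container to test
    command: ["sleep", "infinity"]             # keep the pod alive for kubectl exec
    resources:
      limits:
        nvidia.com/gpu: 3                      # request all 3 GPUs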

BTW, make sure nvidia.com/gpu: 3 is set in the yaml file.

Also, you can print more info when running the nccl test:
export NCCL_DEBUG=INFO

See more in Troubleshooting — NCCL 2.19.3 documentation.
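The same guide also documents NCCL_DEBUG_SUBSYS for narrowing the log output to specific subsystems (a sketch; INIT, P2P, GRAPH, NET and ALL are among the accepted values):

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,P2P,GRAPH
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3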

You can also try running https://github.com/NVIDIA/nvbandwidth, which is mentioned in Troubleshooting — NCCL 2.19.3 documentation.
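A rough sketch of building and running it inside the pod (assuming CMake and Boost program_options are available in the container; see the repo README for the exact prerequisites and testcase names):

apt-get update && apt-get install -y cmake libboost-program-options-dev
git clone https://github.com/NVIDIA/nvbandwidth.git
cd nvbandwidth
cmake . && make
./nvbandwidth                                        # runs all testcases
./nvbandwidth -t device_to_device_memcpy_read_ce     # or a single testcase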

Same result as the first test, using the TensorRT container.
Attaching the log with more “INFO”:

LOG
tkeic@azken:~/TAO/Installation$ kubectl exec -it debug -- /bin/bash
root@debug:/workspace# export NCCL_DEBUG=INFO
root@debug:/workspace# nvidia-smi
Mon Nov  6 17:37:54 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  On   | 00000000:21:00.0 Off |                  Off |
| 30%   37C    P8    23W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000...  On   | 00000000:22:00.0 Off |                  Off |
| 30%   43C    P8    29W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:42:00.0 Off |                    0 |
|  0%   33C    P8    43W / 450W |      1MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@debug:/workspace# git clone https://github.com/NVIDIA/nccl-tests.git
Cloning into 'nccl-tests'...
remote: Enumerating objects: 333, done.
remote: Counting objects: 100% (211/211), done.
remote: Compressing objects: 100% (79/79), done.
remote: Total 333 (delta 181), reused 138 (delta 132), pack-reused 122
Receiving objects: 100% (333/333), 125.21 KiB | 678.00 KiB/s, done.
Resolving deltas: 100% (220/220), done.
root@debug:/workspace# cd nccl-tests/
root@debug:/workspace/nccl-tests# make -j64
make -C src build BUILDDIR=/workspace/nccl-tests/build
make[1]: Entering directory '/workspace/nccl-tests/src'
Compiling  timer.cc                            > /workspace/nccl-tests/build/timer.o
Compiling /workspace/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /workspace/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /workspace/nccl-tests/build/common.o
Compiling  all_gather.cu                       > /workspace/nccl-tests/build/all_gather.o
Compiling  broadcast.cu                        > /workspace/nccl-tests/build/broadcast.o
Compiling  reduce_scatter.cu                   > /workspace/nccl-tests/build/reduce_scatter.o
Compiling  reduce.cu                           > /workspace/nccl-tests/build/reduce.o
Compiling  alltoall.cu                         > /workspace/nccl-tests/build/alltoall.o
Compiling  scatter.cu                          > /workspace/nccl-tests/build/scatter.o
Compiling  gather.cu                           > /workspace/nccl-tests/build/gather.o
Compiling  sendrecv.cu                         > /workspace/nccl-tests/build/sendrecv.o
Compiling  hypercube.cu                        > /workspace/nccl-tests/build/hypercube.o
Linking  /workspace/nccl-tests/build/all_reduce.o > /workspace/nccl-tests/build/all_reduce_perf
Linking  /workspace/nccl-tests/build/all_gather.o > /workspace/nccl-tests/build/all_gather_perf
Linking  /workspace/nccl-tests/build/broadcast.o > /workspace/nccl-tests/build/broadcast_perf
Linking  /workspace/nccl-tests/build/reduce_scatter.o > /workspace/nccl-tests/build/reduce_scatter_perf
Linking  /workspace/nccl-tests/build/reduce.o > /workspace/nccl-tests/build/reduce_perf
Linking  /workspace/nccl-tests/build/alltoall.o > /workspace/nccl-tests/build/alltoall_perf
Linking  /workspace/nccl-tests/build/scatter.o > /workspace/nccl-tests/build/scatter_perf
Linking  /workspace/nccl-tests/build/gather.o > /workspace/nccl-tests/build/gather_perf
Linking  /workspace/nccl-tests/build/sendrecv.o > /workspace/nccl-tests/build/sendrecv_perf
Linking  /workspace/nccl-tests/build/hypercube.o > /workspace/nccl-tests/build/hypercube_perf
make[1]: Leaving directory '/workspace/nccl-tests/src'
root@debug:/workspace/nccl-tests# export NCCL_DEBUG=INFO
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    958 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid    958 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
#  Rank  2 Group  0 Pid    958 on      debug device  2 [0x42] NVIDIA GeForce RTX 4090
debug:958:958 [0] NCCL INFO Bootstrap : Using eth0:192.168.35.109<0>
debug:958:958 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
debug:958:958 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
debug:958:958 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
debug:958:958 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
debug:958:958 [2] NCCL INFO cudaDriverVersion 12000
NCCL version 2.16.5+cuda12.0
debug:958:969 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
debug:958:969 [0] NCCL INFO P2P plugin IBext
debug:958:969 [0] NCCL INFO NET/IB : No device found.
debug:958:969 [0] NCCL INFO NET/IB : No device found.
debug:958:969 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.35.109<0>
debug:958:969 [0] NCCL INFO Using network Socket
debug:958:971 [2] NCCL INFO Using network Socket
debug:958:970 [1] NCCL INFO Using network Socket
debug:958:971 [2] NCCL INFO Trees [0] -1/-1/-1->2->1 [1] -1/-1/-1->2->1 [2] -1/-1/-1->2->1 [3] -1/-1/-1->2->1
debug:958:970 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
debug:958:969 [0] NCCL INFO Channel 00/04 :    0   1   2
debug:958:969 [0] NCCL INFO Channel 01/04 :    0   1   2
debug:958:969 [0] NCCL INFO Channel 02/04 :    0   1   2
debug:958:969 [0] NCCL INFO Channel 03/04 :    0   1   2
debug:958:969 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
debug:958:969 [0] NCCL INFO P2P Chunksize set to 131072
debug:958:970 [1] NCCL INFO P2P Chunksize set to 131072
debug:958:971 [2] NCCL INFO P2P Chunksize set to 131072
debug:958:971 [2] NCCL INFO Channel 00/0 : 2[42000] -> 0[21000] via P2P/direct pointer
debug:958:969 [0] NCCL INFO Channel 00/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Channel 01/0 : 2[42000] -> 0[21000] via P2P/direct pointer
debug:958:969 [0] NCCL INFO Channel 01/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 00/0 : 1[22000] -> 2[42000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Channel 02/0 : 2[42000] -> 0[21000] via P2P/direct pointer
debug:958:969 [0] NCCL INFO Channel 02/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Channel 03/0 : 2[42000] -> 0[21000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 01/0 : 1[22000] -> 2[42000] via P2P/direct pointer
debug:958:969 [0] NCCL INFO Channel 03/0 : 0[21000] -> 1[22000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 02/0 : 1[22000] -> 2[42000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 03/0 : 1[22000] -> 2[42000] via P2P/direct pointer
debug:958:969 [0] NCCL INFO Connected all rings
debug:958:971 [2] NCCL INFO Connected all rings
debug:958:970 [1] NCCL INFO Connected all rings
debug:958:971 [2] NCCL INFO Channel 00/0 : 2[42000] -> 1[22000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Channel 01/0 : 2[42000] -> 1[22000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Channel 02/0 : 2[42000] -> 1[22000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Channel 03/0 : 2[42000] -> 1[22000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 00/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 01/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 02/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:958:970 [1] NCCL INFO Channel 03/0 : 1[22000] -> 0[21000] via P2P/direct pointer
debug:958:971 [2] NCCL INFO Connected all trees
debug:958:971 [2] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:958:971 [2] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:958:969 [0] NCCL INFO Connected all trees
debug:958:970 [1] NCCL INFO Connected all trees
debug:958:969 [0] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:958:969 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:958:970 [1] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512
debug:958:970 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
debug:958:971 [2] NCCL INFO comm 0x5640eae9c3b0 rank 2 nranks 3 cudaDev 2 busId 42000 commId 0xe03192219f67aa9d - Init COMPLETE
debug:958:969 [0] NCCL INFO comm 0x5640e74d9060 rank 0 nranks 3 cudaDev 0 busId 21000 commId 0xe03192219f67aa9d - Init COMPLETE
debug:958:970 [1] NCCL INFO comm 0x5640eb041220 rank 1 nranks 3 cudaDev 1 busId 22000 commId 0xe03192219f67aa9d - Init COMPLETE
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
^C^C
root@debug:/workspace/nccl-tests# 

After checking, this is a known limitation: peer-to-peer is disabled on the 4090. More info can be found in

For nccl-tests, the workaround is to run as below:
$ NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
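Independently of NCCL, the driver-side P2P status can also be checked with nvidia-smi (the -p2p option is only present in recent driver versions):

nvidia-smi topo -m        # PCIe topology between the GPUs
nvidia-smi topo -p2p r    # P2P read capability matrix per GPU pair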

I tried updating the driver to the latest 525 release, as suggested in the last posts of that discussion, and now I get a very beautiful “Bus error (core dumped)”.

Trying to save time on the training process, only to lose it all to configuration issues…

root@debug:/workspace# nvidia-smi  
Tue Nov  7 10:24:32 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  On   | 00000000:21:00.0 Off |                  Off |
| 30%   37C    P8    23W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000...  On   | 00000000:22:00.0 Off |                  Off |
| 30%   43C    P8    29W / 300W |      1MiB / 49140MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  On   | 00000000:42:00.0 Off |                    0 |
|  0%   33C    P8    45W / 450W |      1MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@debug:/workspace# export NCCL_DEBUG=DEBUG
root@debug:/workspace# git clone https://github.com/NVIDIA/nccl-tests.git
Cloning into 'nccl-tests'...
remote: Enumerating objects: 333, done.
remote: Counting objects: 100% (211/211), done.
remote: Compressing objects: 100% (79/79), done.
remote: Total 333 (delta 181), reused 138 (delta 132), pack-reused 122
Receiving objects: 100% (333/333), 125.18 KiB | 678.00 KiB/s, done.
Resolving deltas: 100% (220/220), done.
root@debug:/workspace# cd nccl-tests/
root@debug:/workspace/nccl-tests# make  
make -C src build BUILDDIR=/workspace/nccl-tests/build
make[1]: Entering directory '/workspace/nccl-tests/src'
Compiling  timer.cc                            > /workspace/nccl-tests/build/timer.o
Compiling /workspace/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /workspace/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /workspace/nccl-tests/build/common.o
Linking  /workspace/nccl-tests/build/all_reduce.o > /workspace/nccl-tests/build/all_reduce_perf
Compiling  all_gather.cu                       > /workspace/nccl-tests/build/all_gather.o
Linking  /workspace/nccl-tests/build/all_gather.o > /workspace/nccl-tests/build/all_gather_perf
Compiling  broadcast.cu                        > /workspace/nccl-tests/build/broadcast.o
Linking  /workspace/nccl-tests/build/broadcast.o > /workspace/nccl-tests/build/broadcast_perf
Compiling  reduce_scatter.cu                   > /workspace/nccl-tests/build/reduce_scatter.o
Linking  /workspace/nccl-tests/build/reduce_scatter.o > /workspace/nccl-tests/build/reduce_scatter_perf
Compiling  reduce.cu                           > /workspace/nccl-tests/build/reduce.o
Linking  /workspace/nccl-tests/build/reduce.o > /workspace/nccl-tests/build/reduce_perf
Compiling  alltoall.cu                         > /workspace/nccl-tests/build/alltoall.o
Linking  /workspace/nccl-tests/build/alltoall.o > /workspace/nccl-tests/build/alltoall_perf
Compiling  scatter.cu                          > /workspace/nccl-tests/build/scatter.o
Linking  /workspace/nccl-tests/build/scatter.o > /workspace/nccl-tests/build/scatter_perf
Compiling  gather.cu                           > /workspace/nccl-tests/build/gather.o
Linking  /workspace/nccl-tests/build/gather.o > /workspace/nccl-tests/build/gather_perf
Compiling  sendrecv.cu                         > /workspace/nccl-tests/build/sendrecv.o
Linking  /workspace/nccl-tests/build/sendrecv.o > /workspace/nccl-tests/build/sendrecv_perf
Compiling  hypercube.cu                        > /workspace/nccl-tests/build/hypercube.o
Linking  /workspace/nccl-tests/build/hypercube.o > /workspace/nccl-tests/build/hypercube_perf
make[1]: Leaving directory '/workspace/nccl-tests/src'
root@debug:/workspace/nccl-tests# NCCL_P2P_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1002 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1002 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
#  Rank  2 Group  0 Pid   1002 on      debug device  2 [0x42] NVIDIA GeForce RTX 4090
[debug:1002 :0:1015] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 3
# nThread 1 nGpus 3 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1019 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1019 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
#  Rank  2 Group  0 Pid   1019 on      debug device  2 [0x42] NVIDIA GeForce RTX 4090
[debug:1019 :0:1032] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)
root@debug:/workspace/nccl-tests# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   1036 on      debug device  0 [0x21] NVIDIA RTX 6000 Ada Generation
#  Rank  1 Group  0 Pid   1036 on      debug device  1 [0x22] NVIDIA RTX 6000 Ada Generation
[debug:1036 :0:1048] Caught signal 7 (Bus error: nonexistent physical address)
Bus error (core dumped)

Please check ACS and IOMMU via TAO5 - Detectnet_v2 - MultiGPU TAO API Stuck - #27 by Morganh.
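For reference, a quick way to recheck both (ACS flags ending in “-” mean disabled, which is what NCCL wants for P2P):

sudo lspci -vvv | grep ACSCtl             # every flag should end in '-'
sudo dmesg | grep -i -e DMAR -e IOMMU     # case-insensitive, in case the earlier grep missed it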

More, the peer-to-peer is not supported on 4090 according to Standard nVidia CUDA tests fail with dual RTX 4090 Linux box - #30 by abchauhan.

I had already rechecked these points before that:

sudo lspci -vvv | grep ACSCtl 
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
tkeic@azken:~$ sudo dmesg | grep IOMMU
tkeic@azken:~$

Also, your colleague mentions that with p2p disabled on this driver version (525.107.17), the RTX 4090 is able to train without issues.

No, it mentions that “CUDA sample tests will report that P2P is not supported”, not that they hang.
P2P is still disabled for the 4090.

I suggest you create a new topic in Linux - NVIDIA Developer Forums to request more info about nccl-tests failing on multi-GPU setups that include a 4090.

So I understand the only solution is to remove the RTX 4090 and throw it in the trash?

Disabling p2p, e.g. NCCL_P2P_DISABLE=1, could be a workaround.
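For an actual TAO training job (rather than a manually launched nccl-test), the variable has to reach the training container, e.g. through the pod spec. A sketch; the exact placement depends on how your TAO API deployment templates its pods:

env:
- name: NCCL_P2P_DISABLE
  value: "1"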


I can’t do any more tests; using this command with the suggested drivers gives the same core dump…

I unmounted the board and connected it to the old equipment. A shame I can’t use it…
You can close the post.

There has been no update from you for a period, so we are assuming this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks

Going through NCCL all_reduce_perf test hangs with multiple RTX 4090 GPUs, works fine when I swap in 2080tis · Issue #117 · NVIDIA/nccl-tests · GitHub, the workaround above can work for nccl-tests.

Sorry for the inconvenience. Maybe you can create a topic in Issues · NVIDIA/nccl-tests · GitHub and Linux - NVIDIA Developer Forums for more info.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.