ncclAllReduce failed: unhandled cuda error

We are currently testing the latest NVIDIA TensorFlow Docker container (21.04) with Horovod on some Tesla V100 GPUs, running the latest driver, 460.32.03. We use the experimental_compile option for some of the tf.functions in the training code, so the code is partially compiled with XLA.
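
For reference, the relevant part of the training code looks roughly like this (heavily simplified; the model, optimizer, and loss here are stand-ins for our real ones):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# Pin each Horovod rank to its own GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(0.01 * hvd.size())

@tf.function(experimental_compile=True)  # partial XLA compilation
def training_step(images, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    # Horovod's DistributedGradientTape averages the gradients across
    # ranks with ncclAllReduce, which is where the error below is raised.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss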

We get the following error when starting the training:

[1,0]:tensorflow.python.framework.errors_impl.UnknownError: ncclAllReduce failed: unhandled cuda error
[1,0]: [[{{node DistributedGradientTape_Allreduce/cond_452/then/_3643/DistributedGradientTape_Allreduce/cond_452/HorovodAllreduce_gradient_tape_model_Conv2D_Conv2DBackpropFilter_0}}]] [Op:__inference_training_step_82823]

Do you have any idea what could be going wrong? We did not replace any of the NVIDIA components (CUDA, cuDNN, NCCL, …) in the container.

Hi Erik,

We found some incompatibility issues between the driver and fabric-manager packages and have pushed an updated driver package to the repository. Could you check whether that fixes the issue?
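
On Ubuntu, for example, the upgrade should look something along these lines (the exact package names depend on your distribution and on how the driver was originally installed):

sudo apt-get update
sudo apt-get install --only-upgrade nvidia-driver-460 nvidia-fabricmanager-460
sudo reboot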

Hi,

Thanks for the quick answer. We tried upgrading the driver to version 460.73.01 (is this the version you suggested?), but now we get a segfault:

[1,2]:[6b9dcc145d7b:153 :0:639] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[1,3]:[6b9dcc145d7b:154 :0:640] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[1,1]:[6b9dcc145d7b:152 :0:641] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[1,1]:==== backtrace (tid: 641) ====
[1,1]: 0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7fb15a322d24]
[1,1]: 1 /usr/local/ucx/lib/libucs.so.0(+0x27eff) [0x7fb15a322eff]
[1,1]: 2 /usr/local/ucx/lib/libucs.so.0(+0x28234) [0x7fb15a323234]
[1,1]: 3 /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fb293e6b3c0]
[1,1]: 4 /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommAbort+0x66) [0x7fb19194f796]
[1,1]: 5 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x54) [0x7fb15d179b94]
[1,1]: 6 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x221) [0x7fb15d17a4a1]
[1,1]: 7 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d) [0x7fb15d14491d]
[1,1]: 8 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x48) [0x7fb15d144cf8]
[1,1]: 9 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x638ce) [0x7fb15d1208ce]
[1,3]:==== backtrace (tid: 640) ====
[1,3]: 0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7feb1cf17d24]
[1,3]: 1 /usr/local/ucx/lib/libucs.so.0(+0x27eff) [0x7feb1cf17eff]
[1,3]: 2 /usr/local/ucx/lib/libucs.so.0(+0x28234) [0x7feb1cf18234]
[1,3]: 3 /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fec56a613c0]
[1,3]: 4 /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommAbort+0x66) [0x7feb544dc796]
[1,3]: 5 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x54) [0x7feb1fd54b94]
[1,3]: 6 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x221) [0x7feb1fd554a1]
[1,3]: 7 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d) [0x7feb1fd1f91d]
[1,3]: 8 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x48) [0x7feb1fd1fcf8]
[1,3]: 9 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x638ce) [0x7feb1fcfb8ce]
[1,3]:10 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84) [0x7febedabbd84]
[1,3]:11 /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7fec56a55609]
[1,3]:12 /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fec56b91293]
[1,3]:=================================
[1,2]:==== backtrace (tid: 639) ====
[1,2]: 0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7fedd00c0d24]
[1,2]: 1 /usr/local/ucx/lib/libucs.so.0(+0x27eff) [0x7fedd00c0eff]
[1,2]: 2 /usr/local/ucx/lib/libucs.so.0(+0x28234) [0x7fedd00c1234]
[1,2]: 3 /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fef0b8093c0]
[1,2]: 4 /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommAbort+0x66) [0x7fee092ce796]
[1,2]: 5 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x54) [0x7fedd2b16b94]
[1,2]: 6 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x221) [0x7fedd2b174a1]
[1,2]: 7 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d) [0x7fedd2ae191d]
[1,2]: 8 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x48) [0x7fedd2ae1cf8]
[1,2]: 9 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x638ce) [0x7fedd2abd8ce]
[1,2]:10 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84) [0x7feea2863d84]
[1,2]:11 /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7fef0b7fd609]
[1,2]:12 /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fef0b939293]
[1,2]:=================================
[1,1]:10 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84) [0x7fb22aec5d84]
[1,1]:11 /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7fb293e5f609]
[1,1]:12 /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fb293f9b293]
[1,1]:=================================
[1,1]:[6b9dcc145d7b:00152] *** Process received signal ***
[1,3]:[6b9dcc145d7b:00154] *** Process received signal ***
[1,3]:[6b9dcc145d7b:00154] Signal: Segmentation fault (11)
[1,3]:[6b9dcc145d7b:00154] Signal code: (-6)
[1,3]:[6b9dcc145d7b:00154] Failing at address: 0x9a
[1,2]:[6b9dcc145d7b:00153] *** Process received signal ***
[1,2]:[6b9dcc145d7b:00153] Signal: Segmentation fault (11)
[1,2]:[6b9dcc145d7b:00153] Signal code: (-6)
[1,2]:[6b9dcc145d7b:00153] Failing at address: 0x99
[1,1]:[6b9dcc145d7b:00152] Signal: Segmentation fault (11)
[1,1]:[6b9dcc145d7b:00152] Signal code: (-6)
[1,1]:[6b9dcc145d7b:00152] Failing at address: 0x98
[1,1]:[6b9dcc145d7b:00152] [ 0] [1,3]:[6b9dcc145d7b:00154] [ 0] [1,2]:[6b9dcc145d7b:00153] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fef0b8093c0]
[1,2]:[6b9dcc145d7b:00153] [ 1] [1,1]:/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fb293e6b3c0]
[1,1]:[6b9dcc145d7b:00152] [ 1] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommAbort+0x66)[0x7fb19194f796]
[1,1]:[6b9dcc145d7b:00152] [ 2] [1,3]:/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fec56a613c0]
[1,3]:[6b9dcc145d7b:00154] [ 1] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommAbort+0x66)[0x7feb544dc796]
[1,3]:[6b9dcc145d7b:00154] [ 2] [1,2]:/usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommAbort+0x66)[0x7fee092ce796]
[1,2]:[6b9dcc145d7b:00153] [ 2] [1,1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x54)[0x7fb15d179b94]
[1,1]:[6b9dcc145d7b:00152] [ 3] [1,3]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x54)[0x7feb1fd54b94]
[1,3]:[6b9dcc145d7b:00154] [ 3] [1,2]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x54)[0x7fedd2b16b94]
[1,2]:[6b9dcc145d7b:00153] [ 3] /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x221)[0x7fedd2b174a1]
[1,2]:[6b9dcc145d7b:00153] [ 4] [1,1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x221)[0x7fb15d17a4a1]
[1,1]:[6b9dcc145d7b:00152] [ 4] [1,3]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x221)[0x7feb1fd554a1]
[1,3]:[6b9dcc145d7b:00154] [ 4] [1,1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7fb15d14491d]
[1,1]:[6b9dcc145d7b:00152] [ 5] [1,2]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7fedd2ae191d]
[1,2]:[6b9dcc145d7b:00153] [ 5] [1,3]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7feb1fd1f91d]
[1,3]:[6b9dcc145d7b:00154] [ 5] /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x48)[0x7feb1fd1fcf8]
[1,1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x48)[0x7fb15d144cf8]
[1,1]:[6b9dcc145d7b:00152] [ 6] [1,2]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x48)[0x7fedd2ae1cf8]
[1,2]:[6b9dcc145d7b:00153] [ 6] /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x638ce)[0x7fedd2abd8ce]
[1,3]:[6b9dcc145d7b:00154] [ 6] /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x638ce)[0x7feb1fcfb8ce]
[1,3]:[6b9dcc145d7b:00154] [1,1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x638ce)[0x7fb15d1208ce]
[1,1]:[6b9dcc145d7b:00152] [ 7] [1,2]:[6b9dcc145d7b:00153] [ 7] [1,3]:[ 7] [1,2]:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84)[0x7feea2863d84]
[1,2]:[6b9dcc145d7b:00153] [ 8] [1,1]:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84)[0x7fb22aec5d84]
[1,1]:[6b9dcc145d7b:00152] [ 8] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7fb293e5f609]
[1,1]:[6b9dcc145d7b:00152] [ 9] [1,3]:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84)[0x7febedabbd84]
[1,3]:[6b9dcc145d7b:00154] [ 8] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7fec56a55609]
[1,3]:[6b9dcc145d7b:00154] [ 9] [1,2]:/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7fef0b7fd609]
[1,2]:[6b9dcc145d7b:00153] [ 9] [1,2]:/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fef0b939293]
[1,2]:[6b9dcc145d7b:00153] *** End of error message ***
[1,1]:/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fb293f9b293]
[1,1]:[6b9dcc145d7b:00152] *** End of error message ***
[1,3]:/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fec56b91293]
[1,3]:[6b9dcc145d7b:00154] *** End of error message ***

Hi Erik,

Sorry, I missed that you were actually on the r460 driver. The fix I mentioned before was for r450. Anyway, it’s always better to have the latest driver installed.

I talked to our internal teams, and they asked if you could enable NCCL_DEBUG=info. The segfault is raised from ncclCommAbort, which Horovod appears to call after catching an NCCL error, so the crash is likely a secondary symptom; the debug output might give us more details about the underlying failure.
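
If you launch with Open MPI, one way to get the variable to every rank is to export it on the mpirun command line, for example (train.py standing in for your actual training script):

mpirun -np 4 -x NCCL_DEBUG=info python train.py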

Hi!

Here is the output with NCCL_DEBUG=info:

[1,0]:660382a73181:151:639 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[1,0]:660382a73181:151:639 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[1,0]:660382a73181:151:639 [0] NCCL INFO P2P plugin IBext
[1,0]:660382a73181:151:639 [0] NCCL INFO NET/IB : No device found.
[1,0]:660382a73181:151:639 [0] NCCL INFO NET/IB : No device found.
[1,0]:660382a73181:151:639 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[1,0]:660382a73181:151:639 [0] NCCL INFO Using network Socket
[1,0]:NCCL version 2.9.6+cuda11.3
[1,2]:660382a73181:153:642 [2] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[1,2]:660382a73181:153:642 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[1,2]:660382a73181:153:642 [2] NCCL INFO P2P plugin IBext
[1,2]:660382a73181:153:642 [2] NCCL INFO NET/IB : No device found.
[1,3]:660382a73181:154:641 [3] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[1,1]:660382a73181:152:640 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[1,2]:660382a73181:153:642 [2] NCCL INFO NET/IB : No device found.
[1,2]:660382a73181:153:642 [2] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[1,2]:660382a73181:153:642 [2] NCCL INFO Using network Socket
[1,3]:660382a73181:154:641 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[1,3]:660382a73181:154:641 [3] NCCL INFO P2P plugin IBext
[1,3]:660382a73181:154:641 [3] NCCL INFO NET/IB : No device found.
[1,1]:660382a73181:152:640 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[1,1]:660382a73181:152:640 [1] NCCL INFO P2P plugin IBext
[1,1]:660382a73181:152:640 [1] NCCL INFO NET/IB : No device found.
[1,3]:660382a73181:154:641 [3] NCCL INFO NET/IB : No device found.
[1,3]:660382a73181:154:641 [3] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[1,3]:660382a73181:154:641 [3] NCCL INFO Using network Socket
[1,1]:660382a73181:152:640 [1] NCCL INFO NET/IB : No device found.
[1,1]:660382a73181:152:640 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[1,1]:660382a73181:152:640 [1] NCCL INFO Using network Socket
[1,1]:660382a73181:152:640 [1] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 2/-1/-1->1->-1 [2] -1/-1/-1->1->2 [3] 2/-1/-1->1->-1 [4] -1/-1/-1->1->2 [5] 2/-1/-1->1->-1 [6] -1/-1/-1->1->2 [7] 2/-1/-1->1->-1
[1,1]:660382a73181:152:640 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
[1,2]:660382a73181:153:642 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 1/-1/-1->2->3 [3] 3/-1/-1->2->1 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 1/-1/-1->2->3 [7] 3/-1/-1->2->1
[1,2]:660382a73181:153:642 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff00,000fffff
[1,3]:660382a73181:154:641 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 2/-1/-1->3->0 [3] 0/-1/-1->3->2 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 2/-1/-1->3->0 [7] 0/-1/-1->3->2
[1,3]:660382a73181:154:641 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 00/08 : 0 1 2 3
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 01/08 : 0 3 2 1
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 02/08 : 0 3 1 2
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 03/08 : 0 2 1 3
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 04/08 : 0 1 2 3
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 05/08 : 0 3 2 1
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 06/08 : 0 3 1 2
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 07/08 : 0 2 1 3
[1,0]:660382a73181:151:639 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] -1/-1/-1->0->3 [2] 3/-1/-1->0->-1 [3] -1/-1/-1->0->3 [4] 3/-1/-1->0->-1 [5] -1/-1/-1->0->3 [6] 3/-1/-1->0->-1 [7] -1/-1/-1->0->3
[1,0]:660382a73181:151:639 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 00 : 3[b000] -> 0[6000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 00 : 1[7000] -> 2[a000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 03 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 00 : 2[a000] -> 3[b000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 02 : 1[7000] -> 2[a000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 04 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 04 : 2[a000] -> 3[b000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 04 : 1[7000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 00 : 0[6000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 07 : 3[b000] -> 0[6000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 06 : 1[7000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 04 : 0[6000] -> 1[7000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 02 : 2[a000] -> 0[6000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 02 : 3[b000] -> 1[7000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 03 : 1[7000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 03 : 0[6000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 06 : 2[a000] -> 0[6000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 06 : 3[b000] -> 1[7000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 07 : 1[7000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 07 : 0[6000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 01 : 2[a000] -> 1[7000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 01 : 0[6000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 03 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 01 : 3[b000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 02 : 0[6000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 05 : 2[a000] -> 1[7000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 01 : 1[7000] -> 0[6000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 05 : 3[b000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 05 : 0[6000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 07 : 2[a000] -> 1[7000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 05 : 1[7000] -> 0[6000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 06 : 0[6000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Connected all rings
[1,1]:660382a73181:152:640 [1] NCCL INFO Connected all rings
[1,3]:660382a73181:154:641 [3] NCCL INFO Connected all rings
[1,0]:660382a73181:151:639 [0] NCCL INFO Connected all rings
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 01 : 1[7000] -> 2[a000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 03 : 1[7000] -> 2[a000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 05 : 1[7000] -> 2[a000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 07 : 1[7000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 01 : 2[a000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 02 : 2[a000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 03 : 2[a000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 01 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 05 : 2[a000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 02 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 06 : 2[a000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 05 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 07 : 2[a000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 06 : 3[b000] -> 0[6000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 00 : 0[6000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 03 : 0[6000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 04 : 0[6000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 07 : 0[6000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 00 : 3[b000] -> 2[a000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 02 : 3[b000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 00 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 03 : 3[b000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 02 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 04 : 3[b000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 04 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 06 : 3[b000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 06 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 07 : 3[b000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Connected all trees
[1,0]:660382a73181:151:639 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
[1,0]:660382a73181:151:639 [0] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
[1,1]:660382a73181:152:640 [1] NCCL INFO Connected all trees
[1,1]:660382a73181:152:640 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
[1,1]:660382a73181:152:640 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
[1,1]:660382a73181:152:640 [1] NCCL INFO comm 0x7fbdbb444480 rank 1 nranks 4 cudaDev 1 busId 7000 - Init COMPLETE
[1,0]:660382a73181:151:639 [0] NCCL INFO comm 0x7fbf734ca7c0 rank 0 nranks 4 cudaDev 0 busId 6000 - Init COMPLETE
[1,2]:660382a73181:153:642 [2] NCCL INFO Connected all trees
[1,2]:660382a73181:153:642 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
[1,2]:660382a73181:153:642 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
[1,3]:660382a73181:154:641 [3] NCCL INFO Connected all trees
[1,3]:660382a73181:154:641 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
[1,3]:660382a73181:154:641 [3] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
[1,2]:660382a73181:153:642 [2] NCCL INFO comm 0x7efd834445d0 rank 2 nranks 4 cudaDev 2 busId a000 - Init COMPLETE
[1,3]:660382a73181:154:641 [3] NCCL INFO comm 0x7fe8ab43c160 rank 3 nranks 4 cudaDev 3 busId b000 - Init COMPLETE
[1,1]:
[1,1]:660382a73181:152:640 [1] enqueue.cc:802 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
[1,1]:660382a73181:152:640 [1] NCCL INFO enqueue.cc:884 -> 1
[1,2]:
[1,2]:660382a73181:153:642 [2] enqueue.cc:802 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
[1,2]:660382a73181:153:642 [2] NCCL INFO enqueue.cc:884 -> 1
[1,3]:
[1,3]:660382a73181:154:641 [3] enqueue.cc:802 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
[1,3]:660382a73181:154:641 [3] NCCL INFO enqueue.cc:884 -> 1
[1,0]:
[1,0]:660382a73181:151:639 [0] enqueue.cc:802 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
[1,0]:660382a73181:151:639 [0] NCCL INFO enqueue.cc:884 -> 1
[1,0]:
[1,0]:660382a73181:151:639 [0] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..-1 range)
[1,0]:660382a73181:151:639 [0] NCCL INFO enqueue.cc:874 -> 4
[1,0]:
[1,0]:660382a73181:151:639 [0] init.cc:895 NCCL WARN Cuda failure 'invalid device ordinal'
[1,1]:
[1,1]:660382a73181:152:640 [1] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..-1 range)
[1,1]:660382a73181:152:640 [1] NCCL INFO enqueue.cc:874 -> 4
[1,1]:
[1,1]:660382a73181:152:640 [1] init.cc:895 NCCL WARN Cuda failure 'invalid device ordinal'
[1,3]:
[1,3]:660382a73181:154:641 [3] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..-1 range)
[1,3]:660382a73181:154:641 [3] NCCL INFO enqueue.cc:874 -> 4
[1,3]:
[1,3]:660382a73181:154:641 [3] init.cc:895 NCCL WARN Cuda failure 'invalid device ordinal'
[1,2]:
[1,2]:660382a73181:153:642 [2] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..-1 range)
[1,2]:660382a73181:153:642 [2] NCCL INFO enqueue.cc:874 -> 4
[1,2]:
[1,2]:660382a73181:153:642 [2] init.cc:895 NCCL WARN Cuda failure 'invalid device ordinal'

Thanks for the additional information. As best I can tell, you're using our 21.04 containers on a Tesla V100 system that now has an R460 driver installed. The NCCL message you're seeing indicates that the container did not enter "forward compatibility" mode as we would expect, so you're falling back to "enhanced compatibility" mode, which NCCL unfortunately doesn't yet support (see "Using CUDACHECK(cudaStreamGetCaptureInfo_v2(...)) breaks enhanced compatibility", NVIDIA/nccl issue #496 on GitHub: https://github.com/NVIDIA/nccl/issues/496).

It sounds to me like the next step here is simply for us to find out why the forward compatibility mode, which your system [as best I understand it] should support, isn’t kicking in.
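
(Some background, in case it helps: when forward compatibility engages, the container's startup check exposes the newer user-mode driver libraries by creating a lib symlink next to lib.real, roughly

/usr/local/cuda/compat/lib -> /usr/local/cuda/compat/lib.real

and that directory is on the container's library search path. If the check never runs, or decides the host driver doesn't qualify, the symlink is absent and the process falls back to the host's libcuda.)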

Can you please provide the output of the following?

nvidia-smi
echo ${_CUDA_COMPAT_STATUS}
ls -al /usr/local/cuda/compat/

…from inside your running container?

Thanks,
Cliff

Here is the output:

nvidia-smi
Mon May 17 14:05:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   37C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    42W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   37C    P0    43W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   35C    P0    42W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

echo ${_CUDA_COMPAT_STATUS} -> no output

ls -al /usr/local/cuda/compat
total 12
drwxrwxrwx 1 root root 4096 Apr 22 22:08 .
drwxr-xr-x 1 root root 4096 Apr 22 22:20 ..
drwxr-xr-x 2 root root 4096 Apr 22 22:08 lib.real

Thanks for the update. This is curious, because it looks like the compatibility check never even ran. Can you also share the exact command you use to start up the container in the first place, please?

Thanks,
Cliff

We figured out the problem: we were setting the BASH_ENV environment variable to a different file in our docker run command, so the default bashrc (/etc/bash.bashrc) was never sourced; sourcing it is what runs the compatibility check.
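
In case someone else runs into this: a custom BASH_ENV file can keep the check working if it sources the default bashrc first, along these lines (a simplified version; the file name and the setting are just placeholders):

# custom-env.sh, passed with: docker run -e BASH_ENV=/path/to/custom-env.sh ...
# Source the container's default bashrc first so the CUDA compatibility
# check still runs (it is what creates /usr/local/cuda/compat/lib).
source /etc/bash.bashrc
# ...our own environment settings follow here.
export OUR_CUSTOM_SETTING=1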

After listing the compat directory from all four ranks the job was started with, we saw that the lib symbolic link was visible to only the first two ranks:

ls -al /usr/local/cuda/compat
[1,2]:total 16
[1,2]:drwxrwxrwx 1 root root 4096 May 25 15:31 .
[1,2]:drwxr-xr-x 1 root root 4096 Apr 22 22:20 ..
[1,2]:-rw-rw-r-- 1 root root 0 May 25 15:31 .460.73.01.e0e0c8358302.checked
[1,2]:drwxr-xr-x 2 root root 4096 Apr 22 22:08 lib.real

[1,3]:total 16
[1,3]:drwxrwxrwx 1 root root 4096 May 25 15:31 .
[1,3]:drwxr-xr-x 1 root root 4096 Apr 22 22:20 ..
[1,3]:-rw-rw-r-- 1 root root 0 May 25 15:31 .460.73.01.e0e0c8358302.checked
[1,3]:drwxr-xr-x 2 root root 4096 Apr 22 22:08 lib.real

[1,1]:total 16
[1,1]:drwxrwxrwx 1 root root 4096 May 25 15:31 .
[1,1]:drwxr-xr-x 1 root root 4096 Apr 22 22:20 ..
[1,1]:-rw-rw-r-- 1 root root 0 May 25 15:31 .460.73.01.e0e0c8358302.checked
[1,1]:lrwxrwxrwx 1 root root 31 May 25 15:31 lib -> /usr/local/cuda/compat/lib.real
[1,1]:drwxr-xr-x 2 root root 4096 Apr 22 22:08 lib.real

[1,0]:total 20
[1,0]:drwxrwxrwx 1 root root 4096 May 25 15:31 .
[1,0]:drwxr-xr-x 1 root root 4096 Apr 22 22:20 ..
[1,0]:-rw-rw-r-- 1 root root 0 May 25 15:31 .460.73.01.e0e0c8358302.checked
[1,0]:lrwxrwxrwx 1 root root 31 May 25 15:31 lib -> /usr/local/cuda/compat/lib.real
[1,0]:drwxr-xr-x 1 root root 4096 May 25 15:31 lib.real

And then we again get this warning from ranks 2 and 3, the ones without the lib symlink (and then the job dies):
Cuda failure 'API call is not supported in the installed CUDA driver'

Do you have an idea what the issue could be? Thanks!
And can you tell me how this compatibility check works, i.e. how it gets triggered?

Edit: we ran the job again, and this time the symbolic link was visible from three of the ranks. Is this somehow timing-related?

Never mind, we found a way to make this work for all the ranks. But the question about how the compatibility check works would still be interesting.