ncclAllReduce failed: unhandled cuda error

We are currently testing the latest NVIDIA TensorFlow Docker container (21.04) with Horovod on some Tesla V100 GPUs, running the latest driver, 460.32.03. We use the experimental_compile option for some of the tf.functions in the training code, so the code is partially compiled with XLA.
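
For reference, the relevant part of the training code looks roughly like this (heavily simplified; the model, optimizer, and loss here are stand-ins for our real ones):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# Pin each Horovod rank to its own GPU.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(0.01 * hvd.size())

@tf.function(experimental_compile=True)  # partial XLA compilation
def training_step(images, labels):
    with tf.GradientTape() as tape:
        loss = loss_fn(labels, model(images, training=True))
    # Horovod's DistributedGradientTape averages the gradients across
    # ranks with ncclAllReduce, which is where the error below is raised.
    tape = hvd.DistributedGradientTape(tape)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss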

We get the following error when starting the training:

[1,0]:tensorflow.python.framework.errors_impl.UnknownError: ncclAllReduce failed: unhandled cuda error
[1,0]: [[{{node DistributedGradientTape_Allreduce/cond_452/then/_3643/DistributedGradientTape_Allreduce/cond_452/HorovodAllreduce_gradient_tape_model_Conv2D_Conv2DBackpropFilter_0}}]] [Op:__inference_training_step_82823]

Do you have any idea what could be going wrong? We did not replace any of the NVIDIA components (CUDA, cuDNN, NCCL, …) in the container.

Hi Erik,

We found some incompatibility issues between the driver and fabric-manager packages and have pushed an updated driver package to the repository. Could you check whether that fixes the issue?
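
On Ubuntu, for example, the upgrade should look something along these lines (the exact package names depend on your distribution and on how the driver was originally installed):

sudo apt-get update
sudo apt-get install --only-upgrade nvidia-driver-460 nvidia-fabricmanager-460
sudo reboot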

Hi,

Thanks for the quick answer. We tried upgrading the driver to version 460.73.01 (is this the version you suggested?), but now we get a segfault:

[1,2]:[6b9dcc145d7b:153 :0:639] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[1,3]:[6b9dcc145d7b:154 :0:640] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[1,1]:[6b9dcc145d7b:152 :0:641] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[1,1]:==== backtrace (tid: 641) ====
[1,1]: 0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7fb15a322d24]
[1,1]: 1 /usr/local/ucx/lib/libucs.so.0(+0x27eff) [0x7fb15a322eff]
[1,1]: 2 /usr/local/ucx/lib/libucs.so.0(+0x28234) [0x7fb15a323234]
[1,1]: 3 /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fb293e6b3c0]
[1,1]: 4 /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommAbort+0x66) [0x7fb19194f796]
[1,1]: 5 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x54) [0x7fb15d179b94]
[1,1]: 6 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x221) [0x7fb15d17a4a1]
[1,1]: 7 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d) [0x7fb15d14491d]
[1,1]: 8 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x48) [0x7fb15d144cf8]
[1,1]: 9 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x638ce) [0x7fb15d1208ce]
[1,3]:==== backtrace (tid: 640) ====
[1,3]: 0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7feb1cf17d24]
[1,3]: 1 /usr/local/ucx/lib/libucs.so.0(+0x27eff) [0x7feb1cf17eff]
[1,3]: 2 /usr/local/ucx/lib/libucs.so.0(+0x28234) [0x7feb1cf18234]
[1,3]: 3 /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fec56a613c0]
[1,3]: 4 /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommAbort+0x66) [0x7feb544dc796]
[1,3]: 5 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x54) [0x7feb1fd54b94]
[1,3]: 6 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x221) [0x7feb1fd554a1]
[1,3]: 7 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d) [0x7feb1fd1f91d]
[1,3]: 8 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x48) [0x7feb1fd1fcf8]
[1,3]: 9 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x638ce) [0x7feb1fcfb8ce]
[1,3]:10 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84) [0x7febedabbd84]
[1,3]:11 /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7fec56a55609]
[1,3]:12 /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fec56b91293]
[1,3]:=================================
[1,2]:==== backtrace (tid: 639) ====
[1,2]: 0 /usr/local/ucx/lib/libucs.so.0(ucs_handle_error+0x2a4) [0x7fedd00c0d24]
[1,2]: 1 /usr/local/ucx/lib/libucs.so.0(+0x27eff) [0x7fedd00c0eff]
[1,2]: 2 /usr/local/ucx/lib/libucs.so.0(+0x28234) [0x7fedd00c1234]
[1,2]: 3 /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7fef0b8093c0]
[1,2]: 4 /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommAbort+0x66) [0x7fee092ce796]
[1,2]: 5 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x54) [0x7fedd2b16b94]
[1,2]: 6 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x221) [0x7fedd2b174a1]
[1,2]: 7 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d) [0x7fedd2ae191d]
[1,2]: 8 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x48) [0x7fedd2ae1cf8]
[1,2]: 9 /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x638ce) [0x7fedd2abd8ce]
[1,2]:10 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84) [0x7feea2863d84]
[1,2]:11 /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7fef0b7fd609]
[1,2]:12 /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fef0b939293]
[1,2]:=================================
[1,1]:10 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84) [0x7fb22aec5d84]
[1,1]:11 /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7fb293e5f609]
[1,1]:12 /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fb293f9b293]
[1,1]:=================================
[1,1]:[6b9dcc145d7b:00152] *** Process received signal ***
[1,3]:[6b9dcc145d7b:00154] *** Process received signal ***
[1,3]:[6b9dcc145d7b:00154] Signal: Segmentation fault (11)
[1,3]:[6b9dcc145d7b:00154] Signal code: (-6)
[1,3]:[6b9dcc145d7b:00154] Failing at address: 0x9a
[1,2]:[6b9dcc145d7b:00153] *** Process received signal ***
[1,2]:[6b9dcc145d7b:00153] Signal: Segmentation fault (11)
[1,2]:[6b9dcc145d7b:00153] Signal code: (-6)
[1,2]:[6b9dcc145d7b:00153] Failing at address: 0x99
[1,1]:[6b9dcc145d7b:00152] Signal: Segmentation fault (11)
[1,1]:[6b9dcc145d7b:00152] Signal code: (-6)
[1,1]:[6b9dcc145d7b:00152] Failing at address: 0x98
[1,1]:[6b9dcc145d7b:00152] [ 0] [1,3]:[6b9dcc145d7b:00154] [ 0] [1,2]:[6b9dcc145d7b:00153] [ 0] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fef0b8093c0]
[1,2]:[6b9dcc145d7b:00153] [ 1] [1,1]:/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fb293e6b3c0]
[1,1]:[6b9dcc145d7b:00152] [ 1] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommAbort+0x66)[0x7fb19194f796]
[1,1]:[6b9dcc145d7b:00152] [ 2] [1,3]:/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x7fec56a613c0]
[1,3]:[6b9dcc145d7b:00154] [ 1] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommAbort+0x66)[0x7feb544dc796]
[1,3]:[6b9dcc145d7b:00154] [ 2] [1,2]:/usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommAbort+0x66)[0x7fee092ce796]
[1,2]:[6b9dcc145d7b:00153] [ 2] [1,1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x54)[0x7fb15d179b94]
[1,1]:[6b9dcc145d7b:00152] [ 3] [1,3]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x54)[0x7feb1fd54b94]
[1,3]:[6b9dcc145d7b:00154] [ 3] [1,2]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common11NCCLContext10ErrorCheckESs12ncclResult_tRP8ncclComm+0x54)[0x7fedd2b16b94]
[1,2]:[6b9dcc145d7b:00153] [ 3] /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x221)[0x7fedd2b174a1]
[1,2]:[6b9dcc145d7b:00153] [ 4] [1,1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x221)[0x7fb15d17a4a1]
[1,1]:[6b9dcc145d7b:00152] [ 4] [1,3]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x221)[0x7feb1fd554a1]
[1,3]:[6b9dcc145d7b:00154] [ 4] [1,1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7fb15d14491d]
[1,1]:[6b9dcc145d7b:00152] [ 5] [1,2]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7fedd2ae191d]
[1,2]:[6b9dcc145d7b:00153] [ 5] [1,3]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7feb1fd1f91d]
[1,3]:[6b9dcc145d7b:00154] [ 5] /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x48)[0x7feb1fd1fcf8]
[1,1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x48)[0x7fb15d144cf8]
[1,1]:[6b9dcc145d7b:00152] [ 6] [1,2]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x48)[0x7fedd2ae1cf8]
[1,2]:[6b9dcc145d7b:00153] [ 6] /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x638ce)[0x7fedd2abd8ce]
[1,3]:[6b9dcc145d7b:00154] [ 6] /usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x638ce)[0x7feb1fcfb8ce]
[1,3]:[6b9dcc145d7b:00154] [1,1]:/usr/local/lib/python3.8/dist-packages/horovod/tensorflow/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x638ce)[0x7fb15d1208ce]
[1,1]:[6b9dcc145d7b:00152] [ 7] [1,2]:[6b9dcc145d7b:00153] [ 7] [1,3]:[ 7] [1,2]:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84)[0x7feea2863d84]
[1,2]:[6b9dcc145d7b:00153] [ 8] [1,1]:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84)[0x7fb22aec5d84]
[1,1]:[6b9dcc145d7b:00152] [ 8] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7fb293e5f609]
[1,1]:[6b9dcc145d7b:00152] [ 9] [1,3]:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6d84)[0x7febedabbd84]
[1,3]:[6b9dcc145d7b:00154] [ 8] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7fec56a55609]
[1,3]:[6b9dcc145d7b:00154] [ 9] [1,2]:/usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x7fef0b7fd609]
[1,2]:[6b9dcc145d7b:00153] [ 9] [1,2]:/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fef0b939293]
[1,2]:[6b9dcc145d7b:00153] *** End of error message ***
[1,1]:/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fb293f9b293]
[1,1]:[6b9dcc145d7b:00152] *** End of error message ***
[1,3]:/usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7fec56b91293]
[1,3]:[6b9dcc145d7b:00154] *** End of error message ***

Hi Erik,

Sorry, I missed that you were actually on the r460 driver. The fix I mentioned before was for r450. Anyway, it’s always better to have the latest driver installed.

I talked to our internal teams, and they asked if you could enable NCCL_DEBUG=info. The segfault is raised from ncclCommAbort, which Horovod appears to call after catching an NCCL error, so the crash is likely a secondary symptom; the debug output might give us more details about the underlying failure.
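
If you launch with Open MPI, one way to get the variable to every rank is to export it on the mpirun command line, for example (train.py standing in for your actual training script):

mpirun -np 4 -x NCCL_DEBUG=info python train.py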

Hi!

Here is the output with NCCL_DEBUG=info:

[1,0]:660382a73181:151:639 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[1,0]:660382a73181:151:639 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[1,0]:660382a73181:151:639 [0] NCCL INFO P2P plugin IBext
[1,0]:660382a73181:151:639 [0] NCCL INFO NET/IB : No device found.
[1,0]:660382a73181:151:639 [0] NCCL INFO NET/IB : No device found.
[1,0]:660382a73181:151:639 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[1,0]:660382a73181:151:639 [0] NCCL INFO Using network Socket
[1,0]:NCCL version 2.9.6+cuda11.3
[1,2]:660382a73181:153:642 [2] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[1,2]:660382a73181:153:642 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[1,2]:660382a73181:153:642 [2] NCCL INFO P2P plugin IBext
[1,2]:660382a73181:153:642 [2] NCCL INFO NET/IB : No device found.
[1,3]:660382a73181:154:641 [3] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[1,1]:660382a73181:152:640 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
[1,2]:660382a73181:153:642 [2] NCCL INFO NET/IB : No device found.
[1,2]:660382a73181:153:642 [2] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[1,2]:660382a73181:153:642 [2] NCCL INFO Using network Socket
[1,3]:660382a73181:154:641 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[1,3]:660382a73181:154:641 [3] NCCL INFO P2P plugin IBext
[1,3]:660382a73181:154:641 [3] NCCL INFO NET/IB : No device found.
[1,1]:660382a73181:152:640 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
[1,1]:660382a73181:152:640 [1] NCCL INFO P2P plugin IBext
[1,1]:660382a73181:152:640 [1] NCCL INFO NET/IB : No device found.
[1,3]:660382a73181:154:641 [3] NCCL INFO NET/IB : No device found.
[1,3]:660382a73181:154:641 [3] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[1,3]:660382a73181:154:641 [3] NCCL INFO Using network Socket
[1,1]:660382a73181:152:640 [1] NCCL INFO NET/IB : No device found.
[1,1]:660382a73181:152:640 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
[1,1]:660382a73181:152:640 [1] NCCL INFO Using network Socket
[1,1]:660382a73181:152:640 [1] NCCL INFO Trees [0] -1/-1/-1->1->2 [1] 2/-1/-1->1->-1 [2] -1/-1/-1->1->2 [3] 2/-1/-1->1->-1 [4] -1/-1/-1->1->2 [5] 2/-1/-1->1->-1 [6] -1/-1/-1->1->2 [7] 2/-1/-1->1->-1
[1,1]:660382a73181:152:640 [1] NCCL INFO Setting affinity for GPU 1 to 0fffff00,000fffff
[1,2]:660382a73181:153:642 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 3/-1/-1->2->1 [2] 1/-1/-1->2->3 [3] 3/-1/-1->2->1 [4] 1/-1/-1->2->3 [5] 3/-1/-1->2->1 [6] 1/-1/-1->2->3 [7] 3/-1/-1->2->1
[1,2]:660382a73181:153:642 [2] NCCL INFO Setting affinity for GPU 2 to 0fffff00,000fffff
[1,3]:660382a73181:154:641 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 0/-1/-1->3->2 [2] 2/-1/-1->3->0 [3] 0/-1/-1->3->2 [4] 2/-1/-1->3->0 [5] 0/-1/-1->3->2 [6] 2/-1/-1->3->0 [7] 0/-1/-1->3->2
[1,3]:660382a73181:154:641 [3] NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 00/08 : 0 1 2 3
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 01/08 : 0 3 2 1
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 02/08 : 0 3 1 2
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 03/08 : 0 2 1 3
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 04/08 : 0 1 2 3
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 05/08 : 0 3 2 1
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 06/08 : 0 3 1 2
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 07/08 : 0 2 1 3
[1,0]:660382a73181:151:639 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] -1/-1/-1->0->3 [2] 3/-1/-1->0->-1 [3] -1/-1/-1->0->3 [4] 3/-1/-1->0->-1 [5] -1/-1/-1->0->3 [6] 3/-1/-1->0->-1 [7] -1/-1/-1->0->3
[1,0]:660382a73181:151:639 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 00 : 3[b000] -> 0[6000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 00 : 1[7000] -> 2[a000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 03 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 00 : 2[a000] -> 3[b000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 02 : 1[7000] -> 2[a000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 04 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 04 : 2[a000] -> 3[b000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 04 : 1[7000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 00 : 0[6000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 07 : 3[b000] -> 0[6000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 06 : 1[7000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 04 : 0[6000] -> 1[7000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 02 : 2[a000] -> 0[6000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 02 : 3[b000] -> 1[7000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 03 : 1[7000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 03 : 0[6000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 06 : 2[a000] -> 0[6000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 06 : 3[b000] -> 1[7000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 07 : 1[7000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 07 : 0[6000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 01 : 2[a000] -> 1[7000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 01 : 0[6000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 03 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 01 : 3[b000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 02 : 0[6000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 05 : 2[a000] -> 1[7000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 01 : 1[7000] -> 0[6000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 05 : 3[b000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 05 : 0[6000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 07 : 2[a000] -> 1[7000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 05 : 1[7000] -> 0[6000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 06 : 0[6000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Connected all rings
[1,1]:660382a73181:152:640 [1] NCCL INFO Connected all rings
[1,3]:660382a73181:154:641 [3] NCCL INFO Connected all rings
[1,0]:660382a73181:151:639 [0] NCCL INFO Connected all rings
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 01 : 1[7000] -> 2[a000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 03 : 1[7000] -> 2[a000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 05 : 1[7000] -> 2[a000] via P2P/IPC
[1,1]:660382a73181:152:640 [1] NCCL INFO Channel 07 : 1[7000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 01 : 2[a000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 02 : 2[a000] -> 3[b000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 03 : 2[a000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 01 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 05 : 2[a000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 02 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 06 : 2[a000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 05 : 3[b000] -> 0[6000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 07 : 2[a000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 06 : 3[b000] -> 0[6000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 00 : 0[6000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 03 : 0[6000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 04 : 0[6000] -> 3[b000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Channel 07 : 0[6000] -> 3[b000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 00 : 3[b000] -> 2[a000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 02 : 3[b000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 00 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 03 : 3[b000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 02 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 04 : 3[b000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 04 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 06 : 3[b000] -> 2[a000] via P2P/IPC
[1,2]:660382a73181:153:642 [2] NCCL INFO Channel 06 : 2[a000] -> 1[7000] via P2P/IPC
[1,3]:660382a73181:154:641 [3] NCCL INFO Channel 07 : 3[b000] -> 2[a000] via P2P/IPC
[1,0]:660382a73181:151:639 [0] NCCL INFO Connected all trees
[1,0]:660382a73181:151:639 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
[1,0]:660382a73181:151:639 [0] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
[1,1]:660382a73181:152:640 [1] NCCL INFO Connected all trees
[1,1]:660382a73181:152:640 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
[1,1]:660382a73181:152:640 [1] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
[1,1]:660382a73181:152:640 [1] NCCL INFO comm 0x7fbdbb444480 rank 1 nranks 4 cudaDev 1 busId 7000 - Init COMPLETE
[1,0]:660382a73181:151:639 [0] NCCL INFO comm 0x7fbf734ca7c0 rank 0 nranks 4 cudaDev 0 busId 6000 - Init COMPLETE
[1,2]:660382a73181:153:642 [2] NCCL INFO Connected all trees
[1,2]:660382a73181:153:642 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
[1,2]:660382a73181:153:642 [2] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
[1,3]:660382a73181:154:641 [3] NCCL INFO Connected all trees
[1,3]:660382a73181:154:641 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/512
[1,3]:660382a73181:154:641 [3] NCCL INFO 8 coll channels, 8 p2p channels, 2 p2p channels per peer
[1,2]:660382a73181:153:642 [2] NCCL INFO comm 0x7efd834445d0 rank 2 nranks 4 cudaDev 2 busId a000 - Init COMPLETE
[1,3]:660382a73181:154:641 [3] NCCL INFO comm 0x7fe8ab43c160 rank 3 nranks 4 cudaDev 3 busId b000 - Init COMPLETE
[1,1]:
[1,1]:660382a73181:152:640 [1] enqueue.cc:802 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
[1,1]:660382a73181:152:640 [1] NCCL INFO enqueue.cc:884 -> 1
[1,2]:
[1,2]:660382a73181:153:642 [2] enqueue.cc:802 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
[1,2]:660382a73181:153:642 [2] NCCL INFO enqueue.cc:884 -> 1
[1,3]:
[1,3]:660382a73181:154:641 [3] enqueue.cc:802 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
[1,3]:660382a73181:154:641 [3] NCCL INFO enqueue.cc:884 -> 1
[1,0]:
[1,0]:660382a73181:151:639 [0] enqueue.cc:802 NCCL WARN Cuda failure 'API call is not supported in the installed CUDA driver'
[1,0]:660382a73181:151:639 [0] NCCL INFO enqueue.cc:884 -> 1
[1,0]:
[1,0]:660382a73181:151:639 [0] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..-1 range)
[1,0]:660382a73181:151:639 [0] NCCL INFO enqueue.cc:874 -> 4
[1,0]:
[1,0]:660382a73181:151:639 [0] init.cc:895 NCCL WARN Cuda failure 'invalid device ordinal'
[1,1]:
[1,1]:660382a73181:152:640 [1] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..-1 range)
[1,1]:660382a73181:152:640 [1] NCCL INFO enqueue.cc:874 -> 4
[1,1]:
[1,1]:660382a73181:152:640 [1] init.cc:895 NCCL WARN Cuda failure 'invalid device ordinal'
[1,3]:
[1,3]:660382a73181:154:641 [3] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..-1 range)
[1,3]:660382a73181:154:641 [3] NCCL INFO enqueue.cc:874 -> 4
[1,3]:
[1,3]:660382a73181:154:641 [3] init.cc:895 NCCL WARN Cuda failure 'invalid device ordinal'
[1,2]:
[1,2]:660382a73181:153:642 [2] misc/argcheck.cc:39 NCCL WARN AllReduce : invalid root 0 (root should be in the 0..-1 range)
[1,2]:660382a73181:153:642 [2] NCCL INFO enqueue.cc:874 -> 4
[1,2]:
[1,2]:660382a73181:153:642 [2] init.cc:895 NCCL WARN Cuda failure 'invalid device ordinal'

Thanks for the additional information. As best I can tell, you're using our 21.04 containers on a Tesla V100 system that now has an R460 driver installed. The NCCL message you're seeing indicates that the container did not enter "forward compatibility" mode as we would expect, so you're falling back to "enhanced compatibility" mode, which NCCL unfortunately doesn't yet support (see "Using CUDACHECK(cudaStreamGetCaptureInfo_v2(...)) breaks enhanced compatibility", NVIDIA/nccl issue #496 on GitHub: https://github.com/NVIDIA/nccl/issues/496).

It sounds to me like the next step here is simply for us to find out why the forward compatibility mode, which your system [as best I understand it] should support, isn’t kicking in.
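
(Some background, in case it helps: when forward compatibility engages, the container's startup check exposes the newer user-mode driver libraries by creating a lib symlink next to lib.real, roughly

/usr/local/cuda/compat/lib -> /usr/local/cuda/compat/lib.real

and that directory is on the container's library search path. If the check never runs, or decides the host driver doesn't qualify, the symlink is absent and the process falls back to the host's libcuda.)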

Can you please provide the output of the following?

nvidia-smi
echo ${_CUDA_COMPAT_STATUS}
ls -al /usr/local/cuda/compat/

…from inside your running container?

Thanks,
Cliff

Here is the output:

nvidia-smi
Mon May 17 14:05:02 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:06:00.0 Off |                    0 |
| N/A   37C    P0    44W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:07:00.0 Off |                    0 |
| N/A   38C    P0    42W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  Off  | 00000000:0A:00.0 Off |                    0 |
| N/A   37C    P0    43W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  Off  | 00000000:0B:00.0 Off |                    0 |
| N/A   35C    P0    42W / 300W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

echo ${_CUDA_COMPAT_STATUS} -> no output

ls -al /usr/local/cuda/compat
total 12
drwxrwxrwx 1 root root 4096 Apr 22 22:08 .
drwxr-xr-x 1 root root 4096 Apr 22 22:20 ..
drwxr-xr-x 2 root root 4096 Apr 22 22:08 lib.real

Thanks for the update. This is curious, because it looks like the compatibility check never even ran. Can you also share the exact command you use to start up the container in the first place, please?

Thanks,
Cliff

We figured out the problem: we were setting the BASH_ENV environment variable to a different file in our docker run command, so the default bashrc (/etc/bash.bashrc) was never sourced; sourcing it is what runs the compatibility check.
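
In case someone else runs into this: a custom BASH_ENV file can keep the check working if it sources the default bashrc first, along these lines (a simplified version; the file name and the setting are just placeholders):

# custom-env.sh, passed with: docker run -e BASH_ENV=/path/to/custom-env.sh ...
# Source the container's default bashrc first so the CUDA compatibility
# check still runs (it is what creates /usr/local/cuda/compat/lib).
source /etc/bash.bashrc
# ...our own environment settings follow here.
export OUR_CUSTOM_SETTING=1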

After listing the compat directory from all four ranks the job was started with, we saw that the lib symbolic link was visible to only the first two ranks:

ls -al /usr/local/cuda/compat
[1,2]:total 16
[1,2]:drwxrwxrwx 1 root root 4096 May 25 15:31 .
[1,2]:drwxr-xr-x 1 root root 4096 Apr 22 22:20 ..
[1,2]:-rw-rw-r-- 1 root root 0 May 25 15:31 .460.73.01.e0e0c8358302.checked
[1,2]:drwxr-xr-x 2 root root 4096 Apr 22 22:08 lib.real

[1,3]:total 16
[1,3]:drwxrwxrwx 1 root root 4096 May 25 15:31 .
[1,3]:drwxr-xr-x 1 root root 4096 Apr 22 22:20 ..
[1,3]:-rw-rw-r-- 1 root root 0 May 25 15:31 .460.73.01.e0e0c8358302.checked
[1,3]:drwxr-xr-x 2 root root 4096 Apr 22 22:08 lib.real

[1,1]:total 16
[1,1]:drwxrwxrwx 1 root root 4096 May 25 15:31 .
[1,1]:drwxr-xr-x 1 root root 4096 Apr 22 22:20 ..
[1,1]:-rw-rw-r-- 1 root root 0 May 25 15:31 .460.73.01.e0e0c8358302.checked
[1,1]:lrwxrwxrwx 1 root root 31 May 25 15:31 lib -> /usr/local/cuda/compat/lib.real
[1,1]:drwxr-xr-x 2 root root 4096 Apr 22 22:08 lib.real

[1,0]:total 20
[1,0]:drwxrwxrwx 1 root root 4096 May 25 15:31 .
[1,0]:drwxr-xr-x 1 root root 4096 Apr 22 22:20 ..
[1,0]:-rw-rw-r-- 1 root root 0 May 25 15:31 .460.73.01.e0e0c8358302.checked
[1,0]:lrwxrwxrwx 1 root root 31 May 25 15:31 lib -> /usr/local/cuda/compat/lib.real
[1,0]:drwxr-xr-x 1 root root 4096 May 25 15:31 lib.real

And then we again get this warning from ranks 2 and 3, the ones without the lib symlink (and then the job dies):
Cuda failure 'API call is not supported in the installed CUDA driver'

Do you have an idea what the issue could be? Thanks!
And can you tell me how this compatibility check works, i.e. how it gets triggered?

Edit: we ran the job again, and this time the symbolic link was visible from three of the ranks. Is this somehow timing-related?

Never mind, we found a way to make this work for all the ranks. But the question about how the compatibility check works would still be interesting.