TAO hangs at end of training

Environment

Baremetal or Container (if container, which image + tag): Container, TAO v3 docker image

tao classification train runs through the last epoch successfully but then hangs at the very end. Here is the output:

666/668 [============================>.] - ETA: 0s - loss: 1.4168 - acc: 0.3551
667/668 [============================>.] - ETA: 0s - loss: 1.4170 - acc: 0.3546
668/668 [==============================] - 458s 686ms/step - loss: 1.4168 - acc: 0.3555 - val_loss: 1.4443 - val_acc: 0.3511
9b5c358325b1:101:129 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
9b5c358325b1:101:129 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
9b5c358325b1:101:129 [0] NCCL INFO NET/IB : No device found.
9b5c358325b1:101:129 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
9b5c358325b1:101:129 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
9b5c358325b1:101:129 [0] NCCL INFO Channel 00/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 01/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 02/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 03/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 04/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 05/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 06/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 07/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 08/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 09/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 10/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 11/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 12/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 13/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 14/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 15/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 16/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 17/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 18/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 19/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 20/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 21/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 22/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 23/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 24/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 25/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 26/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 27/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 28/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 29/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 30/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 31/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0->-1
9b5c358325b1:101:129 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
9b5c358325b1:101:129 [0] NCCL INFO comm 0x7f67d8328cb0 rank 0 nranks 1 cudaDev 0 busId 8000 - Init COMPLETE

It hangs for about 60 seconds before printing the NCCL output. Could this be caused by a network timeout? Is there a way to disable the network calls?
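For reference, here is a minimal sketch of what I would try in order to keep NCCL off the container's network interfaces. It only builds an environment with the standard NCCL variables NCCL_SOCKET_IFNAME, NCCL_IB_DISABLE and NCCL_DEBUG. The helper name nccl_local_env is made up for illustration. How the variables get passed into the TAO container is not covered here, and I have not confirmed that they remove the 60-second delay.

```python
import os


def nccl_local_env(base=None):
    """Return a copy of an environment with NCCL restricted to loopback.

    Hypothetical helper for illustration only. The variable names are
    standard NCCL settings; whether they shorten the observed hang is
    an untested assumption.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_SOCKET_IFNAME"] = "lo"  # use only the loopback interface for sockets
    env["NCCL_IB_DISABLE"] = "1"      # do not use the InfiniBand transport
    env["NCCL_DEBUG"] = "INFO"        # keep the verbose NCCL log for comparison
    return env


if __name__ == "__main__":
    env = nccl_local_env()
    # Print only the NCCL settings so they can be copied elsewhere.
    for key in sorted(k for k in env if k.startswith("NCCL_")):
        print(f"{key}={env[key]}")
```

The log already shows "rank 0 nranks 1", so this is a single-process run and NCCL should not need eth0 at all. That is why limiting it to lo seems like a reasonable first experiment.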

Hi,

This forum mainly covers updates and issues related to TensorRT. We recommend posting your question on the TAO forum to get better help.

Thank you.
