TAO hangs at end of training

Environment

Baremetal or Container (if container which image + tag): TAO v3 docker image

tao classification train runs successfully but hangs at the very end. Here is the output:

666/668 [============================>.] - ETA: 0s - loss: 1.4168 - acc: 0.3551
667/668 [============================>.] - ETA: 0s - loss: 1.4170 - acc: 0.3546
668/668 [==============================] - 458s 686ms/step - loss: 1.4168 - acc: 0.3555 - val_loss: 1.4443 - val_acc: 0.3511
9b5c358325b1:101:129 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
9b5c358325b1:101:129 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
9b5c358325b1:101:129 [0] NCCL INFO NET/IB : No device found.
9b5c358325b1:101:129 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
9b5c358325b1:101:129 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
9b5c358325b1:101:129 [0] NCCL INFO Channel 00/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 01/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 02/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 03/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 04/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 05/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 06/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 07/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 08/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 09/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 10/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 11/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 12/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 13/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 14/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 15/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 16/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 17/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 18/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 19/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 20/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 21/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 22/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 23/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 24/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 25/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 26/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 27/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 28/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 29/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 30/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Channel 31/32 :    0
9b5c358325b1:101:129 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0->-1
9b5c358325b1:101:129 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
9b5c358325b1:101:129 [0] NCCL INFO comm 0x7f67d8328cb0 rank 0 nranks 1 cudaDev 0 busId 8000 - Init COMPLETE

It hangs for about 60 seconds before printing the output related to NCCL. Could this be due to a network timeout issue? Can I disable any network calls?
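
On the question about disabling network calls: NCCL reads a few documented environment variables that limit which transports and interfaces it probes. The sketch below builds an environment with them set. It is an illustration only: the helper name `nccl_env` is hypothetical, and whether any of these settings shortens the roughly 60-second pause reported here is an untested assumption. How the variables reach the process inside the TAO container is not shown.

```python
import os

# Documented NCCL environment variables. Whether they affect the pause
# described in this thread is an assumption, not a confirmed fix.
NCCL_OVERRIDES = {
    "NCCL_SOCKET_IFNAME": "lo",  # use only the loopback interface for socket transport
    "NCCL_IB_DISABLE": "1",      # do not use the InfiniBand transport
    "NCCL_DEBUG": "WARN",        # print warnings only instead of INFO lines
}


def nccl_env(base=None):
    """Return a copy of `base` (default: os.environ) with the NCCL overrides applied.

    `nccl_env` is a hypothetical helper name used for illustration.
    """
    env = dict(os.environ if base is None else base)
    env.update(NCCL_OVERRIDES)
    return env
```

The returned dictionary could be passed as `env=nccl_env()` to `subprocess.run` when launching the training process. The log above reports a single rank (`nranks 1`), so restricting NCCL to loopback is unlikely to change the training itself, though that has not been verified here.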

Did you run with WSL?

What is WSL?

Could you double-check, for example by training for just 1 epoch?

I have run it many times with the same result. What exactly do you want me to double check?

You mentioned that it “hangs at the very end”. How many epochs did you train? If you train for 100 epochs, does it hang at the 98th or 99th epoch?

It hangs at both, but the above output is only displayed at the end of the first epoch.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks

Sorry, I am a little confused. If you train for 100 epochs, what is the result? Do the 1st, 2nd, and 98th epochs run without problems?
