Environment
Baremetal or Container (if container which image + tag): TAO v3 docker image
tao classification train runs successfully but hangs at the very end. Here is the output:
666/668 [============================>.] - ETA: 0s - loss: 1.4168 - acc: 0.3551
667/668 [============================>.] - ETA: 0s - loss: 1.4170 - acc: 0.3546
668/668 [==============================] - 458s 686ms/step - loss: 1.4168 - acc: 0.3555 - val_loss: 1.4443 - val_acc: 0.3511
9b5c358325b1:101:129 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
9b5c358325b1:101:129 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
9b5c358325b1:101:129 [0] NCCL INFO NET/IB : No device found.
9b5c358325b1:101:129 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
9b5c358325b1:101:129 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
9b5c358325b1:101:129 [0] NCCL INFO Channel 00/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 01/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 02/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 03/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 04/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 05/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 06/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 07/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 08/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 09/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 10/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 11/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 12/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 13/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 14/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 15/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 16/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 17/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 18/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 19/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 20/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 21/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 22/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 23/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 24/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 25/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 26/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 27/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 28/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 29/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 30/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Channel 31/32 : 0
9b5c358325b1:101:129 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1->0->-1|-1->0->-1
9b5c358325b1:101:129 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
9b5c358325b1:101:129 [0] NCCL INFO comm 0x7f67d8328cb0 rank 0 nranks 1 cudaDev 0 busId 8000 - Init COMPLETE
After the last progress line (668/668), the process hangs for about 60 seconds before it prints the NCCL output. Could this be caused by a network timeout? Is there a way to disable these network calls?
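One way to test the network-timeout hypothesis is to pin NCCL to the loopback interface and time the run. The sketch below is only an illustration under assumptions: `nccl_loopback_env` and `timed_run` are hypothetical helper names, and it assumes the variables actually reach the training process (for example, when the command is started from inside the container). `NCCL_SOCKET_IFNAME`, `NCCL_IB_DISABLE` and `NCCL_DEBUG` are standard NCCL environment variables; whether they remove the delay here is untested.

```python
import os
import subprocess
import time


def nccl_loopback_env(base_env=None):
    """Return a copy of the environment that keeps NCCL on the loopback interface."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_SOCKET_IFNAME"] = "lo"  # use only lo, not eth0, for NCCL sockets
    env["NCCL_IB_DISABLE"] = "1"      # skip the InfiniBand transport entirely
    env["NCCL_DEBUG"] = "INFO"        # keep the NCCL INFO lines for comparison
    return env


def timed_run(cmd, env):
    """Run cmd (a list of arguments) with env; return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    result = subprocess.run(cmd, env=env)
    return result.returncode, time.monotonic() - start
```

Calling `timed_run(cmd, nccl_loopback_env())`, where `cmd` is your own training command as a list of arguments, and comparing the elapsed time with a run using the unmodified environment would show whether the interface selection has any effect on the 60-second pause.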