INFO: Starting Training Loop.
Epoch 1/450
d502f09cd598:89:185 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
d502f09cd598:89:185 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
d502f09cd598:89:185 [0] NCCL INFO P2P plugin IBext
d502f09cd598:89:185 [0] NCCL INFO NET/IB : No device found.
d502f09cd598:89:185 [0] NCCL INFO NET/IB : No device found.
d502f09cd598:89:185 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
d502f09cd598:89:185 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
d502f09cd598:90:188 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
d502f09cd598:90:188 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
d502f09cd598:90:188 [1] NCCL INFO P2P plugin IBext
d502f09cd598:90:188 [1] NCCL INFO NET/IB : No device found.
d502f09cd598:90:188 [1] NCCL INFO NET/IB : No device found.
d502f09cd598:90:188 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
d502f09cd598:90:188 [1] NCCL INFO Using network Socket
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Channel 00/02 : 0 1
d502f09cd598:89:185 [0] NCCL INFO Channel 01/02 : 0 1
d502f09cd598:89:185 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
d502f09cd598:90:188 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:90:188 [1] NCCL INFO Channel 00 : 1[a000] -> 0[4000] via direct shared memory
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:90:188 [1] NCCL INFO Channel 01 : 1[a000] -> 0[4000] via direct shared memory
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Channel 00 : 0[4000] -> 1[a000] via direct shared memory
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Channel 01 : 0[4000] -> 1[a000] via direct shared memory
d502f09cd598:90:188 [1] NCCL INFO Connected all rings
d502f09cd598:90:188 [1] NCCL INFO Connected all trees
d502f09cd598:90:188 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
d502f09cd598:90:188 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
d502f09cd598:89:185 [0] NCCL INFO Connected all rings
d502f09cd598:89:185 [0] NCCL INFO Connected all trees
d502f09cd598:89:185 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
d502f09cd598:89:185 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
d502f09cd598:90:188 [1] NCCL INFO comm 0x7fced87dca00 rank 1 nranks 2 cudaDev 1 busId a000 - Init COMPLETE
d502f09cd598:89:185 [0] NCCL INFO comm 0x7fbed87dfc60 rank 0 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
d502f09cd598:89:185 [0] NCCL INFO Launch mode Parallel
3/860 […] - ETA: 1:49:22 - loss: nan
Batch 2: Invalid loss, terminating training
I have trained a couple of models using TAO’s different Jupyter notebooks, and for the majority of them I kept the default learning rate from the .txt spec files. I only change the batch size to whatever best suits my requirements.
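For reference, the learning rate and batch size mentioned above live in the training_config block of the spec file. A minimal sketch with placeholder values (the real defaults come from the notebook’s .txt spec, so treat these numbers as illustrative only):

training_config {
  batch_size_per_gpu: 8              # the "BS" I change per experiment
  num_epochs: 450
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 1e-6        # placeholder value
      max_learning_rate: 1e-4        # a common knob to lower if the loss goes NaN
      soft_start: 0.1
      annealing: 0.5
    }
  }
  # regularizer, optimizer, etc. omitted
}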
I found one culprit in your spec.
For YOLOv4-tiny, if you use the cspdarknet_tiny arch, only big_grid_xy_extend and mid_grid_xy_extend should be provided, so that they align with the anchor shapes; if you use the cspdarknet_tiny_3l arch, all three grid_xy_extend parameters should be provided. See the sketch below.
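A sketch of the relevant yolov4_config fields, assuming the two-head cspdarknet_tiny arch (the numeric values below are placeholders, not recommendations):

yolov4_config {
  arch: "cspdarknet_tiny"
  # Two detection heads, so only the big and mid settings are given:
  big_anchor_shape: "[(116, 90), (156, 198), (373, 326)]"   # placeholder anchors
  mid_anchor_shape: "[(30, 61), (62, 45), (59, 119)]"       # placeholder anchors
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  # other fields (loss weights, activation, etc.) omitted
}
# With arch: "cspdarknet_tiny_3l" there are three heads, so small_anchor_shape
# and small_grid_xy_extend must be provided as well.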
Hi, the sequence data format gives me the following error:
INFO: Validation dataset specified by validation_fold requires the training label format to be TFRecords.
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 145, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 707, in return_func
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 695, in return_func
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 141, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 126, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 63, in run_experiment
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/utils/spec_loader.py", line 57, in load_experiment_spec
AssertionError: Validation dataset specified by validation_fold requires the training label format to be TFRecords.
Traceback (most recent call last):
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 145, in <module>
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 707, in return_func
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 695, in return_func
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 141, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 126, in main
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 63, in run_experiment
File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/utils/spec_loader.py", line 57, in load_experiment_spec
AssertionError: Validation dataset specified by validation_fold requires the training label format to be TFRecords.
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
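The assertion above comes from the spec loader: validation_fold only works when the training labels are in TFRecords format. With the sequence (image/label directory) format, the validation set has to be pointed at explicit directories instead. A sketch of such a dataset_config, with hypothetical paths:

dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tao-experiments/data/train/labels"   # hypothetical path
    image_directory_path: "/workspace/tao-experiments/data/train/images"   # hypothetical path
  }
  # target_class_mapping entries omitted
  # For sequence-format labels, use validation_data_sources instead of validation_fold:
  validation_data_sources: {
    label_directory_path: "/workspace/tao-experiments/data/val/labels"     # hypothetical path
    image_directory_path: "/workspace/tao-experiments/data/val/images"     # hypothetical path
  }
}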
Now I am facing the same issue even for classification.
edcda39c77a8:158:253 [1] NCCL INFO comm 0x7faf387dbc80 rank 1 nranks 2 cudaDev 1 busId a000 - Init COMPLETE
edcda39c77a8:157:256 [0] NCCL INFO comm 0x7fd7c07e5810 rank 0 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
edcda39c77a8:157:256 [0] NCCL INFO Launch mode Parallel
1/3633 […] - ETA: 4:38:20 - loss: 2.9416 - acc: 0.0000e+00
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.
2022-06-21 08:18:43,918 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.
Hi, I am facing this issue while using multiple GPUs.
Do you know what is causing this? I have trained on multiple GPUs before and never faced this issue.
I don’t know what happened all of a sudden. Also, when I run the Jupyter notebook, the container just stops as soon as I run any command in the notebook.
Is there a way to fix this?