Invalid Loss

INFO: Starting Training Loop.
Epoch 1/450
d502f09cd598:89:185 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
d502f09cd598:89:185 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
d502f09cd598:89:185 [0] NCCL INFO P2P plugin IBext
d502f09cd598:89:185 [0] NCCL INFO NET/IB : No device found.
d502f09cd598:89:185 [0] NCCL INFO NET/IB : No device found.
d502f09cd598:89:185 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
d502f09cd598:89:185 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
d502f09cd598:90:188 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
d502f09cd598:90:188 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
d502f09cd598:90:188 [1] NCCL INFO P2P plugin IBext
d502f09cd598:90:188 [1] NCCL INFO NET/IB : No device found.
d502f09cd598:90:188 [1] NCCL INFO NET/IB : No device found.
d502f09cd598:90:188 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
d502f09cd598:90:188 [1] NCCL INFO Using network Socket
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Channel 00/02 : 0 1
d502f09cd598:89:185 [0] NCCL INFO Channel 01/02 : 0 1
d502f09cd598:89:185 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
d502f09cd598:90:188 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:90:188 [1] NCCL INFO Channel 00 : 1[a000] -> 0[4000] via direct shared memory
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:90:188 [1] NCCL INFO Channel 01 : 1[a000] -> 0[4000] via direct shared memory
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Channel 00 : 0[4000] -> 1[a000] via direct shared memory
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Channel 01 : 0[4000] -> 1[a000] via direct shared memory
d502f09cd598:90:188 [1] NCCL INFO Connected all rings
d502f09cd598:90:188 [1] NCCL INFO Connected all trees
d502f09cd598:90:188 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
d502f09cd598:90:188 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
d502f09cd598:89:185 [0] NCCL INFO Connected all rings
d502f09cd598:89:185 [0] NCCL INFO Connected all trees
d502f09cd598:89:185 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
d502f09cd598:89:185 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
d502f09cd598:90:188 [1] NCCL INFO comm 0x7fced87dca00 rank 1 nranks 2 cudaDev 1 busId a000 - Init COMPLETE
d502f09cd598:89:185 [0] NCCL INFO comm 0x7fbed87dfc60 rank 0 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
d502f09cd598:89:185 [0] NCCL INFO Launch mode Parallel
3/860 […] - ETA: 1:49:22 - loss: nan Batch 2: Invalid loss, terminating training

I am training a yolov4_tiny model.

This looks like a NaN loss. Please try setting a lower learning rate and retry.
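
For example, assuming the usual soft_start_cosine_annealing_schedule in training_config, lowering max_learning_rate would look roughly like the sketch below (the values are only an illustration, not tuned for your dataset):

learning_rate {
  soft_start_cosine_annealing_schedule {
    min_learning_rate: 1e-6
    max_learning_rate: 1e-5   # illustration only: try something noticeably lower than your current value
    soft_start: 0.3
  }
}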

Earlier, a loss value was printed before it reported the invalid loss:
3/860 […] - ETA: 1:53:10 - loss: 7588.4917Batch 3: Invalid loss, terminating training
4/860 […] - ETA: 1:25:55 - loss: 7584.3136

This is the current error after reducing the learning rate.

Do you have a value to suggest?

Can you attach the training spec file?
And how many training images are there in total?

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(180.00, 154.00),(247.00, 176.00),(325.00, 213.00)]"
  mid_anchor_shape: "[(208.00, 29.00),(70.00, 110.00),(260.00, 43.00)]"
  small_anchor_shape: "[(42.00, 43.00),(54.00, 78.00),(73.00, 69.00)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet_tiny_3l"
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 10
  num_epochs: 450
  enable_qat: true
  checkpoint_interval: 30
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-6
      max_learning_rate: 8e-5
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tao-experiments/yolo_v4_tiny/pretrained_cspdarknet_tiny/pretrained_object_detection_vcspdarknet_tiny/cspdarknet_tiny.hdf5"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 10
  matching_iou_threshold: 0.6
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  force_on_cpu: true
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 416
  output_height: 416
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/*"
    image_directory_path: "/workspace/tao-experiments/data/training/"
  }
  include_difficult_in_training: true
  image_extension: "jpg"
  target_class_mapping {
    key: "face"
    value: "face"
  }
}

Total number of images: 19997

Please try a smaller batch size.
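
For example, in training_config, something like the line below (the exact value is only an illustration; keep the rest of the block unchanged). Note that with multiple GPUs the effective batch size is batch_size_per_gpu multiplied by the number of GPUs:

  batch_size_per_gpu: 4   # illustration only: any value smaller than the current one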

Nothing helps.
I tried a batch size of 1 as well.

3/8599 […] - ETA: 10:07:30 - loss: 6006.5685Batch 3: Invalid loss, terminating training
4/8599 […] - ETA: 7:38:19 - loss: nan Batch 3: Invalid loss, terminating training
/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (1.852390). Check your callbacks.
% delta_t_median)
INFO: Training finished successfully.
INFO: Training finished successfully.
/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (1.852525). Check your callbacks.
% delta_t_median)
INFO: Training loop in progress
INFO: Training loop complete.
INFO: Training finished successfully.
INFO: Training finished successfully.
2022-06-20 16:01:51,071 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Waiting for a solution…

Did you ever run the default Jupyter notebook successfully?
If yes, can you reuse its learning rate and batch size?

Additionally, please try the sequence dataset format.

I have trained a couple of models using TAO’s different Jupyter notebooks and used the default learning rate from the .txt spec files for the majority of the models I trained. I do change the batch size to whatever best suits my requirements.

What do you mean by the sequence dataset format?

I found one culprit in your spec.
For YOLOv4-tiny with the cspdarknet_tiny arch, only big_grid_xy_extend and mid_grid_xy_extend should be provided to align with the anchor shapes; with the cspdarknet_tiny_3l arch, all three should be provided.

So, please add small_grid_xy_extend and retry; see the example below.
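
For example, in yolov4_config (the 0.05 value simply mirrors the other two entries in your spec and is an illustration, not a tuned setting):

  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  small_grid_xy_extend: 0.05   # newly added for the cspdarknet_tiny_3l arch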

For the sequence data format, refer to the YOLOv4-tiny — TAO Toolkit 3.22.05 documentation.
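
For reference, a dataset_config in the sequence format looks roughly like the sketch below; the directory paths are placeholders for your own layout, and validation points at explicit folders via validation_data_sources, since validation_fold only works with TFRecords:

dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tao-experiments/data/training/label"
    image_directory_path: "/workspace/tao-experiments/data/training/image"
  }
  include_difficult_in_training: true
  image_extension: "jpg"
  target_class_mapping {
    key: "face"
    value: "face"
  }
  validation_data_sources: {
    label_directory_path: "/workspace/tao-experiments/data/val/label"
    image_directory_path: "/workspace/tao-experiments/data/val/image"
  }
}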

Hi, the sequence data format gives me the following error:
INFO: Validation dataset specified by validation_fold requires the training label format to be TFRecords.
Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 145, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 707, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 695, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 141, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 126, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 63, in run_experiment
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/utils/spec_loader.py”, line 57, in load_experiment_spec
AssertionError: Validation dataset specified by validation_fold requires the training label format to be TFRecords.
Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 145, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 707, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 695, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 141, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 126, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 63, in run_experiment
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/utils/spec_loader.py”, line 57, in load_experiment_spec
AssertionError: Validation dataset specified by validation_fold requires the training label format to be TFRecords.

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[19457,1],1]
Exit code: 1

2022-06-21 14:21:15,559 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Adding small_grid_xy_extend to the .txt file does nothing; it still throws the same error.

Now I am facing the same issue even for classification.
edcda39c77a8:158:253 [1] NCCL INFO comm 0x7faf387dbc80 rank 1 nranks 2 cudaDev 1 busId a000 - Init COMPLETE
edcda39c77a8:157:256 [0] NCCL INFO comm 0x7fd7c07e5810 rank 0 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
edcda39c77a8:157:256 [0] NCCL INFO Launch mode Parallel
1/3633 […] - ETA: 4:38:20 - loss: 2.9416 - acc: 0.0000e+00WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2022-06-21 08:18:43,918 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

204/3633 [>…] - ETA: 3:17 - loss: nan - acc: 0.0539^C

According to the above logs, these are not all the same issue.

Please share the spec file you used with the sequence data format.

Hi, I am facing this issue while using multiple GPUs.
Do you know what is causing this? Earlier I trained on multiple GPUs and never faced this issue.

So, do you mean:

  • you get a NaN loss while training with multiple GPUs?
  • you could train yolov4_tiny with multiple GPUs earlier, but hit this issue this week?

Yes, you got it right.

I don’t know what happened all of a sudden. Also, when I run the Jupyter notebook, the container just stops once I run any command in the notebook.
Is there a fix for this?