Invalid Loss

INFO: Starting Training Loop.
Epoch 1/450
d502f09cd598:89:185 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
d502f09cd598:89:185 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
d502f09cd598:89:185 [0] NCCL INFO P2P plugin IBext
d502f09cd598:89:185 [0] NCCL INFO NET/IB : No device found.
d502f09cd598:89:185 [0] NCCL INFO NET/IB : No device found.
d502f09cd598:89:185 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
d502f09cd598:89:185 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
d502f09cd598:90:188 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
d502f09cd598:90:188 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
d502f09cd598:90:188 [1] NCCL INFO P2P plugin IBext
d502f09cd598:90:188 [1] NCCL INFO NET/IB : No device found.
d502f09cd598:90:188 [1] NCCL INFO NET/IB : No device found.
d502f09cd598:90:188 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
d502f09cd598:90:188 [1] NCCL INFO Using network Socket
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Channel 00/02 : 0 1
d502f09cd598:89:185 [0] NCCL INFO Channel 01/02 : 0 1
d502f09cd598:89:185 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
d502f09cd598:90:188 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:90:188 [1] NCCL INFO Channel 00 : 1[a000] -> 0[4000] via direct shared memory
d502f09cd598:90:188 [1] NCCL INFO Could not enable P2P between dev 1(=a000) and dev 0(=4000)
d502f09cd598:90:188 [1] NCCL INFO Channel 01 : 1[a000] -> 0[4000] via direct shared memory
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Channel 00 : 0[4000] -> 1[a000] via direct shared memory
d502f09cd598:89:185 [0] NCCL INFO Could not enable P2P between dev 0(=4000) and dev 1(=a000)
d502f09cd598:89:185 [0] NCCL INFO Channel 01 : 0[4000] -> 1[a000] via direct shared memory
d502f09cd598:90:188 [1] NCCL INFO Connected all rings
d502f09cd598:90:188 [1] NCCL INFO Connected all trees
d502f09cd598:90:188 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
d502f09cd598:90:188 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
d502f09cd598:89:185 [0] NCCL INFO Connected all rings
d502f09cd598:89:185 [0] NCCL INFO Connected all trees
d502f09cd598:89:185 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
d502f09cd598:89:185 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
d502f09cd598:90:188 [1] NCCL INFO comm 0x7fced87dca00 rank 1 nranks 2 cudaDev 1 busId a000 - Init COMPLETE
d502f09cd598:89:185 [0] NCCL INFO comm 0x7fbed87dfc60 rank 0 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
d502f09cd598:89:185 [0] NCCL INFO Launch mode Parallel
3/860 […] - ETA: 1:49:22 - loss: nan Batch 2: Invalid loss, terminating training

I am training a yolov4_tiny model.

This looks like a NaN loss. Please try setting a lower learning rate and retry.
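
For example, assuming the usual soft_start_cosine_annealing_schedule in training_config, lowering max_learning_rate would look roughly like the sketch below (the values are only an illustration, not tuned for your dataset):

learning_rate {
  soft_start_cosine_annealing_schedule {
    min_learning_rate: 1e-6
    max_learning_rate: 1e-5   # illustration only: try something noticeably lower than your current value
    soft_start: 0.3
  }
}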

Earlier, a loss value was printed before it reported the invalid loss:
3/860 […] - ETA: 1:53:10 - loss: 7588.4917Batch 3: Invalid loss, terminating training
4/860 […] - ETA: 1:25:55 - loss: 7584.3136

This is the current error after reducing the learning rate.

Do you have a value to suggest?

Can you attach the training spec file?
And how many training images are there in total?

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(180.00, 154.00),(247.00, 176.00),(325.00, 213.00)]"
  mid_anchor_shape: "[(208.00, 29.00),(70.00, 110.00),(260.00, 43.00)]"
  small_anchor_shape: "[(42.00, 43.00),(54.00, 78.00),(73.00, 69.00)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet_tiny_3l"
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 10
  num_epochs: 450
  enable_qat: true
  checkpoint_interval: 30
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-6
      max_learning_rate: 8e-5
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tao-experiments/yolo_v4_tiny/pretrained_cspdarknet_tiny/pretrained_object_detection_vcspdarknet_tiny/cspdarknet_tiny.hdf5"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 10
  matching_iou_threshold: 0.6
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  force_on_cpu: true
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 416
  output_height: 416
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/*"
    image_directory_path: "/workspace/tao-experiments/data/training/"
  }
  include_difficult_in_training: true
  image_extension: "jpg"
  target_class_mapping {
    key: "face"
    value: "face"
  }
}

Total number of images: 19997

Please try a smaller batch size.
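
For example, in training_config, something like the line below (the exact value is only an illustration; keep the rest of the block unchanged). Note that with multiple GPUs the effective batch size is batch_size_per_gpu multiplied by the number of GPUs:

  batch_size_per_gpu: 4   # illustration only: any value smaller than the current one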

Nothing helps.
I tried a batch size of 1 as well.

3/8599 […] - ETA: 10:07:30 - loss: 6006.5685Batch 3: Invalid loss, terminating training
4/8599 […] - ETA: 7:38:19 - loss: nan Batch 3: Invalid loss, terminating training
/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (1.852390). Check your callbacks.
% delta_t_median)
INFO: Training finished successfully.
INFO: Training finished successfully.
/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (1.852525). Check your callbacks.
% delta_t_median)
INFO: Training loop in progress
INFO: Training loop complete.
INFO: Training finished successfully.
INFO: Training finished successfully.
2022-06-20 16:01:51,071 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Waiting for a solution…

Did you ever run the default Jupyter notebook successfully?
If yes, can you reuse its learning rate and batch size?

Additionally, please try the sequence dataset format.

I have trained a couple of models using TAO’s different Jupyter notebooks and used the default learning rate from the .txt spec files for the majority of the models I trained. I do change the batch size to whatever best suits my requirements.

What do you mean by the sequence dataset format?

I found one culprit in your spec.
For YOLOv4-tiny with the cspdarknet_tiny arch, only big_grid_xy_extend and mid_grid_xy_extend should be provided to align with the anchor shapes; with the cspdarknet_tiny_3l arch, all three should be provided.

So, please add small_grid_xy_extend and retry; see the example below.
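
For example, in yolov4_config (the 0.05 value simply mirrors the other two entries in your spec and is an illustration, not a tuned setting):

  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  small_grid_xy_extend: 0.05   # newly added for the cspdarknet_tiny_3l arch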

For the sequence data format, refer to the YOLOv4-tiny — TAO Toolkit 3.22.05 documentation.
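
For reference, a dataset_config in the sequence format looks roughly like the sketch below; the directory paths are placeholders for your own layout, and validation points at explicit folders via validation_data_sources, since validation_fold only works with TFRecords:

dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tao-experiments/data/training/label"
    image_directory_path: "/workspace/tao-experiments/data/training/image"
  }
  include_difficult_in_training: true
  image_extension: "jpg"
  target_class_mapping {
    key: "face"
    value: "face"
  }
  validation_data_sources: {
    label_directory_path: "/workspace/tao-experiments/data/val/label"
    image_directory_path: "/workspace/tao-experiments/data/val/image"
  }
}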

Hi, the sequence data format gives me the following error:
INFO: Validation dataset specified by validation_fold requires the training label format to be TFRecords.
Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 145, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 707, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 695, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 141, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 126, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 63, in run_experiment
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/utils/spec_loader.py”, line 57, in load_experiment_spec
AssertionError: Validation dataset specified by validation_fold requires the training label format to be TFRecords.
Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 145, in
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 707, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py”, line 695, in return_func
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 141, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 126, in main
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py”, line 63, in run_experiment
File “/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/utils/spec_loader.py”, line 57, in load_experiment_spec
AssertionError: Validation dataset specified by validation_fold requires the training label format to be TFRecords.

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[19457,1],1]
Exit code: 1

2022-06-21 14:21:15,559 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Adding small_grid_xy_extend to the .txt file does nothing; it still throws the same error.

Now I am facing the same issue even for classification.
edcda39c77a8:158:253 [1] NCCL INFO comm 0x7faf387dbc80 rank 1 nranks 2 cudaDev 1 busId a000 - Init COMPLETE
edcda39c77a8:157:256 [0] NCCL INFO comm 0x7fd7c07e5810 rank 0 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
edcda39c77a8:157:256 [0] NCCL INFO Launch mode Parallel
1/3633 […] - ETA: 4:38:20 - loss: 2.9416 - acc: 0.0000e+00WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2022-06-21 08:18:43,918 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

204/3633 [>…] - ETA: 3:17 - loss: nan - acc: 0.0539^C

According to the above logs, these are not all the same issue.

Please share the spec file you used with the sequence data format.

Hi, I am facing this issue while using multiple GPUs.
Do you know what is causing this? Earlier I trained on multiple GPUs and never faced this issue.

So, do you mean:

  • you get a NaN loss while training with multiple GPUs?
  • you could train yolov4_tiny with multiple GPUs earlier, but hit this issue this week?

Yes, you got it right.

I don’t know what happened all of a sudden. Also, when I run the Jupyter notebook, the container just stops once I run any command in the notebook.
Is there a fix for this?