Multi-GPU and invalid loss

Hi,
3/8599 […] - ETA: 10:07:30 - loss: 6006.5685Batch 3: Invalid loss, terminating training
4/8599 […] - ETA: 7:38:19 - loss: nan Batch 3: Invalid loss, terminating training
/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (1.852390). Check your callbacks.
% delta_t_median)
INFO: Training finished successfully.
INFO: Training finished successfully.
/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (1.852525). Check your callbacks.
% delta_t_median)
INFO: Training loop in progress
INFO: Training loop complete.
INFO: Training finished successfully.
INFO: Training finished successfully.
2022-06-20 16:01:51,071 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

This issue is not yet fixed… I don’t get this issue when running on a single GPU.


Can you share the latest training spec again? Full command and full log are also appreciated. Thanks.

********************ERROR MESSAGE
INFO: Starting Training Loop.
Epoch 1/450
c0e21c462967:89:185 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.13<0>
c0e21c462967:89:185 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
c0e21c462967:89:185 [0] NCCL INFO P2P plugin IBext
c0e21c462967:89:185 [0] NCCL INFO NET/IB : No device found.
c0e21c462967:89:185 [0] NCCL INFO NET/IB : No device found.
c0e21c462967:89:185 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.13<0>
c0e21c462967:89:185 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
c0e21c462967:90:187 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.13<0>
c0e21c462967:90:187 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
c0e21c462967:90:187 [1] NCCL INFO P2P plugin IBext
c0e21c462967:90:187 [1] NCCL INFO NET/IB : No device found.
c0e21c462967:90:187 [1] NCCL INFO NET/IB : No device found.
c0e21c462967:90:187 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.13<0>
c0e21c462967:90:187 [1] NCCL INFO Using network Socket
c0e21c462967:89:185 [0] NCCL INFO Channel 00/02 : 0 1
c0e21c462967:89:185 [0] NCCL INFO Channel 01/02 : 0 1
c0e21c462967:89:185 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
c0e21c462967:90:187 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
c0e21c462967:89:185 [0] NCCL INFO Channel 00 : 0[4000] -> 1[7000] via P2P/IPC
c0e21c462967:89:185 [0] NCCL INFO Channel 01 : 0[4000] -> 1[7000] via P2P/IPC
c0e21c462967:90:187 [1] NCCL INFO Channel 00 : 1[7000] -> 0[4000] via P2P/IPC
c0e21c462967:90:187 [1] NCCL INFO Channel 01 : 1[7000] -> 0[4000] via P2P/IPC
c0e21c462967:89:185 [0] NCCL INFO Connected all rings
c0e21c462967:90:187 [1] NCCL INFO Connected all rings
c0e21c462967:90:187 [1] NCCL INFO Connected all trees
c0e21c462967:90:187 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
c0e21c462967:90:187 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
c0e21c462967:89:185 [0] NCCL INFO Connected all trees
c0e21c462967:89:185 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
c0e21c462967:89:185 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
c0e21c462967:89:185 [0] NCCL INFO comm 0x7f171835d000 rank 0 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
c0e21c462967:90:187 [1] NCCL INFO comm 0x7efec435aad0 rank 1 nranks 2 cudaDev 1 busId 7000 - Init COMPLETE
c0e21c462967:89:185 [0] NCCL INFO Launch mode Parallel
11/1026 […] - ETA: 1:25:24 - loss: 12770.1574Batch 11: Invalid loss, terminating training
12/1026 […] - ETA: 1:19:55 - loss: 12856.1866

******************************** TRAINING SPEC FILE
random_seed: 42
yolov4_config {
  big_anchor_shape: "[(40.00, 112.00),(72.00, 201.00),(160.00, 320.00)]"
  mid_anchor_shape: "[(22.00, 32.00),(20.00, 70.00),(35.00, 53.00)]"
  small_anchor_shape: "[(10.00, 13.00),(14.00, 21.00),(12.00, 39.00)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet_tiny_3l"
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  small_grid_xy_extend: 0.05
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  visualizer {
    enabled: false
    num_images: 3
  }
  batch_size_per_gpu: 16
  num_epochs: 450
  enable_qat: true
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tao-experiments/yolo_v4_tiny/pretrained_cspdarknet_tiny/pretrained_object_detection_vcspdarknet_tiny/cspdarknet_tiny.hdf5"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 8
  matching_iou_threshold: 0.6
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  force_on_cpu: true
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 640
  output_height: 480
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/*"
    image_directory_path: "/workspace/tao-experiments/data/training"
  }
  include_difficult_in_training: true
  image_extension: "jpg"
  target_class_mapping {
    key: "person"
    value: "person"
  }
  target_class_mapping {
    key: "face"
    value: "face"
  }
}

***************************************** COMMAND

print("To run with multigpu, please change --gpus based on the number of available GPUs in your machine.")
!tao yolo_v4_tiny train -e $SPECS_DIR/yolo_v4_tiny_train_kitti.txt \
                        -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                        -k $KEY \
                        --gpus 2 --gpu_index 0 1

This issue was not solved before.

To narrow down, please try arch: "cspdarknet_tiny" as well. Remember to delete small_anchor_shape and small_grid_xy_extend (see the sketch below).
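
For reference, a minimal sketch of the resulting yolov4_config block. The anchor values are simply the big/mid anchors copied from the spec above and are an assumption here; recompute anchors for your own dataset:

yolov4_config {
  big_anchor_shape: "[(40.00, 112.00),(72.00, 201.00),(160.00, 320.00)]"
  mid_anchor_shape: "[(22.00, 32.00),(20.00, 70.00),(35.00, 53.00)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet_tiny"
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  freeze_bn: false
  force_relu: false
}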

I also wanted to let you know that the GPUs are being used, but training still fails with the error message below:

INFO: Starting Training Loop.
Epoch 1/450
b2ed001bbae6:89:188 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.18<0>
b2ed001bbae6:89:188 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
b2ed001bbae6:89:188 [0] NCCL INFO P2P plugin IBext
b2ed001bbae6:89:188 [0] NCCL INFO NET/IB : No device found.
b2ed001bbae6:89:188 [0] NCCL INFO NET/IB : No device found.
b2ed001bbae6:89:188 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.18<0>
b2ed001bbae6:89:188 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
b2ed001bbae6:90:185 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.18<0>
b2ed001bbae6:90:185 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
b2ed001bbae6:90:185 [1] NCCL INFO P2P plugin IBext
b2ed001bbae6:90:185 [1] NCCL INFO NET/IB : No device found.
b2ed001bbae6:90:185 [1] NCCL INFO NET/IB : No device found.
b2ed001bbae6:90:185 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.18<0>
b2ed001bbae6:90:185 [1] NCCL INFO Using network Socket
b2ed001bbae6:89:188 [0] NCCL INFO Channel 00/02 : 0 1
b2ed001bbae6:89:188 [0] NCCL INFO Channel 01/02 : 0 1
b2ed001bbae6:89:188 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
b2ed001bbae6:90:185 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
b2ed001bbae6:89:188 [0] NCCL INFO Channel 00 : 0[4000] -> 1[7000] via P2P/IPC
b2ed001bbae6:89:188 [0] NCCL INFO Channel 01 : 0[4000] -> 1[7000] via P2P/IPC
b2ed001bbae6:90:185 [1] NCCL INFO Channel 00 : 1[7000] -> 0[4000] via P2P/IPC
b2ed001bbae6:90:185 [1] NCCL INFO Channel 01 : 1[7000] -> 0[4000] via P2P/IPC
b2ed001bbae6:90:185 [1] NCCL INFO Connected all rings
b2ed001bbae6:90:185 [1] NCCL INFO Connected all trees
b2ed001bbae6:90:185 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
b2ed001bbae6:90:185 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
b2ed001bbae6:89:188 [0] NCCL INFO Connected all rings
b2ed001bbae6:89:188 [0] NCCL INFO Connected all trees
b2ed001bbae6:89:188 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
b2ed001bbae6:89:188 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
b2ed001bbae6:90:185 [1] NCCL INFO comm 0x7f565035a000 rank 1 nranks 2 cudaDev 1 busId 7000 - Init COMPLETE
b2ed001bbae6:89:188 [0] NCCL INFO comm 0x7f550035e000 rank 0 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
b2ed001bbae6:89:188 [0] NCCL INFO Launch mode Parallel
11/1026 […] - ETA: 1:16:58 - loss: nan Batch 10: Invalid loss, terminating training

To narrow down, could you try the spec file from the Jupyter notebook and train with the public KITTI dataset mentioned in the notebook? The samples can be fetched with:

wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/cv_samples/versions/v1.4.0/zip -O cv_samples_v1.4.0.zip
unzip -u cv_samples_v1.4.0.zip -d ./cv_samples_v1.4.0 && rm -rf cv_samples_v1.4.0.zip && cd ./cv_samples_v1.4.0

INFO: Starting Training Loop.
Epoch 1/80
7f16530e3fda:90:186 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.19<0>
7f16530e3fda:90:186 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
7f16530e3fda:90:186 [0] NCCL INFO P2P plugin IBext
7f16530e3fda:90:186 [0] NCCL INFO NET/IB : No device found.
7f16530e3fda:90:186 [0] NCCL INFO NET/IB : No device found.
7f16530e3fda:90:186 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.19<0>
7f16530e3fda:90:186 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
7f16530e3fda:91:188 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.19<0>
7f16530e3fda:91:188 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
7f16530e3fda:91:188 [1] NCCL INFO P2P plugin IBext
7f16530e3fda:91:188 [1] NCCL INFO NET/IB : No device found.
7f16530e3fda:91:188 [1] NCCL INFO NET/IB : No device found.
7f16530e3fda:91:188 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.19<0>
7f16530e3fda:91:188 [1] NCCL INFO Using network Socket
7f16530e3fda:90:186 [0] NCCL INFO Channel 00/02 : 0 1
7f16530e3fda:90:186 [0] NCCL INFO Channel 01/02 : 0 1
7f16530e3fda:90:186 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
7f16530e3fda:91:188 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
7f16530e3fda:90:186 [0] NCCL INFO Channel 00 : 0[4000] -> 1[7000] via P2P/IPC
7f16530e3fda:91:188 [1] NCCL INFO Channel 00 : 1[7000] -> 0[4000] via P2P/IPC
7f16530e3fda:90:186 [0] NCCL INFO Channel 01 : 0[4000] -> 1[7000] via P2P/IPC
7f16530e3fda:91:188 [1] NCCL INFO Channel 01 : 1[7000] -> 0[4000] via P2P/IPC
7f16530e3fda:91:188 [1] NCCL INFO Connected all rings
7f16530e3fda:91:188 [1] NCCL INFO Connected all trees
7f16530e3fda:91:188 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
7f16530e3fda:91:188 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
7f16530e3fda:90:186 [0] NCCL INFO Connected all rings
7f16530e3fda:90:186 [0] NCCL INFO Connected all trees
7f16530e3fda:90:186 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
7f16530e3fda:90:186 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
7f16530e3fda:91:188 [1] NCCL INFO comm 0x7f038835f1c0 rank 1 nranks 2 cudaDev 1 busId 7000 - Init COMPLETE
7f16530e3fda:90:186 [0] NCCL INFO comm 0x7f672c3605d0 rank 0 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
7f16530e3fda:90:186 [0] NCCL INFO Launch mode Parallel
Batch 1: Invalid loss, terminating training- ETA: 58:36:31 - loss: 5054.21
2/3741 […] - ETA: 109:00:03 - loss: nan Batch 1: Invalid loss, terminating training
INFO: Training loop in progress
INFO: Training loop complete.
INFO: Training finished successfully.
INFO: Training finished successfully.
INFO: Training finished successfully.
INFO: Training finished successfully.
2022-07-01 16:49:32,831 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

It is not expected. Can you share your training spec?

Try increasing the output resolution.
I got this issue when training at a low resolution with AMP (--use_amp) enabled.
Try the default output resolution.
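
For reference, "output" here means the output_width/output_height pair in augmentation_config. A fragment only, showing the 1248x384 KITTI-resolution values that also appear in the spec posted later in this thread; whether they suit your own data is an assumption:

augmentation_config {
  # only the resolution-related fields are shown
  output_width: 1248
  output_height: 384
  output_channel: 3
}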

Hi, I have the same issue with the SSD model, using 2 GPU cards in multi-GPU mode with TAO:

fc0b4f9ebed1:100:231 [0] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
fc0b4f9ebed1:101:228 [1] NCCL INFO comm 0x7f8c107e3520 rank 1 nranks 2 cudaDev 1 busId 4b000 - Init COMPLETE
fc0b4f9ebed1:100:231 [0] NCCL INFO comm 0x7f686c7e4f80 rank 0 nranks 2 cudaDev 0 busId 1000 - Init COMPLETE
fc0b4f9ebed1:100:231 [0] NCCL INFO Launch mode Parallel
1/108 […] - ETA: 41:44 - loss: 51.5332WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

2022-07-01 23:14:56,184 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py:186: The name tf.Summary is deprecated. Please use tf.compat.v1.Summary instead.

13/108 [==>…] - ETA: 4:20 - loss: 19388248.4332Batch 13: Invalid loss, terminating training
Maybe this is not a model issue, because my model worked before in multi-GPU mode.

@christian41
Please create a new topic for your case. And share your training spec, training command and full log in that topic as well. Thanks.

okay!

***************************Training spec

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(71.31, 41.96),(123.55, 80.05),(257.84, 171.25)]"
  mid_anchor_shape: "[(18.88, 17.11),(38.78, 26.77),(30.48, 71.27)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet_tiny"
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  visualizer {
    enabled: false
    num_images: 3
  }
  batch_size_per_gpu: 1
  num_epochs: 80
  enable_qat: false
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tao-experiments/yolo_v4_tiny/pretrained_cspdarknet_tiny/pretrained_object_detection_vcspdarknet_tiny/cspdarknet_tiny.hdf5"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 8
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  force_on_cpu: true
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 1248
  output_height: 384
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/*"
    image_directory_path: "/workspace/tao-experiments/data/training"
  }
  include_difficult_in_training: true
  image_extension: "png"
  target_class_mapping {
    key: "car"
    value: "car"
  }
  target_class_mapping {
    key: "truck"
    value: "truck"
  }
  target_class_mapping {
    key: "dontcare"
    value: "dontcare"
  }
  target_class_mapping {
    key: "pedestrian"
    value: "pedestrian"
  }
  target_class_mapping {
    key: "cyclist"
    value: "cyclist"
  }
  target_class_mapping {
    key: "van"
    value: "van"
  }
  target_class_mapping {
    key: "tram"
    value: "tram"
  }
  target_class_mapping {
    key: "misc"
    value: "misc"
  }
  target_class_mapping {
    key: "person_sitting"
    value: "person_sitting"
  }
  validation_data_sources: {
    tfrecords_path: "/workspace/tao-experiments/data/val/tfrecords/*"
    image_directory_path: "/workspace/tao-experiments/data/val"
  }
}

Hi,
Could you try the following instead? Log in to the docker container and run the training there. I want to check whether the issue is related to the TAO launcher.

  1. Open a terminal.
  2. $ tao yolo_v4_tiny run /bin/bash
  3. Check the GPUs via "nvidia-smi".
  4. Then run:
    yolo_v4_tiny train -e xxx -r xxx -k xxx --gpus 2 --gpu_index 0 1
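
For example, with the container-side paths used elsewhere in this thread (the spec filename and results directory below are assumptions; substitute your own):

    yolo_v4_tiny train -e /workspace/tao-experiments/yolo_v4_tiny/specs/yolo_v4_tiny_train_kitti.txt \
                       -r /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned \
                       -k <your_key> \
                       --gpus 2 --gpu_index 0 1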

Please share all the log with me. Thanks a lot!

Also, you mentioned above that the NaN loss happens even with the public KITTI dataset. Did you ever try the detectnet_v2 network? Just to narrow down.
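If you do try detectnet_v2, the launcher invocation has the same shape. A sketch only; the spec filename and the model name follow the sample-notebook convention and are assumptions here:

    tao detectnet_v2 train -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
                           -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                           -k $KEY \
                           -n resnet18_detector \
                           --gpus 2 --gpu_index 0 1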

I haven’t tried detectnet_v2.
I am currently using yolo_v4_tiny.

OK. Today I used 2 GPUs to run yolo_v4_tiny training with your spec file (Multi GPU's and invalid loss - #15 by rishika.v) but still cannot reproduce the error. This training runs against the public KITTI dataset.

I suggest you regenerate the tfrecords files (a sketch of the conversion command is at the end of this post), and also:

  1. Open a terminal.
  2. $ tao yolo_v4_tiny run /bin/bash
  3. Check the GPUs via "nvidia-smi".
  4. Then run:
    yolo_v4_tiny train -e xxx -r xxx -k xxx --gpus 2 --gpu_index 0 1

Please share all the log with me. Thanks a lot!
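
For regenerating the tfrecords, a minimal sketch of the conversion step from the notebook workflow (the dataset-export spec filename and the output path are assumptions; use the ones from your own notebook):

    tao yolo_v4_tiny dataset_convert -d $SPECS_DIR/yolo_v4_tiny_tfrecords_kitti_train.txt \
                                     -o $DATA_DOWNLOAD_DIR/training/tfrecords/train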

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.