Yolo_v4_tiny randomly stops the Docker container during the second or third validation phase, with no errors

I’m running yolo_v4_tiny and saving a checkpoint at each epoch. During the second epoch’s validation, the Docker container stops with no error message. I can then resume training from the latest checkpoint, but it passes exactly two more epochs and crashes during the validation phase again, and this repeats.

• Running on a cloud virtual machine
• Network type: Yolo_v4_tiny
Training spec file:

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(273.00, 289.47),(325.65, 320.67),(324.02, 392.31)]"
  mid_anchor_shape: "[(272.03, 220.71),(250.57, 249.02),(303.55, 256.53)]"
  box_matching_iou: 0.25
  matching_neutral_box_iou: 0.5
  arch: "cspdarknet_tiny"
  loss_loc_weight: 1.0
  loss_neg_obj_weights: 1.0
  loss_class_weights: 1.0
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.05
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  visualizer {
      enabled: False
      num_images: 3
  }
  batch_size_per_gpu: 16
  num_epochs: 80
  enable_qat: false
  checkpoint_interval: 1
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  pretrain_model_path: "/workspace/tao-experiments/yolo_v4_tiny/pretrained_cspdarknet_tiny/pretrained_object_detection_vcspdarknet_tiny/cspdarknet_tiny.hdf5"
  resume_model_path: "/workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned_v2_small/weights/yolov4_cspdarknet_tiny_epoch_010.tlt"
}

eval_config {
  average_precision_mode: SAMPLE
  batch_size: 8
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  force_on_cpu: true
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 960
  output_height: 544
  output_channel: 3
  randomize_input_shape_period: 10
  mosaic_prob: 0.5
  mosaic_min_ratio: 0.2
}
dataset_config {
  data_sources: {
      tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/small-yolo/train*"
      image_directory_path: "/workspace/tao-experiments/data/car_detector/car_detector_dataset_small_v2/train"
  }
  include_difficult_in_training: true
  image_extension: "jpg"
  target_class_mapping {
      key: "car"
      value: "car"
  }
  target_class_mapping {
      key: "van"
      value: "car"
  }
  validation_data_sources: {
      tfrecords_path: "/workspace/tao-experiments/data/training/tfrecords/small-yolo/test*"
      image_directory_path: "/workspace/tao-experiments/data/car_detector/car_detector_dataset_small_v2/test"
  }
}

Training log:

INFO: Starting Training Loop.
Epoch 1/80
  2/348 [..............................] - ETA: 2:15:32 - loss: 5216.3418/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (2.660237). Check your callbacks.
  % delta_t_median)
348/348 [==============================] - 1127s 3s/step - loss: 5652.0640
d9493ce8e4ae:54:85 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.5<0>
d9493ce8e4ae:54:85 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
d9493ce8e4ae:54:85 [0] NCCL INFO P2P plugin IBext
d9493ce8e4ae:54:85 [0] NCCL INFO NET/IB : No device found.
d9493ce8e4ae:54:85 [0] NCCL INFO NET/IB : No device found.
d9493ce8e4ae:54:85 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.5<0>
d9493ce8e4ae:54:85 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 00/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 01/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 02/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 03/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 04/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 05/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 06/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 07/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 08/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 09/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 10/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 11/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 12/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 13/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 14/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 15/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 16/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 17/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 18/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 19/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 20/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 21/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 22/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 23/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 24/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 25/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 26/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 27/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 28/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 29/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 30/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Channel 31/32 :    0
d9493ce8e4ae:54:85 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
d9493ce8e4ae:54:85 [0] NCCL INFO Connected all rings
d9493ce8e4ae:54:85 [0] NCCL INFO Connected all trees
d9493ce8e4ae:54:85 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
d9493ce8e4ae:54:85 [0] NCCL INFO comm 0x7ffad8791d00 rank 0 nranks 1 cudaDev 0 busId 40 - Init COMPLETE
INFO: Training loop in progress
Epoch 2/80
348/348 [==============================] - 1074s 3s/step - loss: 2554.9639
Producing predictions: 100%|██████████████████| 224/224 [01:43<00:00,  2.15it/s]
Start to calculate AP for each class
*******************************
car           AP    0.01674
              mAP   0.01674
*******************************
Validation loss: 2729.0912551879883

Epoch 00002: saving model to /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned_v2_small/weights/yolov4_cspdarknet_tiny_epoch_002.tlt
INFO: Training loop in progress
Epoch 3/80
348/348 [==============================] - 1057s 3s/step - loss: 1508.3203
INFO: Training loop in progress
Epoch 4/80
348/348 [==============================] - 1089s 3s/step - loss: 929.8874
Producing predictions:  12%|██▏                | 26/224 [00:40<05:11,  1.57s/it]2022-08-08 22:26:13,946 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

After running again from the latest checkpoint:

INFO: Starting Training Loop.
Epoch 4/80
  2/348 [..............................] - ETA: 2:22:18 - loss: 432.2330/usr/local/lib/python3.6/dist-packages/keras/callbacks.py:122: UserWarning: Method on_batch_end() is slow compared to the batch update (4.146950). Check your callbacks.
  % delta_t_median)
348/348 [==============================] - 1124s 3s/step - loss: 436.2395
d816db1008cd:55:86 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.10<0>
d816db1008cd:55:86 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
d816db1008cd:55:86 [0] NCCL INFO P2P plugin IBext
d816db1008cd:55:86 [0] NCCL INFO NET/IB : No device found.
d816db1008cd:55:86 [0] NCCL INFO NET/IB : No device found.
d816db1008cd:55:86 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.10<0>
d816db1008cd:55:86 [0] NCCL INFO Using network Socket
NCCL version 2.11.4+cuda11.6
d816db1008cd:55:86 [0] NCCL INFO Channel 00/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 01/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 02/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 03/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 04/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 05/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 06/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 07/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 08/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 09/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 10/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 11/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 12/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 13/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 14/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 15/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 16/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 17/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 18/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 19/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 20/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 21/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 22/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 23/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 24/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 25/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 26/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 27/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 28/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 29/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 30/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Channel 31/32 :    0
d816db1008cd:55:86 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 [2] -1/-1/-1->0->-1 [3] -1/-1/-1->0->-1 [4] -1/-1/-1->0->-1 [5] -1/-1/-1->0->-1 [6] -1/-1/-1->0->-1 [7] -1/-1/-1->0->-1 [8] -1/-1/-1->0->-1 [9] -1/-1/-1->0->-1 [10] -1/-1/-1->0->-1 [11] -1/-1/-1->0->-1 [12] -1/-1/-1->0->-1 [13] -1/-1/-1->0->-1 [14] -1/-1/-1->0->-1 [15] -1/-1/-1->0->-1 [16] -1/-1/-1->0->-1 [17] -1/-1/-1->0->-1 [18] -1/-1/-1->0->-1 [19] -1/-1/-1->0->-1 [20] -1/-1/-1->0->-1 [21] -1/-1/-1->0->-1 [22] -1/-1/-1->0->-1 [23] -1/-1/-1->0->-1 [24] -1/-1/-1->0->-1 [25] -1/-1/-1->0->-1 [26] -1/-1/-1->0->-1 [27] -1/-1/-1->0->-1 [28] -1/-1/-1->0->-1 [29] -1/-1/-1->0->-1 [30] -1/-1/-1->0->-1 [31] -1/-1/-1->0->-1
d816db1008cd:55:86 [0] NCCL INFO Connected all rings
d816db1008cd:55:86 [0] NCCL INFO Connected all trees
d816db1008cd:55:86 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
d816db1008cd:55:86 [0] NCCL INFO comm 0x7f47b8792ef0 rank 0 nranks 1 cudaDev 0 busId 40 - Init COMPLETE
Producing predictions: 100%|██████████████████| 224/224 [01:40<00:00,  2.24it/s]
Start to calculate AP for each class
*******************************
car           AP    0.10786
              mAP   0.10786
*******************************
Validation loss: 347.53958647591725

Epoch 00004: saving model to /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned_v2_small/weights/yolov4_cspdarknet_tiny_epoch_004.tlt
INFO: Training loop in progress
Epoch 5/80
348/348 [==============================] - 1071s 3s/step - loss: 306.5322
Producing predictions: 100%|██████████████████| 224/224 [01:39<00:00,  2.25it/s]
Start to calculate AP for each class
*******************************
car           AP    0.04982
              mAP   0.04982
*******************************
Validation loss: 253.0957134791783

Epoch 00005: saving model to /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned_v2_small/weights/yolov4_cspdarknet_tiny_epoch_005.tlt
INFO: Training loop in progress
Epoch 6/80
348/348 [==============================] - 1057s 3s/step - loss: 226.9619
Producing predictions:  12%|██▍                | 28/224 [00:44<05:13,  1.60s/it]2022-08-09 16:14:47,423 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Please check if there is an OOM (out of memory) condition.
You can set a lower batch size and retry.
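One way to check for this on the host, outside the container, is to look at the kernel log and at the container’s exit status (a sketch assuming Docker on a Linux VM; <container_id> is whatever docker ps -a reports for the stopped TAO container):

$ dmesg -T | grep -i -E "out of memory|oom-kill|killed process"
$ docker inspect <container_id> --format '{{.State.OOMKilled}} exit={{.State.ExitCode}}'

If OOMKilled is true, or the exit code is 137, the container was killed for running out of memory rather than failing inside the training code.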

Where can I check that? I looked into status.json and yolov4_training_log_cspdarknet_tiny.csv and didn’t see any errors or anything indicating something was wrong.

You can run inside the docker container.
$ tao yolo_v4_tiny run /bin/bash
Then, inside the container,
# yolo_v4_tiny train xxx

Inside the container I get bash: tao: command not found.
Do I need to do additional steps after running tao yolo_v4_tiny run /bin/bash? I’ve never run TAO inside the container before; I have always used the notebooks.

I tried to reduce the batch size for both training and validation when running in the notebook, but the issue still persists.

First, open a terminal and run the command below to start the docker container.
$ tao yolo_v4_tiny run /bin/bash

Then, inside the container,
# yolo_v4_tiny train xxx

I ran $ tao yolo_v4_tiny run /bin/bash in a terminal; the container started and opened in the workspace folder. Inside it I ran the full yolo_v4_tiny train command and got the error bash: tao: command not found.

Inside the container, you do not need to use “tao”. Just run it as below.
# yolo_v4_tiny train xxx

Sorry, I didn’t notice the ‘tao’ part. I managed to run it that way, but there is nothing new and no error; training stopped again after this. I can only see the end of the log; here it is:

7eb78bd14e0e:185:216 [0] NCCL INFO Connected all rings
7eb78bd14e0e:185:216 [0] NCCL INFO Connected all trees
7eb78bd14e0e:185:216 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
7eb78bd14e0e:185:216 [0] NCCL INFO comm 0x7f5c94794630 rank 0 nranks 1 cudaDev 0 busId 40 - Init COMPLETE
Producing predictions: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 885/885 [03:00<00:00,  4.91it/s]
Start to calculate AP for each class
*******************************
car           AP    0.84141
              mAP   0.84141
*******************************
Validation loss: 22.364267892352604

Epoch 00011: saving model to /workspace/tao-experiments/yolo_v4_tiny/experiment_dir_unpruned_v2_medium/weights/yolov4_cspdarknet_tiny_epoch_011.tlt
INFO: Training loop in progress
Epoch 12/50
689/689 [==============================] - 2103s 3s/step - loss: 24.5639
Producing predictions:   3%|█████▊     
#       

Is it 100% reproducible at “Producing predictions: 3%”?

Please use a small part of the validation images and retry.

Is it 100% reproducible at “Producing predictions: 3%”?

Sometimes it crashes around 2-3%, sometimes at 12%, sometimes at 36%

Training didn’t stop at the validation step when I ran validation on a dataset of 12 images. But why is that? Do you have any ideas for how I could use a bigger validation set during training?

So, please use the bisection method to check which images are the culprit.

How could there be any bad or corrupt images when the first validation step passes every time?

Assume you have 1000 images; you can proceed as below.

  1. Select 500 images and check if the issue can be reproduced.
  2. If not, use the other 500 images and check if the issue can be reproduced.

You can use "tao yolo_v4_tiny evaluate xxx" to test directly.
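A minimal sketch of such a split (the paths below are hypothetical; adjust them to your dataset layout, and regenerate the validation tfrecords for whichever half you test):

$ ls /data/val/images | sort > val_images.txt
$ split -n l/2 val_images.txt val_half_

This produces val_half_aa and val_half_ab; copy each half of the images (plus the matching label files) into its own directory, convert it to tfrecords, and point validation_data_sources at one half at a time.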

Whether I run evaluation on all of the images in the dataset or only on parts of it doesn’t matter; the output is always the same. It is always correct and never crashes.
I cannot reproduce the issue in evaluation.

Thanks for the info.
May I know the NVIDIA driver info? Please share the $ nvidia-smi result with me.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    52W / 300W |   7414MiB / 16384MiB |     23%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4264      C   /usr/bin/python3.6                303MiB |
|    0   N/A  N/A      4300      C   python3.6                        7109MiB |
+-----------------------------------------------------------------------------+
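Since the failure happens only during the validation phase, it may also help to log GPU and host memory over time while validation runs and see whether either climbs until the container stops (a sketch assuming standard tools on the VM):

$ nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 5 > gpu_mem.log &
$ free -m -s 5 > host_mem.log &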

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks.

Could you set the option below to false in nms_config and retry?
force_on_cpu: false
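For reference, that corresponds to this change in the nms_config block of the training spec above (only force_on_cpu changes, so that NMS runs on the GPU):

nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  force_on_cpu: false
  top_k: 200
}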