Training of YOLOv3 model gets randomly killed

Hi, I’m experiencing random kills of the training process, for example:

Epoch 51/80
293/293 [==============================] - 190s 648ms/step - loss: 0.5162
Epoch 52/80
293/293 [==============================] - 188s 640ms/step - loss: 0.4087
Epoch 53/80
293/293 [==============================] - 219s 746ms/step - loss: 0.4388
Epoch 54/80
293/293 [==============================] - 230s 784ms/step - loss: 0.4515
Epoch 55/80
149/293 [==============>...............] - ETA: 1:49 - loss: 0.3965Killed
2022-06-03 14:01:32,967 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

This can happen in early epochs (<10) or in the middle of training, as in the example above.

What I have tried:

  • Reduced the batch size from 8 to 4, 2, and 1. This seems to have no effect; the problem is still there.

  • Checked processes with nvidia-smi: no other processes were active when the error occurred (see the host-RAM monitoring sketch after this list).
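
Since nvidia-smi only shows GPU memory, I also plan to watch host RAM and swap while training runs, because a bare “Killed” with no Python traceback usually means the Linux OOM killer terminated the process. A minimal monitoring sketch (it assumes the psutil package is installed on the host):

# monitor_ram.py - log host RAM and swap usage while training runs in another terminal.
# Minimal sketch; assumes the psutil package is installed (pip install psutil).
import time
import psutil

INTERVAL_S = 5  # sampling interval in seconds

while True:
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(f"RAM used: {vm.used / 2**30:.1f}/{vm.total / 2**30:.1f} GiB ({vm.percent:.0f}%) | "
          f"swap used: {sw.used / 2**30:.1f} GiB")
    time.sleep(INTERVAL_S)

If RAM and swap climb toward 100% right before the process dies, host memory (not GPU memory) is the bottleneck.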

Possible causes:

  • Images too big? I’m testing on a small dataset of 293 sample images with a resolution of 4352x3264 (see the rough memory estimate below).
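
A rough estimate of what a single decoded image costs in host memory before it is resized to the 608x608 network input (the worker and prefetch counts below are hypothetical, for illustration only):

# Back-of-the-envelope estimate of host RAM for decoded 4352x3264 RGB images.
# The worker and prefetch numbers below are hypothetical, for illustration only.
width, height, channels = 4352, 3264, 3
bytes_per_image = width * height * channels    # ~40.6 MiB per decoded image
images_in_flight = 4 * 8                       # e.g. 4 loader processes x 8 prefetched images

print(f"one decoded image : {bytes_per_image / 2**20:.1f} MiB")
print(f"{images_in_flight} images in flight: {images_in_flight * bytes_per_image / 2**30:.2f} GiB")

By contrast, a 608x608x3 image is about 1 MiB, so the full-resolution source images, rather than the batch size, dominate the data loader’s footprint. That might explain why lowering the batch size did not help.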

What can be causing this error? Thank you.

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) NVIDIA GeForce RTX 2080 Ti

• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) Yolo_v3

• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)

Configuration of the TAO Toolkit Instance

dockers: 		
	nvidia/tao/tao-toolkit-tf: 			
		v3.21.11-tf1.15.5-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. augment
				2. bpnet
				3. classification
				4. dssd
				5. emotionnet
				6. efficientdet
				7. fpenet
				8. gazenet
				9. gesturenet
				10. heartratenet
				11. lprnet
				12. mask_rcnn
				13. multitask_classification
				14. retinanet
				15. ssd
				16. unet
				17. yolo_v3
				18. yolo_v4
				19. yolo_v4_tiny
				20. converter
		v3.21.11-tf1.15.4-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. detectnet_v2
				2. faster_rcnn
	nvidia/tao/tao-toolkit-pyt: 			
		v3.21.11-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. speech_to_text
				2. speech_to_text_citrinet
				3. text_classification
				4. question_answering
				5. token_classification
				6. intent_slot_classification
				7. punctuation_and_capitalization
				8. action_recognition
		v3.22.02-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. spectro_gen
				2. vocoder
	nvidia/tao/tao-toolkit-lm: 			
		v3.21.08-py3: 				
			docker_registry: nvcr.io
			tasks: 
				1. n_gram
format_version: 2.0
toolkit_version: 3.22.02
published_date: 02/28/2022

• Training spec file (if you have one, please share it here)

random_seed: 42
yolov3_config {
  big_anchor_shape: "[(514.17, 78.47), (220.24, 217.41), (642.07, 120.47)]"
  mid_anchor_shape: "[(294.79, 47.24), (169.48, 112.35), (156.00, 184.47)]"
  small_anchor_shape: "[(30.97, 30.12), (112.12, 67.47), (108.68, 127.88)]"
  matching_neutral_box_iou: 0.7
  arch: "resnet"
  nlayers: 18
  arch_conv_blocks: 2
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 1.0
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  use_multiprocessing: true
  batch_size_per_gpu: 1
  num_epochs: 80
  enable_qat: false
  checkpoint_interval: 10
  learning_rate {
  soft_start_annealing_schedule {
    min_learning_rate: 1e-6
    max_learning_rate: 1e-4
    soft_start: 0.1
    annealing: 0.5
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  resume_model_path: "/workspace/tao-experiments/tao_yolo_v3_01/experiment_dir_unpruned/yolov3_resnet18_epoch_040.tlt"
  # pretrain_model_path: "/workspace/tao-experiments/tao_yolo_v3_01/pretrained_resnet18/pretrained_object_detection_vresnet18/resnet_18.hdf5"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 8
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  top_k: 200
  force_on_cpu: True
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure: 1.5
  vertical_flip: 0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 608
  output_height: 608
  output_channel: 3
  randomize_input_shape_period: 0
}
dataset_config {
  data_sources: {
      tfrecords_path: "/workspace/tao-experiments/data/tfrecords/insulators_test_dataset/train/train*"
      image_directory_path: "/workspace/tao-experiments/data/insulators_test_dataset/train/"
  }
  include_difficult_in_training: true
  image_extension: "jpeg"
  target_class_mapping {
      key: "insulator"
      value: "insulator"
  }
  validation_data_sources: {
    tfrecords_path: "/workspace/tao-experiments/data/tfrecords/insulators_test_dataset/test/test*"
    image_directory_path: "/workspace/tao-experiments/data/insulators_test_dataset/test/"
  }
}

Did you ever run the default Jupyter notebook successfully? It trains on the public KITTI dataset.

Yes, I ran the entire notebook with the public KITTI dataset successfully.

Please set a smaller size and retry. The “Killed” message is usually due to OOM (out of memory).
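
For example, the dataset could be downscaled offline before regenerating the TFRecords. A minimal sketch (this is not TAO tooling; it assumes Pillow is installed, KITTI-format label files, and the placeholder folder layout below, and it scales the bounding-box columns by the same factor as the images):

# resize_dataset.py - downscale images and scale KITTI bounding boxes to match.
# Minimal sketch, not part of TAO: assumes Pillow (pip install pillow),
# KITTI-format .txt labels, and the placeholder folder layout below.
import os
from PIL import Image

SRC_IMAGES = "train/images"       # placeholder paths - adjust to the real dataset
SRC_LABELS = "train/labels"
DST_IMAGES = "train_small/images"
DST_LABELS = "train_small/labels"
SCALE = 0.25                      # 4352x3264 -> 1088x816, still above the 608x608 input

os.makedirs(DST_IMAGES, exist_ok=True)
os.makedirs(DST_LABELS, exist_ok=True)

for name in os.listdir(SRC_IMAGES):
    stem, _ = os.path.splitext(name)

    # Downscale the image.
    img = Image.open(os.path.join(SRC_IMAGES, name))
    new_size = (round(img.width * SCALE), round(img.height * SCALE))
    img.resize(new_size, Image.BILINEAR).save(os.path.join(DST_IMAGES, name))

    # Scale the bbox columns (xmin, ymin, xmax, ymax) of each KITTI label line.
    with open(os.path.join(SRC_LABELS, stem + ".txt")) as f:
        rows = [line.split() for line in f if line.strip()]
    for fields in rows:
        fields[4:8] = [f"{float(v) * SCALE:.2f}" for v in fields[4:8]]
    with open(os.path.join(DST_LABELS, stem + ".txt"), "w") as f:
        f.write("\n".join(" ".join(fields) for fields in rows) + "\n")

After resizing, the TFRecords need to be regenerated from the new folders.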

Additionally, please try using the sequence data format.
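
For reference, the sequence format points data_sources at raw KITTI images and labels instead of TFRecords, so the dataset_config would look roughly like the sketch below (paths are placeholders; please check the YOLOv3 section of the TAO documentation for the exact fields):

dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tao-experiments/data/insulators_test_dataset/train/labels"
    image_directory_path: "/workspace/tao-experiments/data/insulators_test_dataset/train/images"
  }
  include_difficult_in_training: true
  target_class_mapping {
      key: "insulator"
      value: "insulator"
  }
  validation_data_sources: {
    label_directory_path: "/workspace/tao-experiments/data/insulators_test_dataset/test/labels"
    image_directory_path: "/workspace/tao-experiments/data/insulators_test_dataset/test/images"
  }
}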

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.