Resume yolo_v4 traing - SoftStartCosineAnnealingScheduler does not support a progress value error

Jagin · July 15, 2021, 6:37pm

Hi,

I successfully trained the Yolo_v4 model with Resnet 50 backbone. I trained it 80 epochs. I want to resume the training for another 40 epochs.

Here is my setup:

• Hardware: 2 x RTX 2080 Ti
• Network Type: Yolo_v4 + Resnet50
• TLT Version

Configuration of the TLT Instance
dockers: ['nvcr.io/nvidia/tlt-streamanalytics', 'nvcr.io/nvidia/tlt-pytorch']
format_version: 1.0
tlt_version: 3.0
published_date: 02/02/2021

• Training spec file

random_seed: 42
yolov4_config {
  big_anchor_shape: "[(141.60, 117.22), (199.38, 181.77), (358.80, 307.33)]"
  mid_anchor_shape: "[(63.88, 72.37), (85.18, 58.48), (96.16, 90.22)]"
  small_anchor_shape: "[(18.65, 16.55), (38.33, 34.70), (55.88, 48.94)]"
  box_matching_iou: 0.25
  arch: "resnet"
  nlayers: 50
  arch_conv_blocks: 2
  loss_loc_weight: 0.8
  loss_neg_obj_weights: 100.0
  loss_class_weights: 0.5
  label_smoothing: 0.0
  big_grid_xy_extend: 0.05
  mid_grid_xy_extend: 0.1
  small_grid_xy_extend: 0.2
  freeze_bn: false
  #freeze_blocks: 0
  force_relu: false
}
training_config {
  batch_size_per_gpu: 2
  num_epochs: 40
  enable_qat: false
  checkpoint_interval: 10
  learning_rate {
    soft_start_cosine_annealing_schedule {
      min_learning_rate: 1e-7
      max_learning_rate: 1e-4
      soft_start: 0.3
    }
  }
  regularizer {
    type: L1
    weight: 3e-5
  }
  optimizer {
    adam {
      epsilon: 1e-7
      beta1: 0.9
      beta2: 0.999
      amsgrad: false
    }
  }
  #pretrain_model_path: "/workspace/tlt-experiments/yolo_v4/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5"
  resume_model_path: "/workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned/weights/yolov4_resnet50_epoch_080.tlt"
}
eval_config {
  average_precision_mode: SAMPLE
  batch_size: 8
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.001
  clustering_iou_threshold: 0.5
  top_k: 200
}
augmentation_config {
  hue: 0.1
  saturation: 1.5
  exposure:1.5
  vertical_flip:0
  horizontal_flip: 0.5
  jitter: 0.3
  output_width: 1152
  output_height: 576
  randomize_input_shape_period: 0
  mosaic_prob: 0.5
  mosaic_min_ratio:0.2
}
dataset_config {
  data_sources: {
      label_directory_path: "/workspace/tlt-experiments/data/train/labels"
      image_directory_path: "/workspace/tlt-experiments/data/train/images"
  }
  include_difficult_in_training: true
  target_class_mapping {
      key: "ascus"
      value: "ascus"
  }
  target_class_mapping {
      key: "asch"
      value: "asch"
  }
  target_class_mapping {
      key: "lsil"
      value: "lsil"
  }
  target_class_mapping {
      key: "hsil"
      value: "hsil"
  }  
  target_class_mapping {
      key: "scc"
      value: "scc"
  }
  target_class_mapping {
      key: "agc"
      value: "agc"
  }
  target_class_mapping {
      key: "trichomonas"
      value: "trichomonas"
  }
  target_class_mapping {
      key: "candida"
      value: "candida"
  }
  target_class_mapping {
      key: "flora"
      value: "flora"
  }
  target_class_mapping {
      key: "herps"
      value: "herps"
  }
  target_class_mapping {
      key: "actinomyces"
      value: "actinomyces"
  }
  validation_data_sources: {
      label_directory_path: "/workspace/tlt-experiments/data/val/labels"
      image_directory_path: "/workspace/tlt-experiments/data/val/images"
  }
}

To resume training, I commented out pretrain_model_path in the spec file and added resume_model_path: "/workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned/weights/yolov4_resnet50_epoch_080.tlt" setting.

After running:

!tlt yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet50_kitti.txt \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                   -k $KEY \
                   --gpus 2

I got the following error:

Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 209, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 205, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 162, in run_experiment
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 102, in fit_generator
    callbacks.on_train_begin()
  File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 132, in on_train_begin
    callback.on_train_begin(logs)
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 748, in on_train_begin
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 771, in get_learning_rate
ValueError: SoftStartCosineAnnealingScheduler does not support a progress value < 0.0 or > 1.0 received (2.000000)
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[11310,1],1]
  Exit code:    1
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/bin/yolo_v4", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/entrypoint/yolo_v4.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-07-15 20:13:39,963 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

What can I do wrong?

Cheers
Jarek

Morganh · July 16, 2021, 3:05am

You already trained 80 epochs successfully and the training is complete, right?
If yes, your new training for 40 epochs is a new training.
Please set pretrain_model_path and comment out resume_model_path.

pretrain_model_path: “/workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned/weights/yolov4_resnet50_epoch_080.tlt”

BTW, please update your tlt version. You are running with TLT 3.0-dp-py3 version. Please update to TLT 3.0-py3 version. See the method in https://docs.nvidia.com/tlt/tlt-user-guide/text/tlt_quick_start_guide.html

pip3 install --upgrade nvidia-tlt

system · September 18, 2021, 3:50am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.