Hi,
I successfully trained a Yolo_v4 model with a Resnet50 backbone for 80 epochs, and I would now like to resume training for another 40 epochs.
Here is my setup:
• Hardware: 2 x RTX 2080 Ti
• Network Type: Yolo_v4 + Resnet50
• TLT Version:
Configuration of the TLT Instance
dockers: ['nvcr.io/nvidia/tlt-streamanalytics', 'nvcr.io/nvidia/tlt-pytorch']
format_version: 1.0
tlt_version: 3.0
published_date: 02/02/2021
• Training spec file:
random_seed: 42
yolov4_config {
big_anchor_shape: "[(141.60, 117.22), (199.38, 181.77), (358.80, 307.33)]"
mid_anchor_shape: "[(63.88, 72.37), (85.18, 58.48), (96.16, 90.22)]"
small_anchor_shape: "[(18.65, 16.55), (38.33, 34.70), (55.88, 48.94)]"
box_matching_iou: 0.25
arch: "resnet"
nlayers: 50
arch_conv_blocks: 2
loss_loc_weight: 0.8
loss_neg_obj_weights: 100.0
loss_class_weights: 0.5
label_smoothing: 0.0
big_grid_xy_extend: 0.05
mid_grid_xy_extend: 0.1
small_grid_xy_extend: 0.2
freeze_bn: false
#freeze_blocks: 0
force_relu: false
}
training_config {
batch_size_per_gpu: 2
num_epochs: 40
enable_qat: false
checkpoint_interval: 10
learning_rate {
soft_start_cosine_annealing_schedule {
min_learning_rate: 1e-7
max_learning_rate: 1e-4
soft_start: 0.3
}
}
regularizer {
type: L1
weight: 3e-5
}
optimizer {
adam {
epsilon: 1e-7
beta1: 0.9
beta2: 0.999
amsgrad: false
}
}
#pretrain_model_path: "/workspace/tlt-experiments/yolo_v4/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5"
resume_model_path: "/workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned/weights/yolov4_resnet50_epoch_080.tlt"
}
eval_config {
average_precision_mode: SAMPLE
batch_size: 8
matching_iou_threshold: 0.5
}
nms_config {
confidence_threshold: 0.001
clustering_iou_threshold: 0.5
top_k: 200
}
augmentation_config {
hue: 0.1
saturation: 1.5
exposure: 1.5
vertical_flip: 0
horizontal_flip: 0.5
jitter: 0.3
output_width: 1152
output_height: 576
randomize_input_shape_period: 0
mosaic_prob: 0.5
mosaic_min_ratio: 0.2
}
dataset_config {
data_sources: {
label_directory_path: "/workspace/tlt-experiments/data/train/labels"
image_directory_path: "/workspace/tlt-experiments/data/train/images"
}
include_difficult_in_training: true
target_class_mapping {
key: "ascus"
value: "ascus"
}
target_class_mapping {
key: "asch"
value: "asch"
}
target_class_mapping {
key: "lsil"
value: "lsil"
}
target_class_mapping {
key: "hsil"
value: "hsil"
}
target_class_mapping {
key: "scc"
value: "scc"
}
target_class_mapping {
key: "agc"
value: "agc"
}
target_class_mapping {
key: "trichomonas"
value: "trichomonas"
}
target_class_mapping {
key: "candida"
value: "candida"
}
target_class_mapping {
key: "flora"
value: "flora"
}
target_class_mapping {
key: "herps"
value: "herps"
}
target_class_mapping {
key: "actinomyces"
value: "actinomyces"
}
validation_data_sources: {
label_directory_path: "/workspace/tlt-experiments/data/val/labels"
image_directory_path: "/workspace/tlt-experiments/data/val/images"
}
}
To resume training, I commented out the pretrain_model_path line in the spec file and added a resume_model_path setting pointing at the epoch-80 checkpoint.
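In other words, the only delta from the spec I used for the first 80 epochs is this pair of lines in training_config:
#pretrain_model_path: "/workspace/tlt-experiments/yolo_v4/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5"
resume_model_path: "/workspace/tlt-experiments/yolo_v4/experiment_dir_unpruned/weights/yolov4_resnet50_epoch_080.tlt"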
After running:
!tlt yolo_v4 train -e $SPECS_DIR/yolo_v4_train_resnet50_kitti.txt \
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
-k $KEY \
--gpus 2
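(For context: the variables in this command come from my notebook setup, roughly as below. The key is a placeholder and the paths simply mirror my mount layout, so treat the values as examples rather than exact copies of my environment.)
%env KEY=<my_ngc_api_key>
%env USER_EXPERIMENT_DIR=/workspace/tlt-experiments/yolo_v4
%env SPECS_DIR=/workspace/tlt-experiments/yolo_v4/specs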
I got the following error:
Traceback (most recent call last):
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 209, in <module>
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 205, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/scripts/train.py", line 162, in run_experiment
File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 102, in fit_generator
callbacks.on_train_begin()
File "/usr/local/lib/python3.6/dist-packages/keras/callbacks.py", line 132, in on_train_begin
callback.on_train_begin(logs)
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 748, in on_train_begin
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/utils.py", line 771, in get_learning_rate
ValueError: SoftStartCosineAnnealingScheduler does not support a progress value < 0.0 or > 1.0 received (2.000000)
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[11310,1],1]
Exit code: 1
--------------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/bin/yolo_v4", line 8, in <module>
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/yolo_v4/entrypoint/yolo_v4.py", line 12, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-07-15 20:13:39,963 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
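If I read the error correctly, the scheduler normalizes training progress into [0, 1], and resuming at epoch 80 while num_epochs is 40 would produce exactly the reported value. Here is a minimal Python sketch of my reading of the failure; the formula is my guess from the error message, not taken from the TLT source:

# Assumed progress formula, inferred from the ValueError text only:
def progress(current_epoch: int, num_epochs: int) -> float:
    """Normalized progress the cosine-annealing scheduler seems to expect in [0, 1]."""
    return current_epoch / num_epochs

print(progress(80, 40))   # -> 2.0, matches "received (2.000000)"
print(progress(80, 120))  # -> ~0.67, in range if num_epochs were the new total

So my working theory is that num_epochs should perhaps be the new total (80 + 40 = 120) rather than the 40 additional epochs, but I could not find this documented for resumed runs.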
What am I doing wrong?
Cheers
Jarek