Retinanet resume_model_path in TAO 3.21.11 causes errors

When trying to resume training a retinanet model, I get the following error:

File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/retinanet/utils/model_io.py", line 172, in load_model_as_pretrain
File "/usr/local/lib/python3.6/dist-packages/keras/engine/input_layer.py", line 167, in Input
    assert shape is not None, ('Please provide to Input either a shape
AssertionError: Please provide to Input either a shape or a batch_shape argument. Note that shape does not include the batch dimension.

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[53235,1],0]
Exit code: 1

I am using resume_model_path in the spec file. I’ve filled out all other information below.

Please advise on why this is happening. We need to stop and resume our trainings periodically, so we need this functionality to work.

Please provide the following information when requesting support.

• Hardware: V100
• Network Type: Retinanet
• TLT Version: tao 3.21.11, docker tag: v3.21.11-tf1.15.5-py3, v3.21.11-py3
• Training spec file

random_seed: 42
retinanet_config {
  aspect_ratios_global: "[1.0, 2.0, 0.5]"
  scales: "[0.045, 0.09, 0.2, 0.4, 0.55, 0.7]"
  two_boxes_for_ar1: false
  clip_boxes: false
  loss_loc_weight: 0.8
  focal_loss_alpha: 0.25
  focal_loss_gamma: 2.0
  variances: "[0.1, 0.1, 0.2, 0.2]"
  arch: "resnet"
  nlayers: 18
  n_kernels: 1
  n_anchor_levels: 1
  feature_size: 256
  freeze_bn: false
  freeze_blocks: 0
}
training_config {
  enable_qat: True
  batch_size_per_gpu: 4
  num_epochs: 200
  #pretrain_model_path: "/workspace/taov3/tao_pretrained_models/resnet_18.hdf5"
  resume_model_path: "/workspace/taov3/model_1/output/weights/retinanet_resnet18_epoch_010.tlt"
  optimizer {
    adam {
      epsilon: 1e-08
      beta1: 0.9
      beta2: 0.999
    }
  }
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-6
      max_learning_rate: 5e-3
      soft_start: 0.15
      annealing: 0.5
    }
  }
  regularizer {
    type: L1
    weight: 2e-5
  }
}
eval_config {
  validation_period_during_training: 10
  average_precision_mode: SAMPLE
  batch_size: 4
  matching_iou_threshold: 0.5
}
nms_config {
  confidence_threshold: 0.01
  clustering_iou_threshold: 0.6
  top_k: 200
}
augmentation_config {
  output_width: 1280
  output_height: 736
  output_channel: 3
  zoom_out_min_scale: 1.0
  zoom_out_max_scale: 1.5
  contrast: 0.1
  saturation: 0.2
  hue: 25
  random_flip: 0.25
}
dataset_config {
  data_sources: {
    label_directory_path: "/workspace/data/trainingDataSets/model_1/train/labels"
    image_directory_path: "/workspace/data/trainingDataSets/model_1/train/images"
  }
  target_class_mapping {
    key: "person"
    value: "person"
  }
  target_class_mapping {
    key: "building"
    value: "building"
  }
  target_class_mapping {
    key: "car"
    value: "vehicle"
  }
  target_class_mapping {
    key: "truck"
    value: "vehicle"
  }
  validation_data_sources: {
    label_directory_path: "/workspace/data/trainingDataSets/model_1/validation/labels"
    image_directory_path: "/workspace/data/trainingDataSets/model_1/validation/images"
  }
}

• How to reproduce the issue? Run retinanet training, like so:
!tao retinanet train -e $SPECS_DIR/retinanet_train_resnet18_kitti_seq.txt \
                     -r $USER_EXPERIMENT_DIR/output \
                     -k $KEY \
                     --gpus 2 \
                     --gpu_index 2 3

May I know if it is always reproducible? Could you try resuming training with another .tlt model? Thanks.

It is always reproducible. I have tried resuming training with several .tlt models and I always get that error.

Thanks for the info, I will check if I can reproduce.

Thank you for looking into this.

I can reproduce your result and will check further internally.
As a workaround, could you set the following to continue training?

pretrain_model_path: "/workspace/taov3/model_1/output/weights/retinanet_resnet18_epoch_010.tlt"
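
For reference, a minimal sketch of the training_config with that workaround applied (same checkpoint path as in the spec above; the optimizer, learning_rate, and regularizer sub-blocks stay as they were):

training_config {
  enable_qat: True
  batch_size_per_gpu: 4
  num_epochs: 200
  # resume_model_path commented out for the workaround
  #resume_model_path: "/workspace/taov3/model_1/output/weights/retinanet_resnet18_epoch_010.tlt"
  pretrain_model_path: "/workspace/taov3/model_1/output/weights/retinanet_resnet18_epoch_010.tlt"
  # optimizer, learning_rate, and regularizer blocks unchanged from the original spec
}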

Thank you for looking into this further. I will use pretrain_model_path as a workaround.
