Please provide the following information when requesting support.
• Hardware: GCP GPU-based VM
• Network Type: Detectnet_v2
• TLT Version: 3.0
• Training spec file: detectnet_v2_train_resnet18_kitti.txt, updated for ResNet-34
I am training DetectNet_v2 using the pretrained PeopleNet model. I updated the spec file mentioned above to meet the requirements of PeopleNet, following the instructions in this blog: https://developer.nvidia.com/blog/training-custom-pretrained-models-using-tlt/
In the spec file I still have two sections that need adapting to the 3 classes of PeopleNet (from the original ResNet18 classes). These sections are bbox_rasterizer_config and cost_function_config.
Below is the current spec file (not functional yet):
random_seed: 42
dataset_config {
data_sources: {
tfrecords_path: '/workspace/tao-experiments/data/tfrecords/kitti_trainval/*'
image_directory_path: '/workspace/tao-experiments/data/training'
}
image_extension: 'jpg'
target_class_mapping {
key: 'person'
value: 'person'
}
target_class_mapping {
key: 'face'
value: 'face'
}
target_class_mapping {
key: 'bag'
value: 'bag'
}
validation_fold: 0
# For evaluation on test set
validation_data_source: {
tfrecords_path: "/path/to/test_tfrecords/*"
image_directory_path: "/path/to/test_root"
}
augmentation_config {
preprocessing {
output_image_width: 960
output_image_height: 544
crop_right: 960
crop_bottom: 544
min_bbox_width: 1.0
min_bbox_height: 1.0
output_image_channel: 3
}
spatial_augmentation {
hflip_probability: 0.5
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}
postprocessing_config{
target_class_config{
key: 'person'
value: {
clustering_config {
coverage_threshold: 0.005
dbscan_eps: 0.265
dbscan_min_samples: 0.05
minimum_bounding_box_height: 4
}
}
}
target_class_config{
key: 'bag'
value: {
clustering_config {
coverage_threshold: 0.005
dbscan_eps: 0.15
dbscan_min_samples: 0.05
minimum_bounding_box_height: 4
}
}
}
target_class_config{
key: 'face'
value: {
clustering_config {
coverage_threshold: 0.005
dbscan_eps: 0.15
dbscan_min_samples: 0.05
minimum_bounding_box_height: 2
}
}
}
}
model_config {
num_layers: 34
pretrained_model_file: '/workspace/tao-experiments/detectnet_v2/pretrained_resnet34-peoplenet/resnet34_peoplenet.tlt'
freeze_blocks: 0
use_batch_norm: true
objective_set {
bbox {
scale: 35.0
offset: 0.5
}
cov {
}
}
training_precision {
backend_floatx: FLOAT32
}
arch: 'resnet'
all_projections: true
}
evaluation_config {
validation_period_during_training: 10
first_validation_epoch: 120
minimum_detection_ground_truth_overlap {
key: 'bag'
value: 0.5
}
minimum_detection_ground_truth_overlap {
key: 'face'
value: 0.5
}
minimum_detection_ground_truth_overlap {
key: 'person'
value: 0.5
}
evaluation_box_config {
key: 'bag'
value {
minimum_height: 40
maximum_height: 9999
minimum_width: 4
maximum_width: 9999
}
}
evaluation_box_config {
key: 'face'
value {
minimum_height: 2
maximum_height: 9999
minimum_width: 2
maximum_width: 9999
}
}
evaluation_box_config {
key: 'person'
value {
minimum_height: 40
maximum_height: 9999
minimum_width: 4
maximum_width: 9999
}
}
}
##to be updated to 3 classes
cost_function_config {
target_classes {
name: 'car'
class_weight: 1.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: 'cov'
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: 'bbox'
initial_weight: 10.0
weight_target: 10.0
}
}
target_classes {
name: 'cyclist'
class_weight: 8.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: 'cov'
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: 'bbox'
initial_weight: 10.0
weight_target: 1.0
}
}
target_classes {
name: 'pedestrian'
class_weight: 4.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: 'cov'
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: 'bbox'
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: true
max_objective_weight: 0.999899983406
min_objective_weight: 9.99999974738e-05
}
training_config {
batch_size_per_gpu: 24
num_epochs: 120
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-06
max_learning_rate: 0.0005
soft_start: 0.1
annealing: 0.7
}
}
regularizer {
type: L1
weight: 3e-09
}
optimizer {
adam {
epsilon: 9.9e-09
beta1: 0.9
beta2: 0.999
}
}
cost_scaling {
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
checkpoint_interval: 10
}
##to be updated to 3 classes
bbox_rasterizer_config {
target_class_config {
key: 'car'
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.40000000596
cov_radius_y: 0.40000000596
bbox_min_radius: 1.0
}
}
target_class_config {
key: "cyclist"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 1.0
cov_radius_y: 1.0
bbox_min_radius: 1.0
}
}
target_class_config {
key: 'pedestrian'
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 1.0
cov_radius_y: 1.0
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.400000154972
}
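Since cost_function_config and bbox_rasterizer_config still name the KITTI classes while dataset_config maps to person, face and bag, a quick consistency check over the spec text may help narrow things down. The following is only a rough sketch and not part of TAO: the helper names (section_body, names_in, check_spec) and the small SAMPLE spec are made up for illustration, and it assumes the class name is the first field inside each target_classes / target_class_config block, as in the spec above.

```python
import re

# Accept straight quotes plus the curly quotes a forum editor may substitute.
QUOTE = "[\"'\u2018\u2019\u201c\u201d]"


def section_body(spec_text, section):
    """Return the text inside the first `section { ... }` block, or ''."""
    start = re.search(r"\b" + re.escape(section) + r"\s*:?\s*\{", spec_text)
    if start is None:
        return ""
    depth, i = 1, start.end()
    # Walk forward until the opening brace of the section is closed
    # (or the text ends, if the braces are unbalanced).
    while i < len(spec_text) and depth > 0:
        if spec_text[i] == "{":
            depth += 1
        elif spec_text[i] == "}":
            depth -= 1
        i += 1
    return spec_text[start.end():i]


def names_in(spec_text, section, pattern):
    """Collect the class names matched by `pattern` inside `section`."""
    return set(re.findall(pattern, section_body(spec_text, section)))


def check_spec(spec_text):
    """Compare per-section class names with the dataset_config mapping values."""
    quoted = QUOTE + r"(\w+)" + QUOTE
    expected = names_in(spec_text, "dataset_config", r"value\s*:\s*" + quoted)
    per_section = {
        "cost_function_config": r"target_classes\s*\{\s*name\s*:\s*" + quoted,
        "bbox_rasterizer_config": r"target_class_config\s*\{\s*key\s*:\s*" + quoted,
    }
    # A non-zero balance means an opening or closing brace is missing somewhere.
    report = {"brace_balance": spec_text.count("{") - spec_text.count("}")}
    for section, pattern in per_section.items():
        found = names_in(spec_text, section, pattern)
        report[section] = {
            "missing": sorted(expected - found),
            "unexpected": sorted(found - expected),
        }
    return report


# Hypothetical miniature spec, only to exercise the helpers.
SAMPLE = """
dataset_config {
  target_class_mapping { key: 'person' value: 'person' }
  target_class_mapping { key: 'face' value: 'face' }
}
cost_function_config {
  target_classes { name: 'car' objectives { name: 'cov' } }
}
bbox_rasterizer_config {
  target_class_config { key: 'person' value { cov_center_x: 0.5 } }
}
"""

if __name__ == "__main__":
    print(check_spec(SAMPLE))
```

To run it against the real file, read the spec into a string (for example with open(path).read()) and pass it to check_spec. Any class listed under "missing" or "unexpected", or a non-zero brace_balance, would be a place to look first, though I cannot say for certain that this is what interrupts the training.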
When I run this cell:
!tao detectnet_v2 train -e $SPECS_DIR/detectnet_v2_train_resnet34_kitti-Copy18.txt \
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
-k $KEY \
-n resnet34_detector \
--gpus $NUM_GPUS
I get this message:
2021-09-27 12:36:31,531 [INFO] root: Registry: ['nvcr.io']
2021-09-27 12:36:31,611 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/janet/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:43: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/tfhooks/checkpoint_saver_hook.py:25: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py:68: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py:68: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
2021-09-27 12:36:38,896 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
2021-09-27 12:36:38,896 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
2021-09-27 12:36:39,553 [INFO] iva.common.logging.logging: Log file already exists at /workspace/tao-experiments/detectnet_v2/experiment_dir_unpruned/status.json
2021-09-27 12:36:39,554 [INFO] __main__: Loading experiment spec at /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_train_resnet34_kitti-Copy18.txt.
2021-09-27 12:36:39,556 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_train_resnet34_kitti-Copy18.txt
2021-09-27 12:36:40,159 [INFO] __main__: Cannot iterate over exactly 6434 samples with a batch size of 24; each epoch will therefore take one extra step.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:107: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
2021-09-27 12:36:40,162 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:107: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:110: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
2021-09-27 12:36:40,162 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:110: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:113: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
2021-09-27 12:36:40,165 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py:113: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
2021-09-27 12:36:40,250 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
2021-09-27 12:36:40,252 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.
2021-09-27 12:36:40,276 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.
2021-09-27 12:36:42,850 [INFO] __main__: Training was interrupted.
Time taken to run __main__:main: 0:00:03.962915.
2021-09-27 12:36:43,989 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
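Most of the output above is TensorFlow deprecation noise. To make the remaining lines easier to read, something like the following small filter could be applied to a saved copy of the log. This is just a sketch: the function name non_deprecation_lines is hypothetical, and it assumes every deprecation warning contains the phrase "is deprecated. Please use", as they do in the log above.

```python
import re

# Phrase shared by the TensorFlow deprecation warnings in the log above.
DEPRECATION = re.compile(r"is deprecated\. Please use")


def non_deprecation_lines(log_text):
    """Return the non-empty log lines that are not deprecation warnings."""
    return [
        line
        for line in log_text.splitlines()
        if line.strip() and not DEPRECATION.search(line)
    ]


# Three lines in the same format as the log above, used as a small example.
SAMPLE_LOG = """Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
2021-09-27 12:36:42,850 [INFO] __main__: Training was interrupted.
"""

if __name__ == "__main__":
    for line in non_deprecation_lines(SAMPLE_LOG):
        print(line)
```

Applied to the full output, this leaves only the startup messages, the spec-loading lines and the final "Training was interrupted." message, which is all the information the log gives about the failure.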
Can you help me figure out why the training was stopped and how to update the sections in the spec file?
Thanks