Error when training detectnet_v2 resnet34 on tfrecord file

Hi, when running the following command to train:
tao detectnet_v2 train -k 06052019 -r resnet34_people_net/output_peoplenet -e resnet34_people_net/new_peoplenet.cfg

I get the error listed below.
My tfrecord file includes only one label (“Person”) and was NOT produced with detectnet’s dataset_convert method.
I understand this may be caused by the tfrecord file itself, but I would like to know if this is a common or known issue.

Thank you!

Traceback (most recent call last):
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 917, in
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 906, in
File “”, line 2, in main
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 893, in main
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 757, in run_experiment
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 626, in train_gridbox
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/build_dataloader.py”, line 273, in build_dataloader
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py”, line 491, in __init__
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py”, line 548, in _construct_data_sources
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py”, line 398, in __init__
File “/root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py”, line 439, in _get_max_image_size
IndexError: list index (0) out of range

Specfile:
model_config {
  arch: "resnet"
  pretrained_model_file: "/workspace/resnet34_people_net/resnet34_peoplenet.tlt"
  freeze_blocks: 1
  all_projections: True
  num_layers: 34
  use_pooling: False
  use_batch_norm: True
  objective_set: {
    cov {}
    bbox {
      scale: 35.0
      offset: 0.5
    }
  }
}

bbox_rasterizer_config {
  target_class_config {
    key: "Person"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}

postprocessing_config {
  target_class_config {
    key: "Person"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 20
      }
    }
  }
}

cost_function_config {
  target_classes {
    name: "Person"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: True
  max_objective_weight: 0.9999
  min_objective_weight: 0.0001
}

training_config {
  batch_size_per_gpu: 16
  num_epochs: 80
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-6
      max_learning_rate: 5e-4
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
  optimizer {
    adam {
      epsilon: 1e-08
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    enabled: False
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  visualizer {
    enabled: true
    num_images: 3
    scalar_logging_frequency: 10
    infrequent_logging_frequency: 1
    target_class_config {
      key: "Person"
      value: {
        coverage_threshold: 0.005
      }
    }
  }
}

augmentation_config {
  preprocessing {
    output_image_width: 1920
    output_image_height: 1072
    output_image_channel: 3
    enable_auto_resize: True
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}

evaluation_config {
  average_precision_mode: INTEGRATE
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "person"
    value: 0.5
  }
  evaluation_box_config {
    key: "Person"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
}

dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/resnet34_people_net/default.tfrecord"
    image_directory_path: "/workspace/Detections_dataset"
  }
  image_extension: "jpeg"
  target_class_mapping {
    key: "Person"
    value: "Person"
  }
  validation_fold: 1
}
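
A side note on the spec above: the class name is spelled "Person" in most sections but "person" under minimum_detection_ground_truth_overlap. Whether detectnet_v2 matches these keys case-sensitively is an assumption here, and this is not necessarily the cause of the IndexError, but a quick check can rule it out. A minimal Python sketch (the function name and the regex are illustrative, not part of TAO):

```python
import re
from collections import defaultdict

# Matches `key: "Person"` or `name: "Person"`, with straight or curly quotes.
CLASS_KEY = re.compile(r'\b(?:key|name)\s*:\s*["“”]([^"“”]+)["“”]')

# Objective names share the `name:` field with class names, so skip them.
OBJECTIVE_NAMES = {"cov", "bbox"}


def find_case_mismatches(spec_text):
    """Group class names by lowercase form; return groups with several spellings."""
    spellings = defaultdict(set)
    for match in CLASS_KEY.finditer(spec_text):
        name = match.group(1)
        if name not in OBJECTIVE_NAMES:
            spellings[name.lower()].add(name)
    return {k: sorted(v) for k, v in spellings.items() if len(v) > 1}


# Shortened excerpt of the spec, for illustration only.
sample = '''
bbox_rasterizer_config { target_class_config { key: "Person" } }
evaluation_config {
  minimum_detection_ground_truth_overlap { key: "person" value: 0.5 }
  evaluation_box_config { key: "Person" }
}
'''
print(find_case_mismatches(sample))  # {'person': ['Person', 'person']}
```

Running it over the full spec file (read with open(...).read()) would list every class name that appears with more than one capitalization.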

How did you generate the tfrecord file? Can you share the command?

The tfrecord file was generated directly from CVAT, using its TFRecord 1.0 export format.

Please generate the tfrecords with the dataset_convert tool.
See more details in DetectNet_v2 — TAO Toolkit 3.22.05 documentation

OK, so I exported the CVAT dataset in KITTI format instead and tried dataset_convert.
The command appears to run successfully, but no output file or directory is created.
(The directory KITTI_detections/tfrecords is never created.)

Command: tao detectnet_v2 dataset_convert -d resnet34_people_net/dataset_conv.cfg -o /KITTI_detections/tfrecords

Output text:
2022-10-19 09:41:51,932 [INFO] root: Registry: [‘nvcr.io’]
2022-10-19 09:41:52,048 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3
2022-10-19 09:41:52,079 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the “user”:“UID:GID” in the
DockerOptions portion of the “/home/sysadmin/.tao_mounts.json” file. You can obtain your
users UID and GID by using the “id -u” and “id -g” commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn’t match a supported version!
RequestsDependencyWarning)
Using TensorFlow backend.
2022-10-19 09:41:58,794 [INFO] iva.detectnet_v2.dataio.build_converter: Instantiating a kitti converter
2022-10-19 09:41:58,794 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Creating output directory /KITTI_detections
2022-10-19 09:41:58,799 [INFO] iva.detectnet_v2.dataio.kitti_converter_lib: Num images in
Train: 1699 Val: 424
2022-10-19 09:41:58,799 [INFO] iva.detectnet_v2.dataio.kitti_converter_lib: Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2022-10-19 09:41:58,800 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 0
WARNING:tensorflow:From /root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:169: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2022-10-19 09:41:58,800 [WARNING] tensorflow: From /root/.cache/bazel/_bazel_root/b770f990bb7b9e2db5771981fb3a38b4/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:169: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

/usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:315: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2022-10-19 09:41:58,837 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 1
2022-10-19 09:41:58,870 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 2
2022-10-19 09:41:58,906 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 3
2022-10-19 09:41:58,939 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 4
2022-10-19 09:41:58,971 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 5
2022-10-19 09:41:59,004 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 6
2022-10-19 09:41:59,036 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 7
2022-10-19 09:41:59,069 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 8
2022-10-19 09:41:59,101 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 9
2022-10-19 09:41:59,138 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib:
Wrote the following numbers of objects:
b’person’: 1597

2022-10-19 09:41:59,138 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 0
2022-10-19 09:41:59,276 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 1
2022-10-19 09:41:59,408 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 2
2022-10-19 09:41:59,545 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 3
2022-10-19 09:41:59,683 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 4
2022-10-19 09:41:59,822 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 5
2022-10-19 09:41:59,956 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 6
2022-10-19 09:42:00,089 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 7
2022-10-19 09:42:00,225 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 8
2022-10-19 09:42:00,362 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 9
2022-10-19 09:42:00,511 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib:
Wrote the following numbers of objects:
b’person’: 6684

2022-10-19 09:42:00,511 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Cumulative object statistics
2022-10-19 09:42:00,511 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib:
Wrote the following numbers of objects:
b’person’: 8281

2022-10-19 09:42:00,511 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Class map.
Label in GT: Label in tfrecords file
b’Person’: b’person’
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2022-10-19 09:42:00,511 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Tfrecords generation complete.
2022-10-19 09:42:01,660 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

dataset converter config file:
kitti_config {
  root_directory_path: "/workspace/KITTI_detections"
  image_dir_name: "Images"
  label_dir_name: "Labels"
  image_extension: ".jpeg"
  partition_mode: "random"
  num_partitions: 2
  val_split: 20
  num_shards: 10
}
image_directory_path: "/workspace/KITTI_detections"
target_class_mapping {
  key: "person"
  value: "person"
}
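
Note that the converter log above already reports two values the training spec needs: the validation data is in partition 0, and the label stored in the tfrecords is the lower-case "person". The training spec posted earlier uses validation_fold: 1 and "Person". As a rough sketch, these hints can be pulled out of a saved log with a few lines of Python. The function name is hypothetical and the regexes only assume the log lines look like the ones shown in this thread:

```python
import re


def hints_from_converter_log(log_text):
    """Extract the validation fold and class map that dataset_convert reports."""
    fold = re.search(r"Validation data in partition (\d+)", log_text)
    # Class-map lines look like: b'Person': b'person' (quotes may be curly).
    # Object-count lines such as b'person': 1597 do not match this pattern.
    pairs = re.findall(
        r"b['’‘]([^'’‘]+)['’‘]\s*:\s*b['’‘]([^'’‘]+)['’‘]", log_text
    )
    return {
        "validation_fold": int(fold.group(1)) if fold else None,
        "class_map": dict(pairs),
    }


# Excerpt of the log above, with timestamps removed.
log = """
Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
Wrote the following numbers of objects:
b'person': 1597
Class map.
Label in GT: Label in tfrecords file
b'Person': b'person'
"""
print(hints_from_converter_log(log))
# {'validation_fold': 0, 'class_map': {'Person': 'person'}}
```

The class map reads as ground-truth label to tfrecord label, and the log asks for the tfrecord label when writing dataset_config.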

Make sure your ~/.tao_mounts.json is set correctly.
In your case, the output tfrecord files are written to /KITTI_detections/ inside the container.
If tao_mounts.json does not map that container path to a local directory, you will not find the folder on the host.

For tao_mounts.json file, please see TAO Toolkit Launcher — TAO Toolkit 3.22.05 documentation

A simple way is to set the destination to the same path as the source.
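
To illustrate why the output seemed to vanish, here is a minimal Python sketch that resolves a container path to its host path from the "Mounts" entries of ~/.tao_mounts.json. The host directory /home/sysadmin/tao is made up for illustration, the helper name is hypothetical, and the prefix matching is only an approximation of how Docker bind mounts behave:

```python
import posixpath


def container_to_host(container_path, mounts):
    """Return the host path backing container_path, or None if no mount covers it."""
    path = posixpath.normpath(container_path)
    # Prefer the longest destination so nested mounts resolve correctly.
    for mount in sorted(mounts, key=lambda m: len(m["destination"]), reverse=True):
        dest = posixpath.normpath(mount["destination"])
        if path == dest or path.startswith(dest + "/"):
            rel = posixpath.relpath(path, dest)
            return posixpath.normpath(posixpath.join(mount["source"], rel))
    return None


# Hypothetical "Mounts" entry; the source path is an assumption.
mounts = [{"source": "/home/sysadmin/tao", "destination": "/workspace"}]

print(container_to_host("/workspace/KITTI_detections/tfrecords", mounts))
# /home/sysadmin/tao/KITTI_detections/tfrecords
print(container_to_host("/KITTI_detections/tfrecords", mounts))
# None
```

A path that resolves to None exists only inside the container, so anything written there is gone once the container stops.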

Solved the problem. It had to do with the destination directory in the dataset_convert command itself (/workspace/KITTI_detections/tfrecords instead of /KITTI_detections/tfrecords).
The training command works now.
Thank you.
