OK, so I think I found the issue: the converter spec file was sitting inside the tfrecords directory, and the glob in my training spec did not exclude it.
For anyone with the same problem: make sure your glob pattern matches only the tfrecord files!
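To double-check a pattern before putting it into the spec, you can resolve it in plain Python. The paths below are just placeholders for my setup, and the shard names follow the usual *-fold-*-shard-* convention of the converter output:

import glob

# Too broad: this also matches the converter spec sitting in the same directory
print(glob.glob("/workspace/data/tfrecords/*"))

# Restricted to the shard files written by the converter (hypothetical names)
print(glob.glob("/workspace/data/tfrecords/*-fold-*-shard-*"))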
Sadly, when I now start the training, an OOM error is reported. I already use a batch size of 1, so I am not really sure why this happens. I also tracked the GPU memory, but that does not seem to be the cause.
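As an aside, the hint TensorFlow prints in the log below refers to TF1's RunOptions. In a plain TF1 session it would look roughly like this sketch; I cannot inject it into the TLT training loop itself, so this is only for illustration:

import tensorflow as tf

# Sketch of the hint from the log: report allocated tensors when an OOM occurs.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

x = tf.random.uniform([2, 2])  # stand-in for the real training fetch
with tf.Session() as sess:
    sess.run(x, options=run_options)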
Error log excerpt:
INFO:tensorflow:Graph was finalized.
2022-01-26 13:02:18,254 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2022-01-26 13:02:20,531 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2022-01-26 13:02:21,143 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2022-01-26 13:02:27,901 [INFO] tensorflow: Saving checkpoints for step-0.
INFO:tensorflow:epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.10506207, step = 0
2022-01-26 13:02:56,559 [INFO] tensorflow: epoch = 0.0, learning_rate = 4.9999994e-06, loss = 0.10506207, step = 0
2022-01-26 13:02:56,665 [INFO] iva.detectnet_v2.tfhooks.task_progress_monitor_hook: Epoch 0/120: loss: 0.10506 learning rate: 0.00000 Time taken: 0:00:00 ETA: 0:00:00
2022-01-26 13:02:56,665 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 0.083
block_3a_relu_1 (Activation)    (None, 256, 30, 40)  0           block_3a_bn_1[0][0]
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (None, 256, 30, 40)  590080      block_3a_relu_1[0][0]
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (None, 256, 30, 40)  33024       block_2b_relu[0][0]
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (None, 256, 30, 40)  1024        block_3a_conv_2[0][0]
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (None, 256, 30, 40)  1024        block_3a_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_5 (Add)                     (None, 256, 30, 40)  0           block_3a_bn_2[0][0]
                                                                 block_3a_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_3a_relu (Activation)      (None, 256, 30, 40)  0           add_5[0][0]
__________________________________________________________________________________________________
block_3b_conv_1 (Conv2D)        (None, 256, 30, 40)  590080      block_3a_relu[0][0]
__________________________________________________________________________________________________
block_3b_bn_1 (BatchNormalizati (None, 256, 30, 40)  1024        block_3b_conv_1[0][0]
__________________________________________________________________________________________________
block_3b_relu_1 (Activation)    (None, 256, 30, 40)  0           block_3b_bn_1[0][0]
__________________________________________________________________________________________________
block_3b_conv_2 (Conv2D)        (None, 256, 30, 40)  590080      block_3b_relu_1[0][0]
__________________________________________________________________________________________________
block_3b_bn_2 (BatchNormalizati (None, 256, 30, 40)  1024        block_3b_conv_2[0][0]
__________________________________________________________________________________________________
add_6 (Add)                     (None, 256, 30, 40)  0           block_3b_bn_2[0][0]
                                                                 block_3a_relu[0][0]
__________________________________________________________________________________________________
block_3b_relu (Activation)      (None, 256, 30, 40)  0           add_6[0][0]
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (None, 512, 30, 40)  1180160     block_3b_relu[0][0]
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (None, 512, 30, 40)  2048        block_4a_conv_1[0][0]
__________________________________________________________________________________________________
block_4a_relu_1 (Activation)    (None, 512, 30, 40)  0           block_4a_bn_1[0][0]
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (None, 512, 30, 40)  2359808     block_4a_relu_1[0][0]
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (None, 512, 30, 40)  131584      block_3b_relu[0][0]
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (None, 512, 30, 40)  2048        block_4a_conv_2[0][0]
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (None, 512, 30, 40)  2048        block_4a_conv_shortcut[0][0]
__________________________________________________________________________________________________
add_7 (Add)                     (None, 512, 30, 40)  0           block_4a_bn_2[0][0]
                                                                 block_4a_bn_shortcut[0][0]
__________________________________________________________________________________________________
block_4a_relu (Activation)      (None, 512, 30, 40)  0           add_7[0][0]
__________________________________________________________________________________________________
block_4b_conv_1 (Conv2D)        (None, 512, 30, 40)  2359808     block_4a_relu[0][0]
__________________________________________________________________________________________________
block_4b_bn_1 (BatchNormalizati (None, 512, 30, 40)  2048        block_4b_conv_1[0][0]
__________________________________________________________________________________________________
block_4b_relu_1 (Activation)    (None, 512, 30, 40)  0           block_4b_bn_1[0][0]
__________________________________________________________________________________________________
block_4b_conv_2 (Conv2D)        (None, 512, 30, 40)  2359808     block_4b_relu_1[0][0]
__________________________________________________________________________________________________
block_4b_bn_2 (BatchNormalizati (None, 512, 30, 40)  2048        block_4b_conv_2[0][0]
__________________________________________________________________________________________________
add_8 (Add)                     (None, 512, 30, 40)  0           block_4b_bn_2[0][0]
                                                                 block_4a_relu[0][0]
__________________________________________________________________________________________________
block_4b_relu (Activation)      (None, 512, 30, 40)  0           add_8[0][0]
__________________________________________________________________________________________________
output_bbox (Conv2D)            (None, 24, 30, 40)   12312       block_4b_relu[0][0]
__________________________________________________________________________________________________
output_cov (Conv2D)             (None, 6, 30, 40)    3078        block_4b_relu[0][0]
==================================================================================================
Total params: 11,210,718
Trainable params: 11,200,990
Non-trainable params: 9,728
__________________________________________________________________________________________________
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: Failed to allocate memory for the batch of component 0
         [[{{node data_loader_out}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
  (1) Resource exhausted: Failed to allocate memory for the batch of component 0
         [[{{node data_loader_out}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
         [[LookupTable_4/hash_table_Lookup/LookupTableFindV2/_3985]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 838, in <module>
  File "<decorator-gen-2>", line 2, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 827, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 708, in run_experiment
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 644, in train_gridbox
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 155, in run_training_loop
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: Failed to allocate memory for the batch of component 0
         [[node data_loader_out (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
  (1) Resource exhausted: Failed to allocate memory for the batch of component 0
         [[node data_loader_out (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
         [[LookupTable_4/hash_table_Lookup/LookupTableFindV2/_3985]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations.
0 derived errors ignored.
Original stack trace for 'data_loader_out':
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 838, in <module>
File "<decorator-gen-2>", line 2, in main
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 827, in main
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 708, in run_experiment
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 619, in train_gridbox
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 447, in build_training_graph
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py", line 694, in get_dataset_tensors
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/trainers/multi_task_trainer/data_loader_interface.py", line 77, in __call__
File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/data_loader.py", line 512, in call
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/iterator_ops.py", line 429, in get_next
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_dataset_ops.py", line 2518, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py", line 841, in <module>
AttributeError: module 'logging' has no attribute 'getLoggger'