'IndexError: tuple index out of range' during Int8 Optimization for LPD training

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
• Network Type (Detectnet_v2)
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file (if you have one, please share it here)
• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)
I encountered an error while training the LPD model. The log is:

2022-06-28 17:21:39,956 [INFO] root: Registry: ['nvcr.io']
2022-06-28 17:21:40,005 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-u2128em5 because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2022-06-28 09:21:48,966 [INFO] root: Building exporter object.
2022-06-28 09:21:50,458 [INFO] root: Exporting the model.
2022-06-28 09:21:50,458 [INFO] root: Using input nodes: ['input_1']
2022-06-28 09:21:50,459 [INFO] root: Using output nodes: ['output_cov/Sigmoid', 'output_bbox/BiasAdd']
2022-06-28 09:21:50,459 [INFO] iva.common.export.keras_exporter: Using input nodes: ['input_1']
2022-06-28 09:21:50,459 [INFO] iva.common.export.keras_exporter: Using output nodes: ['output_cov/Sigmoid', 'output_bbox/BiasAdd']
/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
NOTE: UFF has been tested with TensorFlow 1.14.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
DEBUG [/usr/local/lib/python3.6/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['output_cov/Sigmoid', 'output_bbox/BiasAdd'] as outputs
2022-06-28 09:21:55,581 [INFO] iva.common.export.keras_exporter: Calibration takes time especially if number of batches is large.
2022-06-28 09:21:55,582 [INFO] root: Calibration takes time especially if number of batches is large.
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  IndexError: tuple index out of range

At:
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/tensorfile_calibrator.py(75): get_data_from_source
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/tensorfile_calibrator.py(95): get_batch
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py(536): __init__
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py(695): __init__
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py(436): export
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py(247): run_export
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py(265): launch_export
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/export.py(12): <module>

Aborted (core dumped)
2022-06-28 17:21:59,857 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The command is:

!tao detectnet_v2 export \
                  -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt \
                  -o $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector.etlt \
                  -k $KEY  \
                  --cal_data_file $USER_EXPERIMENT_DIR/experiment_dir_final/calibration.tensor \
                  --data_type int8 \
                  --batches 10 \
                  --batch_size 4 \
                  --max_batch_size 4 \
                  --engine_file $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector.trt.int8 \
                  --cal_cache_file $USER_EXPERIMENT_DIR/experiment_dir_final/calibration.bin \
                  --verbose
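
For context, calibration.tensor is expected to come from the calibration_tensorfile step earlier in the notebook. The standard DetectNet_v2 notebook produces it with a cell roughly like the one below (the spec path is only a placeholder for whatever spec file was actually used):

!tao detectnet_v2 calibration_tensorfile \
                  -e $SPECS_DIR/detectnet_v2_retrain_resnet18_kitti.txt \
                  -m 10 \
                  -o $USER_EXPERIMENT_DIR/experiment_dir_final/calibration.tensor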

Please help me, thank you very much.

Please set
--cal_image_dir images_to_use_for_calibration

Refer to DetectNet_v2 — TAO Toolkit 3.22.05 documentation

I executed the following:

!rm -rf $LOCAL_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector.etlt
!rm -rf $LOCAL_EXPERIMENT_DIR/experiment_dir_final/calibration.bin
!tao detectnet_v2 export \
                  -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt \
                  -o $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector.etlt \
                  -k $KEY  \
                  --cal_data_file $USER_EXPERIMENT_DIR/experiment_dir_final/calibration.tensor \
                  --data_type int8 \
                  --batches 10 \
                  --batch_size 4 \
                  --max_batch_size 4 \
                  --cal_image_dir $DATA_DOWNLOAD_DIR/testing/image_2 \
                  --engine_file $USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector.trt.int8 \
                  --cal_cache_file $USER_EXPERIMENT_DIR/experiment_dir_final/calibration.bin \
                  --verbose

but the error still exists.

2022-06-28 17:59:04,145 [INFO] root: Registry: ['nvcr.io']
2022-06-28 17:59:04,237 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-m5aq1gj7 because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2022-06-28 09:59:12,004 [INFO] root: Building exporter object.
2022-06-28 09:59:13,483 [INFO] root: Exporting the model.
2022-06-28 09:59:13,483 [INFO] root: Using input nodes: ['input_1']
2022-06-28 09:59:13,483 [INFO] root: Using output nodes: ['output_cov/Sigmoid', 'output_bbox/BiasAdd']
2022-06-28 09:59:13,483 [INFO] iva.common.export.keras_exporter: Using input nodes: ['input_1']
2022-06-28 09:59:13,483 [INFO] iva.common.export.keras_exporter: Using output nodes: ['output_cov/Sigmoid', 'output_bbox/BiasAdd']
/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
NOTE: UFF has been tested with TensorFlow 1.14.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
DEBUG [/usr/local/lib/python3.6/dist-packages/uff/converters/tensorflow/converter.py:96] Marking ['output_cov/Sigmoid', 'output_bbox/BiasAdd'] as outputs
2022-06-28 09:59:18,624 [INFO] iva.common.export.keras_exporter: Calibration takes time especially if number of batches is large.
2022-06-28 09:59:18,624 [INFO] root: Calibration takes time especially if number of batches is large.
terminate called after throwing an instance of 'pybind11::error_already_set'
  what():  IndexError: tuple index out of range

At:
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/tensorfile_calibrator.py(75): get_data_from_source
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/tensorfile_calibrator.py(95): get_batch
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py(536): __init__
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/export/_tensorrt.py(695): __init__
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/keras_exporter.py(436): export
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py(247): run_export
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/app.py(265): launch_export
  /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/export.py(12): <module>

Aborted (core dumped)
2022-06-28 17:59:22,855 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Set --batch_size 1 and retry.

I tried setting

 --batch_size 1

and

 --batch_size 1
--max_batch_size 1

but neither of these resolved the error.
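
As a rough sanity check (the local paths below assume the usual notebook mapping, e.g. $LOCAL_DATA_DIR mirroring $DATA_DOWNLOAD_DIR, so adjust to your setup), the tensorfile passed with --cal_data_file must exist and be non-empty, and the directory passed with --cal_image_dir must contain at least --batches x --batch_size images:

# Tensorfile referenced by --cal_data_file should exist and be non-empty
!ls -lh $LOCAL_EXPERIMENT_DIR/experiment_dir_final/calibration.tensor
# Calibration image directory should hold at least batches x batch_size images
!ls $LOCAL_DATA_DIR/testing/image_2 | wc -l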

I trained LPD following 'Creating a Real-Time License Plate Detection and Recognition App | NVIDIA Technical Blog'. There was no error in that step, but I modified the training config file as follows:

################################################################################
# The MIT License (MIT)
#
# Copyright (c) 2019-2021 NVIDIA CORPORATION
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
################################################################################

random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tao-experiments/data/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tao-experiments/data/training"
  }
  image_extension: "jpg"
  target_class_mapping {
    key: "lpd"
    value: "lpd"
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 720
    output_image_height: 1168
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
postprocessing_config {
  target_class_config {
    key: "lpd"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.9
        coverage_threshold: 0.00499999988824
        dbscan_eps: 0.20000000298
        dbscan_min_samples: 0.0500000007451
        minimum_bounding_box_height: 4
      }
    }
  }
}
model_config {
  pretrained_model_file: "/workspace/tao-experiments/detectnet_v2/ccpd_unpruned.tlt"
  num_layers: 18
  use_batch_norm: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
}
evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "lpd"
    value: 0.699999988079
  }
  evaluation_box_config {
    key: "lpd"
    value {
      minimum_height: 10
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  average_precision_mode: INTEGRATE
}
cost_function_config {
  target_classes {
    name: "lpd"
    class_weight: 1.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: true
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}
training_config {
  batch_size_per_gpu: 4
  num_epochs: 120
  #enable_qat: False
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 5e-04
      soft_start: 0.10000000149
      annealing: 0.699999988079
    }
  }
  regularizer {
    type: L1
    weight: 3.00000002618e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}
bbox_rasterizer_config {
  target_class_config {
    key: "lpd"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.40000000596
      cov_radius_y: 0.40000000596
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.400000154972
}

I added

        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.9

and deleted

       enable_qat: False

Then this error occurred.

Thank you, I have solved this error; the cause was that I had modified -m in the 'Int8 Optimization' step.
Thank you very much.
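
In case it helps anyone else: the calibrator aborts this way when the inputs to the export step don't line up, and in my case it was the model path given with -m in the Int8 Optimization cell. A quick check before re-running (the local path below is the usual mapping of the -m argument and may differ in your setup):

# The .tlt passed with -m should be the retrained, pruned LPD model used by
# the rest of the notebook, and it must exist at the mapped local path
!ls -lh $LOCAL_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt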

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.