Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
Azure VM with an A100 GPU
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Detectnet_v2 with a ResNet-18 backbone
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
random_seed: 42
dataset_config {
data_sources {
tfrecords_path: "/workspace/tao-experiments/data/tfrecords/kitti_trainval/*"
image_directory_path: "/workspace/tao-experiments/data/training"
image_extension: "jpg"
target_class_mapping {
key: "balls"
value: "balls"
validation_fold: 0
augmentation_config {
preprocessing {
output_image_width: 1280
output_image_height: 720
min_bbox_width: 1.0
min_bbox_height: 1.0
output_image_channel: 3
spatial_augmentation {
hflip_probability: 0.5
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
postprocessing_config {
target_class_config {
key: "balls"
value {
clustering_config {
clustering_algorithm: DBSCAN
dbscan_confidence_threshold: 0.9
coverage_threshold: 0.00499999988824
dbscan_eps: 0.20000000298
dbscan_min_samples: 1
minimum_bounding_box_height: 20
model_config {
pretrained_model_file: "/workspace/tao-experiments/detectnet_v2/pretrained_resnet18/pretrained_detectnet_v2_vresnet18/resnet18.hdf5"
num_layers: 18
use_batch_norm: true
objective_set {
bbox {
scale: 35.0
offset: 0.5
cov {
arch: "resnet"
evaluation_config {
validation_period_during_training: 10
first_validation_epoch: 30
minimum_detection_ground_truth_overlap {
key: "balls"
value: 0.699999988079
evaluation_box_config {
key: "ball"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
average_precision_mode: INTEGRATE
cost_function_config {
target_classes {
name: "balls"
class_weight: 1.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
enable_autoweighting: false
max_objective_weight: 0.999899983406
min_objective_weight: 9.99999974738e-05
training_config {
batch_size_per_gpu: 4
num_epochs: 120
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-07
max_learning_rate: 5e-05
soft_start: 0.10000000149
annealing: 0.699999988079
regularizer {
type: L1
weight: 3.00000002618e-09
optimizer {
adam {
epsilon: 9.99999993923e-09
beta1: 0.899999976158
beta2: 0.999000012875
cost_scaling {
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
enabled: true
num_images: 3
scalar_logging_frequency: 50
infrequent_logging_frequency: 5
target_class_config {
key: "ball"
value: {
coverage_threshold: 0.005
checkpoint_interval: 10
bbox_rasterizer_config {
target_class_config {
key: "balls"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.40000000596
cov_radius_y: 0.40000000596
bbox_min_radius: 1.0
deadzone_radius: 0.400000154972
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
When I start training, the job fails with the error below, and I'm not sure what the error means. Is this something I'm doing wrong, or is it a TAO issue? Thank you.
!tao model detectnet_v2 train -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
-n resnet18_detector \
--gpus $NUM_GPUS
2023-07-31 22:59:22,715 [TAO Toolkit] [INFO] root 160: Registry: ['']
2023-07-31 22:59:22,767 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container:
2023-07-31 22:59:22,775 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
2023-07-31 22:59:28.050353: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
2023-07-31 22:59:28,086 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2023-07-31 22:59:29,230 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:29,260 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:29,263 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:31,493 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:33,102 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:33,129 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:33,132 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:35,983 [TAO Toolkit] [INFO] 197: Log file already exists at /workspace/tao-experiments/detectnet_v2/experiment_dir_unpruned/status.json
2023-07-31 22:59:35,984 [TAO Toolkit] [INFO] root 2102: Starting DetectNet_v2 Training job
2023-07-31 22:59:35,984 [TAO Toolkit] [INFO] __main__ 817: Loading experiment spec at /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt.
2023-07-31 22:59:35,985 [TAO Toolkit] [INFO] 113: Merging specification from /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt
2023-07-31 22:59:35,988 [TAO Toolkit] [INFO] root 2102: Training gridbox model.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/keras/backend/ The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
2023-07-31 22:59:35,988 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/keras/backend/ The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
2023-07-31 22:59:36,004 [TAO Toolkit] [INFO] root 2102: corrupted record at 0
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/", line 1067, in <module>
raise e
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/", line 1046, in <module>
File "/usr/local/lib/python3.8/dist-packages/", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/utilities/", line 46, in wrapped_fn
return_args = fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/", line 1024, in main
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/", line 887, in run_experiment
train_gridbox(results_dir, experiment_spec, output_model_file_name, input_model_file_name,
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/", line 658, in train_gridbox
dataloader = build_dataloader(dataset_proto=dataset_proto,
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/", line 277, in build_dataloader
return DATALOADER[dataloader_mode](**dataloader_kwargs)
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/", line 501, in __init__
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/", line 545, in _construct_data_sources
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/", line 404, in __init__
self.num_samples = sum([sum(1 for _ in tf.compat.v1.python_io.tf_record_iterator(filename))
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/", line 404, in <listcomp>
self.num_samples = sum([sum(1 for _ in tf.compat.v1.python_io.tf_record_iterator(filename))
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/", line 404, in <genexpr>
self.num_samples = sum([sum(1 for _ in tf.compat.v1.python_io.tf_record_iterator(filename))
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/lib/io/", line 181, in tf_record_iterator
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/", line 1034, in GetNext
return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
Execution status: FAIL
2023-07-31 22:59:40,029 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.
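The traceback ends in tf_record_iterator raising "DataLossError: corrupted record at 0", so the first bytes of at least one file matched by tfrecords_path could not be read as a TFRecord. One possible cause, which the log does not confirm, is that the trailing "*" wildcard also matches a file that is not a TFRecord, or one that was written incompletely. Below is a minimal sketch for checking each matched file without TensorFlow. It assumes the standard TFRecord framing (8-byte little-endian length, 4-byte masked CRC32C of the length, payload, 4-byte masked CRC32C of the payload). The function names count_records and scan are hypothetical helpers, not part of TAO or TensorFlow.

```python
import glob
import struct


def _make_crc32c_table():
    """Build the lookup table for CRC32C (Castagnoli, reflected polynomial)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC_TABLE = _make_crc32c_table()


def crc32c(data):
    """Plain CRC32C of a bytes object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data):
    """CRC32C with the rotation and offset that the TFRecord format applies."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def count_records(path):
    """Return the number of valid records, or raise ValueError at the first bad one."""
    count = 0
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            header = f.read(12)
            if not header:
                return count  # clean end of file
            if len(header) < 12:
                raise ValueError(f"truncated header at byte {offset}")
            length_bytes = header[:8]
            (length,) = struct.unpack("<Q", length_bytes)
            (length_crc,) = struct.unpack("<I", header[8:])
            if masked_crc(length_bytes) != length_crc:
                raise ValueError(f"corrupted record at byte {offset}")
            data = f.read(length)
            footer = f.read(4)
            if len(data) < length or len(footer) < 4:
                raise ValueError(f"truncated record at byte {offset}")
            (data_crc,) = struct.unpack("<I", footer)
            if masked_crc(data) != data_crc:
                raise ValueError(f"payload checksum mismatch at byte {offset}")
            count += 1


def scan(pattern):
    """Map every path matched by the glob pattern to a record count or an error message."""
    results = {}
    for path in sorted(glob.glob(pattern)):
        try:
            results[path] = count_records(path)
        except (ValueError, OSError) as err:
            results[path] = str(err)
    return results
```

For example, calling scan("/workspace/tao-experiments/data/tfrecords/kitti_trainval/*") inside the container (or the same pattern with the mapped host path when run outside it) returns a dictionary in which valid shards map to an integer count and problem files map to an error string. Any file that reports an error at byte 0 would be a candidate for the failure shown in the log.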