TAO 5.0 failed to train

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)
Azure VM (A100)
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
DetectNet_v2 with a ResNet-18 backbone
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file (if you have one, please share it here)

random_seed: 42
dataset_config {
  data_sources {
    tfrecords_path: "/workspace/tao-experiments/data/tfrecords/kitti_trainval/*"
    image_directory_path: "/workspace/tao-experiments/data/training"
  }
  image_extension: "jpg"
  target_class_mapping {
    key: "balls"
    value: "balls"
  }
  validation_fold: 0
}
augmentation_config {
  preprocessing {
    output_image_width: 1280
    output_image_height: 720
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
postprocessing_config {
  target_class_config {
    key: "balls"
    value {
      clustering_config {
        clustering_algorithm: DBSCAN
        dbscan_confidence_threshold: 0.9
        coverage_threshold: 0.00499999988824
        dbscan_eps: 0.20000000298
        dbscan_min_samples: 1
        minimum_bounding_box_height: 20
      }
    }
  }
}
model_config {
  pretrained_model_file: "/workspace/tao-experiments/detectnet_v2/pretrained_resnet18/pretrained_detectnet_v2_vresnet18/resnet18.hdf5"
  num_layers: 18
  use_batch_norm: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  arch: "resnet"
}
evaluation_config {
  validation_period_during_training: 10
  first_validation_epoch: 30
  minimum_detection_ground_truth_overlap {
    key: "balls"
    value: 0.699999988079
  }
  evaluation_box_config {
    key: "ball"
    value {
      minimum_height: 20
      maximum_height: 9999
      minimum_width: 10
      maximum_width: 9999
    }
  }
  average_precision_mode: INTEGRATE
}
cost_function_config {
  target_classes {
    name: "balls"
    class_weight: 1.0
    coverage_foreground_weight: 0.0500000007451
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: false
  max_objective_weight: 0.999899983406
  min_objective_weight: 9.99999974738e-05
}
training_config {
  batch_size_per_gpu: 4
  num_epochs: 120
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-07
      max_learning_rate: 5e-05
      soft_start: 0.10000000149
      annealing: 0.699999988079
    }
  }
  regularizer {
    type: L1
    weight: 3.00000002618e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  visualizer{
    enabled: true
    num_images: 3
    scalar_logging_frequency: 50
    infrequent_logging_frequency: 5
    target_class_config {
      key: "ball"
      value: {
        coverage_threshold: 0.005
      }
    }
  }
  checkpoint_interval: 10
}
bbox_rasterizer_config {
  target_class_config {
    key: "balls"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.40000000596
      cov_radius_y: 0.40000000596
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.400000154972
}

• How to reproduce the issue? (This is for errors. Please share the command line and the detailed log here.)

When I go to train, something fails and I'm not sure what the error means. Is this something I'm doing wrong, or is it TAO? Thank you.

cell:

!tao model detectnet_v2 train -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
                        -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                        -n resnet18_detector \
                        --gpus $NUM_GPUS

output:

2023-07-31 22:59:22,715 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-07-31 22:59:22,767 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2023-07-31 22:59:22,775 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
2023-07-31 22:59:28.050353: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2023-07-31 22:59:28,086 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2023-07-31 22:59:29,230 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:29,260 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:29,263 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:31,493 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:33,102 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:33,129 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:33,132 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:35,983 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.logging.logging 197: Log file already exists at /workspace/tao-experiments/detectnet_v2/experiment_dir_unpruned/status.json
2023-07-31 22:59:35,984 [TAO Toolkit] [INFO] root 2102: Starting DetectNet_v2 Training job
2023-07-31 22:59:35,984 [TAO Toolkit] [INFO] __main__ 817: Loading experiment spec at /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt.
2023-07-31 22:59:35,985 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.spec_handler.spec_loader 113: Merging specification from /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt
2023-07-31 22:59:35,988 [TAO Toolkit] [INFO] root 2102: Training gridbox model.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-07-31 22:59:35,988 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-07-31 22:59:36,004 [TAO Toolkit] [INFO] root 2102: corrupted record at 0
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 1067, in <module>
    raise e
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 1046, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
    return_args = fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 1024, in main
    run_experiment(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 887, in run_experiment
    train_gridbox(results_dir, experiment_spec, output_model_file_name, input_model_file_name,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 658, in train_gridbox
    dataloader = build_dataloader(dataset_proto=dataset_proto,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/build_dataloader.py", line 277, in build_dataloader
    return DATALOADER[dataloader_mode](**dataloader_kwargs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/drivenet_dataloader.py", line 501, in __init__
    self._construct_data_sources(self.training_data_sources)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/drivenet_dataloader.py", line 545, in _construct_data_sources
    DriveNetTFRecordsDataSource(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/drivenet_dataloader.py", line 404, in __init__
    self.num_samples = sum([sum(1 for _ in tf.compat.v1.python_io.tf_record_iterator(filename))
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/drivenet_dataloader.py", line 404, in <listcomp>
    self.num_samples = sum([sum(1 for _ in tf.compat.v1.python_io.tf_record_iterator(filename))
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/drivenet_dataloader.py", line 404, in <genexpr>
    self.num_samples = sum([sum(1 for _ in tf.compat.v1.python_io.tf_record_iterator(filename))
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/lib/io/tf_record.py", line 181, in tf_record_iterator
    reader.GetNext()
  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 1034, in GetNext
    return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
Execution status: FAIL
2023-07-31 22:59:40,029 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.

It is related to the tfrecord files.
Did you generate the tfrecord files successfully? If possible, please share the logs.

Also, please check the tfrecord files and delete any files which have 0 size.
$ ls -rltsh /workspace/tao-experiments/data/tfrecords/kitti_trainval/*
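A quick way to spot any empty files on the host side (this assumes $LOCAL_DATA_DIR points at the same data directory the notebook cells use) is something like:

$ find $LOCAL_DATA_DIR/tfrecords/kitti_trainval/ -type f -size 0

Any file it prints is zero bytes and should be deleted or regenerated.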

Yes, I have created them.

cell:

# Creating a new directory for the output tfrecords dump.
print("Converting Tfrecords for kitti trainval dataset")
!mkdir -p $LOCAL_DATA_DIR/tfrecords && rm -rf $LOCAL_DATA_DIR/tfrecords/*
!tao model detectnet_v2 dataset_convert \
                  -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt \
                  -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval \
                  -r $USER_EXPERIMENT_DIR/

output:

Converting Tfrecords for kitti trainval dataset
2023-07-31 22:56:58,208 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-07-31 22:56:58,259 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2023-07-31 22:56:58,324 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
2023-07-31 22:57:10.026589: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2023-07-31 22:57:11,294 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2023-07-31 22:57:19,116 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:57:19,315 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:57:19,348 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:57:33,511 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:57:34,944 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:57:34,972 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:57:34,974 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2023-07-31 22:57:35,347 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.logging.logging 197: Log file already exists at /workspace/tao-experiments/detectnet_v2/status.json
2023-07-31 22:57:35,347 [TAO Toolkit] [INFO] root 2102: Starting Object Detection Dataset Convert.
2023-07-31 22:57:35,348 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.build_converter 87: Instantiating a kitti converter
2023-07-31 22:57:35,348 [TAO Toolkit] [INFO] root 2102: Instantiating a kitti converter
2023-07-31 22:57:35,348 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 71: Creating output directory /workspace/tao-experiments/data/tfrecords/kitti_trainval
2023-07-31 22:57:35,348 [TAO Toolkit] [INFO] root 2102: Generating partitions
2023-07-31 22:57:35,370 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.kitti_converter_lib 176: Num images in
Train: 11568	Val: 1883
2023-07-31 22:57:35,371 [TAO Toolkit] [INFO] root 2102: Num images in
Train: 11568	Val: 1883
2023-07-31 22:57:35,371 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.kitti_converter_lib 197: Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2023-07-31 22:57:35,371 [TAO Toolkit] [INFO] root 2102: Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2023-07-31 22:57:35,374 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 0
2023-07-31 22:57:35,374 [TAO Toolkit] [INFO] root 2102: Writing partition 0, shard 0
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataio/dataset_converter_lib.py:181: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2023-07-31 22:57:35,374 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataio/dataset_converter_lib.py:181: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2023-07-31 22:57:36,125 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 1
2023-07-31 22:57:36,126 [TAO Toolkit] [INFO] root 2102: Writing partition 0, shard 1
2023-07-31 22:57:36,770 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 2
2023-07-31 22:57:36,770 [TAO Toolkit] [INFO] root 2102: Writing partition 0, shard 2
2023-07-31 22:57:37,398 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 3
2023-07-31 22:57:37,399 [TAO Toolkit] [INFO] root 2102: Writing partition 0, shard 3
2023-07-31 22:57:38,033 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 4
2023-07-31 22:57:38,033 [TAO Toolkit] [INFO] root 2102: Writing partition 0, shard 4
2023-07-31 22:57:38,651 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 5
2023-07-31 22:57:38,651 [TAO Toolkit] [INFO] root 2102: Writing partition 0, shard 5
2023-07-31 22:57:39,257 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 6
2023-07-31 22:57:39,257 [TAO Toolkit] [INFO] root 2102: Writing partition 0, shard 6
2023-07-31 22:57:39,850 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 7
2023-07-31 22:57:39,850 [TAO Toolkit] [INFO] root 2102: Writing partition 0, shard 7
2023-07-31 22:57:40,452 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 8
2023-07-31 22:57:40,452 [TAO Toolkit] [INFO] root 2102: Writing partition 0, shard 8
2023-07-31 22:57:41,057 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 9
2023-07-31 22:57:41,057 [TAO Toolkit] [INFO] root 2102: Writing partition 0, shard 9
2023-07-31 22:57:41,649 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 250: 
Wrote the following numbers of objects:
b'balls': 1883

2023-07-31 22:57:41,649 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 0
2023-07-31 22:57:41,649 [TAO Toolkit] [INFO] root 2102: Writing partition 1, shard 0
2023-07-31 22:57:45,077 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 1
2023-07-31 22:57:45,077 [TAO Toolkit] [INFO] root 2102: Writing partition 1, shard 1
2023-07-31 22:57:48,223 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 2
2023-07-31 22:57:48,223 [TAO Toolkit] [INFO] root 2102: Writing partition 1, shard 2
2023-07-31 22:57:51,168 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 3
2023-07-31 22:57:51,168 [TAO Toolkit] [INFO] root 2102: Writing partition 1, shard 3
2023-07-31 22:57:53,844 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 4
2023-07-31 22:57:53,844 [TAO Toolkit] [INFO] root 2102: Writing partition 1, shard 4
2023-07-31 22:57:56,249 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 5
2023-07-31 22:57:56,249 [TAO Toolkit] [INFO] root 2102: Writing partition 1, shard 5
2023-07-31 22:57:58,502 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 6
2023-07-31 22:57:58,502 [TAO Toolkit] [INFO] root 2102: Writing partition 1, shard 6
2023-07-31 22:58:00,703 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 7
2023-07-31 22:58:00,703 [TAO Toolkit] [INFO] root 2102: Writing partition 1, shard 7
2023-07-31 22:58:02,635 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 8
2023-07-31 22:58:02,636 [TAO Toolkit] [INFO] root 2102: Writing partition 1, shard 8
2023-07-31 22:58:04,452 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 9
2023-07-31 22:58:04,452 [TAO Toolkit] [INFO] root 2102: Writing partition 1, shard 9
2023-07-31 22:58:06,248 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 250: 
Wrote the following numbers of objects:
b'balls': 11575

2023-07-31 22:58:06,248 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 89: Cumulative object statistics
2023-07-31 22:58:06,248 [TAO Toolkit] [INFO] root 2102: Cumulative object statistics
2023-07-31 22:58:06,248 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 250: 
Wrote the following numbers of objects:
b'balls': 13458

2023-07-31 22:58:06,248 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 105: Class map. 
Label in GT: Label in tfrecords file 
b'balls': b'balls'
2023-07-31 22:58:06,248 [TAO Toolkit] [INFO] root 2102: Class map. 
Label in GT: Label in tfrecords file 
b'balls': b'balls'
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2023-07-31 22:58:06,248 [TAO Toolkit] [INFO] root 2102: For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2023-07-31 22:58:06,248 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 114: Tfrecords generation complete.
2023-07-31 22:58:06,248 [TAO Toolkit] [INFO] root 2102: TFRecords generation complete.
2023-07-31 22:58:06,249 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 221: Writing the log_warning.json
2023-07-31 22:58:06,249 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 224: There were errors in the labels. Details are logged at /workspace/tao-experiments/data/tfrecords/kitti_trainval/kitti_trainval_waring.json
2023-07-31 22:58:06,249 [TAO Toolkit] [INFO] root 2102: Dataset convert finished successfully.
Execution status: PASS
2023-07-31 22:58:12,114 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.

cell:

!ls -rlt $LOCAL_DATA_DIR/tfrecords/kitti_trainval/

output:

total 8532
-rw-r--r-- 1 root root 121654 Jul 31 22:57 kitti_trainval-fold-000-of-002-shard-00000-of-00010
-rw-r--r-- 1 root root 121656 Jul 31 22:57 kitti_trainval-fold-000-of-002-shard-00001-of-00010
-rw-r--r-- 1 root root 121654 Jul 31 22:57 kitti_trainval-fold-000-of-002-shard-00002-of-00010
-rw-r--r-- 1 root root 121655 Jul 31 22:57 kitti_trainval-fold-000-of-002-shard-00003-of-00010
-rw-r--r-- 1 root root 121665 Jul 31 22:57 kitti_trainval-fold-000-of-002-shard-00004-of-00010
-rw-r--r-- 1 root root 121658 Jul 31 22:57 kitti_trainval-fold-000-of-002-shard-00005-of-00010
-rw-r--r-- 1 root root 121661 Jul 31 22:57 kitti_trainval-fold-000-of-002-shard-00006-of-00010
-rw-r--r-- 1 root root 121660 Jul 31 22:57 kitti_trainval-fold-000-of-002-shard-00007-of-00010
-rw-r--r-- 1 root root 121661 Jul 31 22:57 kitti_trainval-fold-000-of-002-shard-00008-of-00010
-rw-r--r-- 1 root root 123600 Jul 31 22:57 kitti_trainval-fold-000-of-002-shard-00009-of-00010
-rw-r--r-- 1 root root 747868 Jul 31 22:57 kitti_trainval-fold-001-of-002-shard-00000-of-00010
-rw-r--r-- 1 root root 747869 Jul 31 22:57 kitti_trainval-fold-001-of-002-shard-00001-of-00010
-rw-r--r-- 1 root root 747848 Jul 31 22:57 kitti_trainval-fold-001-of-002-shard-00002-of-00010
-rw-r--r-- 1 root root 747917 Jul 31 22:57 kitti_trainval-fold-001-of-002-shard-00003-of-00010
-rw-r--r-- 1 root root 747855 Jul 31 22:57 kitti_trainval-fold-001-of-002-shard-00004-of-00010
-rw-r--r-- 1 root root 747971 Jul 31 22:57 kitti_trainval-fold-001-of-002-shard-00005-of-00010
-rw-r--r-- 1 root root 747835 Jul 31 22:58 kitti_trainval-fold-001-of-002-shard-00006-of-00010
-rw-r--r-- 1 root root 747981 Jul 31 22:58 kitti_trainval-fold-001-of-002-shard-00007-of-00010
-rw-r--r-- 1 root root 747831 Jul 31 22:58 kitti_trainval-fold-001-of-002-shard-00008-of-00010
-rw-r--r-- 1 root root    308 Jul 31 22:58 kitti_trainval_warning.json
-rw-r--r-- 1 root root 753126 Jul 31 22:58 kitti_trainval-fold-001-of-002-shard-00009-of-00010

Can you delete the kitti_trainval_warning.json file and retry? It sits inside the kitti_trainval directory, so the tfrecords_path wildcard in your spec picks it up and the dataloader tries to read it as a TFRecord, which is what produces the "corrupted record at 0" error.
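If you would rather keep the warning log, a sketch of an alternative (assuming $LOCAL_DATA_DIR on the host maps to /workspace/tao-experiments/data inside the container) is to move it out of the directory matched by the wildcard instead of deleting it:

$ sudo mv $LOCAL_DATA_DIR/tfrecords/kitti_trainval/kitti_trainval_warning.json $LOCAL_DATA_DIR/tfrecords/

After that, only the shard files remain under kitti_trainval/, so the dataloader's record count should no longer hit a non-tfrecord file.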

Since the directory is written as root and I'm in a VM, it doesn't let me go in. Can I change the permission or ownership without causing a problem?

There are some existing topics about this error: Search results for 'corrupted record at 0 #intelligent-video-analytics:tao-toolkit' - NVIDIA Developer Forums
In short, please make sure the tfrecord files are present and that none of them is zero-sized.
You can also list them from inside the container, which avoids the host permission issue:
$ tao model detectnet_v2 run ls -rltsh /workspace/tao-experiments/data/tfrecords/kitti_trainval/*
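If you do want to fix ownership on the host so you can browse the directory yourself, changing ownership of the generated data files should not cause a problem for training, since the container runs as root and can still read them. A rough sketch (assuming $LOCAL_DATA_DIR is the host-side data directory and you want your own login to own the files):

$ sudo chown -R $(id -u):$(id -g) $LOCAL_DATA_DIR/tfrecords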

It is training now, thank you.

Would you know where I can find information on how to understand all the training parameters I'm able to adjust?

Please refer to the user guide: DetectNet_v2 - NVIDIA Docs

