Please provide the following information when requesting support.
• Hardware (T4/V100/Xavier/Nano/etc)
Azure vm A100
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc)
Detectnet_v2 resnet18
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
• Training spec file(If have, please share here)
random_seed: 42
dataset_config {
data_sources {
tfrecords_path: "/workspace/tao-experiments/data/tfrecords/kitti_trainval/*"
image_directory_path: "/workspace/tao-experiments/data/training"
}
image_extension: "jpg"
target_class_mapping {
key: "balls"
value: "balls"
}
validation_fold: 0
}
augmentation_config {
preprocessing {
output_image_width: 1280
output_image_height: 720
min_bbox_width: 1.0
min_bbox_height: 1.0
output_image_channel: 3
}
spatial_augmentation {
hflip_probability: 0.5
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}
postprocessing_config {
target_class_config {
key: "balls"
value {
clustering_config {
clustering_algorithm: DBSCAN
dbscan_confidence_threshold: 0.9
coverage_threshold: 0.00499999988824
dbscan_eps: 0.20000000298
dbscan_min_samples: 1
minimum_bounding_box_height: 20
}
}
}
}
model_config {
pretrained_model_file: "/workspace/tao-experiments/detectnet_v2/pretrained_resnet18/pretrained_detectnet_v2_vresnet18/resnet18.hdf5"
num_layers: 18
use_batch_norm: true
objective_set {
bbox {
scale: 35.0
offset: 0.5
}
cov {
}
}
arch: "resnet"
}
evaluation_config {
validation_period_during_training: 10
first_validation_epoch: 30
minimum_detection_ground_truth_overlap {
key: "balls"
value: 0.699999988079
}
evaluation_box_config {
key: "ball"
value {
minimum_height: 20
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
average_precision_mode: INTEGRATE
}
cost_function_config {
target_classes {
name: "balls"
class_weight: 1.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: "cov"
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: "bbox"
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: false
max_objective_weight: 0.999899983406
min_objective_weight: 9.99999974738e-05
}
training_config {
batch_size_per_gpu: 4
num_epochs: 120
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-07
max_learning_rate: 5e-05
soft_start: 0.10000000149
annealing: 0.699999988079
}
}
regularizer {
type: L1
weight: 3.00000002618e-09
}
optimizer {
adam {
epsilon: 9.99999993923e-09
beta1: 0.899999976158
beta2: 0.999000012875
}
}
cost_scaling {
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
visualizer{
enabled: true
num_images: 3
scalar_logging_frequency: 50
infrequent_logging_frequency: 5
target_class_config {
key: "ball"
value: {
coverage_threshold: 0.005
}
}
}
checkpoint_interval: 10
}
bbox_rasterizer_config {
target_class_config {
key: "balls"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.40000000596
cov_radius_y: 0.40000000596
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.400000154972
}
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)
when i go to train it something fails and im not sure what the error means, is this something im doing wrong or is it TAO? thank you.
cell:
!tao model detectnet_v2 train -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt \
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
-n resnet18_detector \
--gpus $NUM_GPUS
output
2023-07-31 22:59:22,715 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2023-07-31 22:59:22,767 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2023-07-31 22:59:22,775 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 275: Printing tty value True
2023-07-31 22:59:28.050353: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2023-07-31 22:59:28,086 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2023-07-31 22:59:29,230 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:29,260 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:29,263 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:31,493 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:33,102 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:33,129 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:33,132 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable TF_ALLOW_IOLIBS=1.
2023-07-31 22:59:35,983 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.logging.logging 197: Log file already exists at /workspace/tao-experiments/detectnet_v2/experiment_dir_unpruned/status.json
2023-07-31 22:59:35,984 [TAO Toolkit] [INFO] root 2102: Starting DetectNet_v2 Training job
2023-07-31 22:59:35,984 [TAO Toolkit] [INFO] __main__ 817: Loading experiment spec at /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt.
2023-07-31 22:59:35,985 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.spec_handler.spec_loader 113: Merging specification from /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt
2023-07-31 22:59:35,988 [TAO Toolkit] [INFO] root 2102: Training gridbox model.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
2023-07-31 22:59:35,988 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
2023-07-31 22:59:36,004 [TAO Toolkit] [INFO] root 2102: corrupted record at 0
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 1067, in <module>
raise e
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 1046, in <module>
main()
File "/usr/local/lib/python3.8/dist-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
return_args = fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 1024, in main
run_experiment(
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 887, in run_experiment
train_gridbox(results_dir, experiment_spec, output_model_file_name, input_model_file_name,
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/train.py", line 658, in train_gridbox
dataloader = build_dataloader(dataset_proto=dataset_proto,
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/build_dataloader.py", line 277, in build_dataloader
return DATALOADER[dataloader_mode](**dataloader_kwargs)
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/drivenet_dataloader.py", line 501, in __init__
self._construct_data_sources(self.training_data_sources)
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/drivenet_dataloader.py", line 545, in _construct_data_sources
DriveNetTFRecordsDataSource(
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/drivenet_dataloader.py", line 404, in __init__
self.num_samples = sum([sum(1 for _ in tf.compat.v1.python_io.tf_record_iterator(filename))
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/drivenet_dataloader.py", line 404, in <listcomp>
self.num_samples = sum([sum(1 for _ in tf.compat.v1.python_io.tf_record_iterator(filename))
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/dataloader/drivenet_dataloader.py", line 404, in <genexpr>
self.num_samples = sum([sum(1 for _ in tf.compat.v1.python_io.tf_record_iterator(filename))
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/lib/io/tf_record.py", line 181, in tf_record_iterator
reader.GetNext()
File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 1034, in GetNext
return _pywrap_tensorflow_internal.PyRecordReader_GetNext(self)
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 0
Execution status: FAIL
2023-07-31 22:59:40,029 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.