Custom dataset-ValueError: steps_per_epoch must be > 0

hello I am getting this error “ValueError: steps_per_epoch must be > 0” when running this command:
!tlt-train detectnet_v2 -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned
-k $KEY
-n pld_detector
–gpus $NUM_GPUS
in notebook .

The full log is :
Using TensorFlow backend.
2021-08-20 10:49:53.686204: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-08-20 10:49:56.241289: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-08-20 10:49:56.277385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: Quadro RTX 4000 major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:65:00.0
2021-08-20 10:49:56.277431: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-08-20 10:49:56.277496: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-08-20 10:49:56.278523: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-08-20 10:49:56.278895: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-08-20 10:49:56.280191: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-08-20 10:49:56.281207: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-08-20 10:49:56.281298: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-08-20 10:49:56.282006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2021-08-20 10:49:56.282037: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-08-20 10:49:56.877838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-08-20 10:49:56.877882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2021-08-20 10:49:56.877890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2021-08-20 10:49:56.878919: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6174 MB memory) → physical GPU (device: 0, name: Quadro RTX 4000, pci bus id: 0000:65:00.0, compute capability: 7.5)
2021-08-20 10:49:56,880 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at /workspace/tlt-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt.
2021-08-20 10:49:56,884 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/tlt-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt
Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 569, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/cost_function/cost_auto_weight_hook.py”, line 26, in build_cost_auto_weight_hook
ValueError: steps_per_epoch must be > 0

Here is my specs file:
random_seed: 42
dataset_config {
data_sources {
tfrecords_path: “/workspace/tlt-experiments/data/data/tfrecords/kitti_trainval/*”
image_directory_path: “/workspace/tlt-experiments/data/data/”
}
image_extension: “jpeg”
target_class_mapping {
key: “lpd”
value: “lpd”
}
validation_fold: 0
}
augmentation_config {
preprocessing {
output_image_width: 720
output_image_height: 1168
min_bbox_width: 1.0
min_bbox_height: 1.0
output_image_channel: 3
}
spatial_augmentation {
hflip_probability: 0.5
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0
}
color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}
postprocessing_config {
target_class_config {
key: “lpd”
value {
clustering_config {
coverage_threshold: 0.00499999988824
dbscan_eps: 0.20000000298
dbscan_min_samples: 0.0500000007451
minimum_bounding_box_height: 4
}
}
}
}
model_config {
pretrained_model_file: “/workspace/tlt-experiments/detectnet_v2/pretrained/ccpd_unpruned.tlt”
num_layers: 18
use_batch_norm: true
objective_set {
bbox {
scale: 35.0
offset: 0.5
}
cov {
}
}
training_precision {
backend_floatx: FLOAT32
}
arch: “resnet”
}
evaluation_config {
validation_period_during_training: 10
first_validation_epoch: 1
minimum_detection_ground_truth_overlap {
key: “lpd”
value: 0.699999988079
}
evaluation_box_config {
key: “lpd”
value {
minimum_height: 10
maximum_height: 9999
minimum_width: 10
maximum_width: 9999
}
}
average_precision_mode: INTEGRATE
}
cost_function_config {
target_classes {
name: “lpd”
class_weight: 1.0
coverage_foreground_weight: 0.0500000007451
objectives {
name: “cov”
initial_weight: 1.0
weight_target: 1.0
}
objectives {
name: “bbox”
initial_weight: 10.0
weight_target: 10.0
}
}
enable_autoweighting: true
max_objective_weight: 0.999899983406
min_objective_weight: 9.99999974738e-05
}
training_config {
batch_size_per_gpu: 4
num_epochs: 120
enable_qat: False
learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 5e-06
max_learning_rate: 5e-04
soft_start: 0.10000000149
annealing: 0.699999988079
}
}
regularizer {
type: L1
weight: 3.00000002618e-09
}
optimizer {
adam {
epsilon: 9.99999993923e-09
beta1: 0.899999976158
beta2: 0.999000012875
}
}
cost_scaling {
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}
checkpoint_interval: 10
}
bbox_rasterizer_config {
target_class_config {
key: “lpd”
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.40000000596
cov_radius_y: 0.40000000596
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.400000154972
}

I am using tlt-stremanalytics:v2.0_py3 and follow the Creating a Real-Time License Plate Detection and Recognition App | NVIDIA Technical Blog to do license plate recognition.

How many images in your training dataset?

6911images in total and I split the data into two parts: 80% for the training set and 20% for the validation set:
kitti_config {
root_directory_path: “/workspace/tlt-experiments/data/data”
image_dir_name: “image”
label_dir_name: “label”
image_extension: “.jpg”
partition_mode: “random”
num_partitions: 2
val_split: 20
num_shards: 4
}

for more information, I found that all my tfrecords files are empty, but creating the tfrecords using the tlt-dataset-convert runs fine.
Converting Tfrecords for kitti trainval dataset
2021-08-23 02:08:53.074311: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Using TensorFlow backend.
2021-08-23 02:08:55,280 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2021-08-23 02:08:55,284 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Creating output directory /workspace/tlt-experiments/data/data/tfrecords/kitti_trainval
2021-08-23 02:08:55,308 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 0 Val: 0
2021-08-23 02:08:55,308 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2021-08-23 02:08:55,308 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2021-08-23 02:08:55,308 - tensorflow - WARNING - From /home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2021-08-23 02:08:55,322 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2021-08-23 02:08:55,329 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2021-08-23 02:08:55,337 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2021-08-23 02:08:55,344 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:

2021-08-23 02:08:55,344 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2021-08-23 02:08:55,352 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2021-08-23 02:08:55,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2021-08-23 02:08:55,366 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2021-08-23 02:08:55,373 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:

2021-08-23 02:08:55,373 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2021-08-23 02:08:55,373 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:

2021-08-23 02:08:55,374 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2021-08-23 02:08:55,374 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.

Never mind. I miswrote .jpg for image_extension, which is .jpeg actually

1 Like