ZeroDivisionError when training PeopleNet

I’m trying to train PeopleNet on custom images (960 x 544).

This is the command I’m running:

!tlt-train detectnet_v2 -e $SPECS_DIR/config_peoplenet.txt -r "/tlt-experiments/dataset/" -k "tlt_encode" --gpus 1

This is the error I’m getting. I’ve checked that the TFRecords are properly generated. Can you help me find where the error could be coming from?

Please share the log from when you generated the tfrecords.
Also, please check whether any of the tfrecord files is 0 bytes in size.

Now that I’ve looked at the log, I have a thought. Could it be because my new training set has no face or bag labels, while those classes are still in the config file?

How many images are in your dataset?

Close to 100.

I just checked, and some of the TFRecords are 0B.

Please remove the 0B files.
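
For anyone following along, here is a minimal Python sketch for listing the 0B files from a notebook cell. The helper name find_empty_files and the directory path are assumptions made for illustration, not part of TLT; point the path at wherever your tfrecords were written.

```python
from pathlib import Path


def find_empty_files(directory):
    """Return the sorted paths of zero-byte regular files directly inside `directory`."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_size == 0
    )


# Hypothetical location; replace with your own tfrecords output directory.
tfrecords_dir = Path("tfrecords/peopletrain")

if tfrecords_dir.is_dir():
    for path in find_empty_files(tfrecords_dir):
        print("empty shard:", path)
        # path.unlink()  # uncomment to actually delete the empty file
```

The delete call is left commented out so the list can be reviewed first.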

I’m still getting the error.

Please share the full log of tfrecords generation.
Also paste the command and its spec here.

Full TFRecords log

2020-08-25 15:00:51.094852: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Using TensorFlow backend.
2020-08-25 15:00:54,357 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2020-08-25 15:00:54,358 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 81 Val: 8
2020-08-25 15:00:54,358 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2020-08-25 15:00:54,358 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2020-08-25 15:00:54,358 - tensorflow - WARNING - From /home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 4
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 5
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 6
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 7
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 8
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 9
/usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:273: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2020-08-25 15:00:54,378 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'person': 26

2020-08-25 15:00:54,378 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2020-08-25 15:00:54,388 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2020-08-25 15:00:54,398 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2020-08-25 15:00:54,409 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2020-08-25 15:00:54,420 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 4
2020-08-25 15:00:54,433 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 5
2020-08-25 15:00:54,443 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 6
2020-08-25 15:00:54,454 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 7
2020-08-25 15:00:54,464 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 8
2020-08-25 15:00:54,472 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
2020-08-25 15:00:54,483 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'person': 268

2020-08-25 15:00:54,483 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2020-08-25 15:00:54,483 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'person': 294

2020-08-25 15:00:54,483 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
b'person': b'person'
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2020-08-25 15:00:54,483 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.

Conversion Spec file

kitti_config {
  root_directory_path: "/workspace/tlt-experiments/data"
  image_dir_name: "images"
  label_dir_name: "labels"
  image_extension: ".jpg"
  partition_mode: "random"
  num_partitions: 2
  val_split: 10
  num_shards: 10
}
image_directory_path: "/workspace/tlt-experiments/data/images"

Command

!tlt-dataset-convert -d $SPECS_DIR/conversion.txt -o $SPECS_DIR/tfrecords/peopletrain/peopletrain

Training Config file

config_peoplenet (1).txt (1.3 KB)

You have only 8 val images, which is fewer than num_shards (10).
Please add more val images or set a smaller num_shards.

For reference:

val_images is (val_split)% of total images. train_images is (100 - val_split)% of total images.

Please make sure both of the following hold at the same time:

  1. val_images >= num_shards
  2. train_images >= num_shards
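
The two conditions can be checked before running tlt-dataset-convert with a small sketch like the one below. The function name check_shard_config is hypothetical, and rounding the validation count down is an assumption; it does reproduce the "Train: 81 Val: 8" counts in the log above for 89 images with val_split: 10.

```python
def check_shard_config(total_images, val_split, num_shards):
    """Return (train_images, val_images, ok) for a conversion spec.

    Assumption: the validation count is val_split percent of the
    total, rounded down; the rest of the images go to training.
    """
    val_images = total_images * val_split // 100
    train_images = total_images - val_images
    # Both partitions need at least one image per shard.
    ok = val_images >= num_shards and train_images >= num_shards
    return train_images, val_images, ok


# Values from this thread: 81 + 8 = 89 images, val_split 10, num_shards 10.
print(check_shard_config(89, 10, 10))  # (81, 8, False)

# Lowering num_shards to 8 satisfies both conditions.
print(check_shard_config(89, 10, 8))   # (81, 8, True)
```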