ZeroDivisionError when training peoplenet

meghpatel · August 26, 2020, 9:21am

I’m trying to train Peoplenet on custom images (960 X 544).

This is the command I’m running

!tlt-train detectnet_v2 -e $SPECS_DIR/config_peoplenet.txt -r "/tlt-experiments/dataset/" -k "tlt_encode" --gpus 1

This is the error I’m getting. I’ve checked that the TFRecords are properly generated. Can you help me pinpoint where the error might be?

Morganh · August 26, 2020, 9:37am

Please share the log from generating the tfrecords.
Also, please check whether any of the tfrecord files has a size of 0.

meghpatel · August 26, 2020, 9:39am

Now that I saw the log, I realised. Could it be because I have not used face and bag in my new training set and it is there in the config file?

Morganh · August 26, 2020, 9:41am

How many images are in your dataset?

meghpatel · August 26, 2020, 9:42am

Close to 100.

I just checked: some of the TFRecords are 0B.

Morganh · August 26, 2020, 9:54am

Please remove the 0B files.
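
One way to spot the 0B files from the notebook is a small helper like the sketch below. The function name `find_empty_files` and the example directory are hypothetical and not part of TLT; point it at wherever your tfrecords were written.

```python
from pathlib import Path


def find_empty_files(directory, pattern="*"):
    """Return a sorted list of files in `directory` whose size is 0 bytes."""
    return sorted(
        p for p in Path(directory).glob(pattern)
        if p.is_file() and p.stat().st_size == 0
    )


# Example usage (the path is an assumption, adjust it to your output folder):
# for path in find_empty_files("/workspace/tlt-experiments/tfrecords"):
#     print(path)       # inspect first
#     # path.unlink()   # uncomment to delete the empty file
```

It only lists the files; the delete call is left commented out so you can review the list first.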

meghpatel · August 26, 2020, 10:00am

I’m still getting the error.

Morganh · August 26, 2020, 10:02am

Please share the full log of tfrecords generation.
Also paste the command and its spec here.

meghpatel · August 26, 2020, 10:09am

Full TFRecords log

2020-08-25 15:00:51.094852: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Using TensorFlow backend.
2020-08-25 15:00:54,357 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2020-08-25 15:00:54,358 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 81 Val: 8
2020-08-25 15:00:54,358 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2020-08-25 15:00:54,358 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2020-08-25 15:00:54,358 - tensorflow - WARNING - From /home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 4
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 5
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 6
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 7
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 8
2020-08-25 15:00:54,359 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 9
/usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:273: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2020-08-25 15:00:54,378 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'person': 26

2020-08-25 15:00:54,378 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2020-08-25 15:00:54,388 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2020-08-25 15:00:54,398 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2020-08-25 15:00:54,409 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2020-08-25 15:00:54,420 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 4
2020-08-25 15:00:54,433 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 5
2020-08-25 15:00:54,443 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 6
2020-08-25 15:00:54,454 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 7
2020-08-25 15:00:54,464 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 8
2020-08-25 15:00:54,472 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
2020-08-25 15:00:54,483 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'person': 268

2020-08-25 15:00:54,483 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2020-08-25 15:00:54,483 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b'person': 294

2020-08-25 15:00:54,483 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
b'person': b'person'
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2020-08-25 15:00:54,483 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.

Conversion Spec file

kitti_config {
root_directory_path: "/workspace/tlt-experiments/data"
image_dir_name: "images"
label_dir_name: "labels"
image_extension: ".jpg"
partition_mode: "random"
num_partitions: 2
val_split: 10
num_shards: 10
}
image_directory_path: "/workspace/tlt-experiments/data/images"

Command

!tlt-dataset-convert -d $SPECS_DIR/conversion.txt -o $SPECS_DIR/tfrecords/peopletrain/peopletrain

Training Config file

config_peoplenet (1).txt (1.3 KB)

Morganh · August 27, 2020, 5:59am

You have only 8 val images, which is fewer than num_shards (10).
Please add more val images or set a smaller num_shards.

For reference:

val_images is (val_split)% of total images. train_images is (100-val_split)% of total images.

Please make sure both of the following hold:

val_images >= num_shards

train_images >= num_shards
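
A rough way to check these two conditions before running the conversion is sketched below. The function name `check_shard_counts` is hypothetical, and rounding the val count down is an assumption; it happens to match the log above (89 images, val_split 10, giving Train: 81 and Val: 8), but the converter's exact rounding is not confirmed here.

```python
def check_shard_counts(total_images, val_split, num_shards):
    """Estimate the train/val split and check both counts against num_shards.

    Assumption: the val count is val_split percent of the total, rounded down.
    """
    val_images = total_images * val_split // 100
    train_images = total_images - val_images
    return {
        "val_images": val_images,
        "train_images": train_images,
        "val_ok": val_images >= num_shards,
        "train_ok": train_images >= num_shards,
    }


# Values from this thread: 89 images, val_split: 10, num_shards: 10
print(check_shard_counts(89, 10, 10))
# {'val_images': 8, 'train_images': 81, 'val_ok': False, 'train_ok': True}
```

With these numbers, either lowering num_shards to 8 or fewer, or adding images until the val count reaches 10, would satisfy both conditions.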
