One class missing from tfrecords - Training stops with mAP equal to 0


• Hardware: GCP A100 GPU-enabled machine
• Network Type: Detectnet_v2

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit-tf', 'nvidia/tao/tao-toolkit-pyt', 'nvidia/tao/tao-toolkit-lm']
format_version: 2.0
toolkit_version: 3.22.02

Hi
We are trying to retrain Detectnet_v2 with the resnet34 architecture. Our dataset has 4 classes: female_adult, male_adult, male_child, female_child. When generating the tfrecords, however, we noticed that nothing was written for the class female_adult, and the class male_adult is listed with the number 2. This is the output we get:

Converting Tfrecords for kitti trainval dataset
2022-03-27 08:41:55,344 [INFO] root: Registry: [‘nvcr.io’]
2022-03-27 08:41:55,433 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
2022-03-27 08:41:55,448 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the “user”:“UID:GID” in the
DockerOptions portion of the “/home/janet/.tao_mounts.json” file. You can obtain your
users UID and GID by using the “id -u” and “id -g” commands on the
terminal.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2022-03-27 08:42:02,514 [INFO] iva.detectnet_v2.dataio.build_converter: Instantiating a kitti converter
2022-03-27 08:42:02,515 [INFO] root: Instantiating a kitti converter
2022-03-27 08:42:02,515 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Creating output directory /workspace/tao-experiments/data/tfrecords/kitti_trainval
2022-03-27 08:42:02,515 [INFO] root: Generating partitions
2022-03-27 08:42:02,517 [INFO] iva.detectnet_v2.dataio.kitti_converter_lib: Num images in
Train: 463 Val: 75
2022-03-27 08:42:02,517 [INFO] root: Num images in
Train: 463 Val: 75
2022-03-27 08:42:02,517 [INFO] iva.detectnet_v2.dataio.kitti_converter_lib: Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2022-03-27 08:42:02,517 [INFO] root: Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2022-03-27 08:42:02,517 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 0
2022-03-27 08:42:02,517 [INFO] root: Writing partition 0, shard 0
WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:161: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2022-03-27 08:42:02,518 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/75913d2aee35770fa76c4a63d877f3aa/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:161: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

/usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:297: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2022-03-27 08:42:02,531 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 1
2022-03-27 08:42:02,531 [INFO] root: Writing partition 0, shard 1
2022-03-27 08:42:02,538 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 2
2022-03-27 08:42:02,538 [INFO] root: Writing partition 0, shard 2
2022-03-27 08:42:02,544 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 3
2022-03-27 08:42:02,544 [INFO] root: Writing partition 0, shard 3
2022-03-27 08:42:02,551 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 4
2022-03-27 08:42:02,551 [INFO] root: Writing partition 0, shard 4
2022-03-27 08:42:02,557 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 5
2022-03-27 08:42:02,558 [INFO] root: Writing partition 0, shard 5
2022-03-27 08:42:02,564 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 6
2022-03-27 08:42:02,564 [INFO] root: Writing partition 0, shard 6
2022-03-27 08:42:02,571 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 7
2022-03-27 08:42:02,571 [INFO] root: Writing partition 0, shard 7
2022-03-27 08:42:02,578 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 8
2022-03-27 08:42:02,578 [INFO] root: Writing partition 0, shard 8
2022-03-27 08:42:02,584 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 0, shard 9
2022-03-27 08:42:02,584 [INFO] root: Writing partition 0, shard 9
2022-03-27 08:42:02,595 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib:
Wrote the following numbers of objects:

2022-03-27 08:42:02,595 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 0
2022-03-27 08:42:02,595 [INFO] root: Writing partition 1, shard 0
2022-03-27 08:42:02,638 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 1
2022-03-27 08:42:02,638 [INFO] root: Writing partition 1, shard 1
2022-03-27 08:42:02,679 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 2
2022-03-27 08:42:02,679 [INFO] root: Writing partition 1, shard 2
2022-03-27 08:42:02,721 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 3
2022-03-27 08:42:02,722 [INFO] root: Writing partition 1, shard 3
2022-03-27 08:42:02,768 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 4
2022-03-27 08:42:02,768 [INFO] root: Writing partition 1, shard 4
2022-03-27 08:42:02,810 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 5
2022-03-27 08:42:02,810 [INFO] root: Writing partition 1, shard 5
2022-03-27 08:42:02,852 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 6
2022-03-27 08:42:02,852 [INFO] root: Writing partition 1, shard 6
2022-03-27 08:42:02,895 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 7
2022-03-27 08:42:02,895 [INFO] root: Writing partition 1, shard 7
2022-03-27 08:42:02,933 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 8
2022-03-27 08:42:02,933 [INFO] root: Writing partition 1, shard 8
2022-03-27 08:42:02,970 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Writing partition 1, shard 9
2022-03-27 08:42:02,970 [INFO] root: Writing partition 1, shard 9
/usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:297: UserWarning: genfromtxt: Empty input file: “/workspace/tao-experiments/data/training/label_2/441.txt”
2022-03-27 08:42:03,009 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib:
Wrote the following numbers of objects:
b’male_adult’: 2
b’male_child’: 1
b’female_child’: 1

2022-03-27 08:42:03,010 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Cumulative object statistics
2022-03-27 08:42:03,010 [INFO] root: Cumulative object statistics
2022-03-27 08:42:03,010 [INFO] root: {
“male_adult”: 2,
“male_child”: 1,
“female_child”: 1
}
2022-03-27 08:42:03,010 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib:
Wrote the following numbers of objects:
b’male_adult’: 2
b’male_child’: 1
b’female_child’: 1

2022-03-27 08:42:03,010 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Class map.
Label in GT: Label in tfrecords file
b’male_adult’: b’male_adult’
b’male_child’: b’male_child’
b’female_child’: b’female_child’
2022-03-27 08:42:03,010 [INFO] root: Class map.
Label in GT: Label in tfrecords file
b’male_adult’: b’male_adult’
b’male_child’: b’male_child’
b’female_child’: b’female_child’
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2022-03-27 08:42:03,010 [INFO] root: For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2022-03-27 08:42:03,010 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: Tfrecords generation complete.
2022-03-27 08:42:03,010 [INFO] root: TFRecords generation complete.
2022-03-27 08:42:04,132 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Your help would be appreciated. Thanks

Also, please find below the training spec file and the messages we get during training. The training eventually stops, and the mAP values we get are all zero.
detectnet_v2_train_resnet34_kitti.txt.txt (6.8 KB)
Training messages.txt (328.5 KB)

Can you double-check your training/val dataset?
According to the log above, very few objects were written for each class.

Hi @Morganh

In the training folder, we have all of the images, and they contain people from the 4 different classes. Our problem is that the object statistics are always missing one class, female_adult, and male_adult has the number 2 next to it, which we do not quite understand.
This problem only occurs with the dataset built from our custom images. When we tried the images from the KITTI dataset recommended in the Detectnet_v2 notebook, we had no problem with the tfrecords.

I will send you privately different examples of the images we are using with their labels. Could you please check them?

Thank you

I checked the first image you shared, and its label is not correct.
Please check your dataset.
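
A quick way to check the whole label folder at once is to count the class names yourself and flag anything that does not look like a KITTI label line. The script below is only a rough sketch, not part of TAO: the function name `audit_kitti_labels` is made up for illustration, the class list is the one from this thread, and it assumes standard KITTI object labels with 15 whitespace-separated fields per line. A class name containing a space (for example `male _adult`) would show up as both an unexpected class name and a wrong field count.

```python
from collections import Counter
from pathlib import Path

# Class names taken from this thread; adjust them for your own dataset.
EXPECTED_CLASSES = {"female_adult", "male_adult", "male_child", "female_child"}
# Assumption: standard KITTI object labels, 15 whitespace-separated fields per line.
KITTI_FIELDS = 15


def audit_kitti_labels(label_dir, expected_classes=EXPECTED_CLASSES):
    """Return (class counts, empty label files, problems) for a label folder."""
    counts = Counter()
    empty_files = []
    problems = []
    for path in sorted(Path(label_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        if not text.strip():
            # Files like this trigger the "Empty input file" warning in the log.
            empty_files.append(path.name)
            continue
        for lineno, line in enumerate(text.splitlines(), start=1):
            fields = line.split()
            if not fields:
                continue  # skip blank lines inside a file
            name = fields[0]
            counts[name] += 1
            if name not in expected_classes:
                problems.append((path.name, lineno, f"unexpected class name {name!r}"))
            if len(fields) != KITTI_FIELDS:
                problems.append(
                    (path.name, lineno, f"expected {KITTI_FIELDS} fields, found {len(fields)}")
                )
    return counts, empty_files, problems
```

Call it with the path to your own label folder, for example `audit_kitti_labels("data/training/label_2")` (a placeholder path), and compare the returned counts with the "Wrote the following numbers of objects" section of the converter log.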

Hi

An update on the previous issue: as advised, we checked the dataset and corrected the labels. Below is the output we get after generating the tfrecords, and we now successfully get all 4 classes. However, when training we get the error included in the train-error file below.
It is not clear why this error occurs. Your help would be much appreciated.
tfrecords-4classes.txt (8.0 KB)
train-error.txt (67.0 KB)

Thank you

Please create a new results folder (mkdir) and retry.
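
If you script your experiments, you can create an empty output folder for every run so that a retry never lands in a folder left over from an earlier attempt. This is only a sketch: the helper name `fresh_results_dir` and the `results_run` prefix are made up for illustration and are not TAO conventions.

```python
from pathlib import Path


def fresh_results_dir(base_dir, prefix="results_run"):
    """Create and return a new, empty folder named <prefix>_<n> under base_dir."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    index = 1
    while True:
        candidate = base / f"{prefix}_{index}"
        try:
            # mkdir() fails if the folder already exists, so an old run is never reused.
            candidate.mkdir()
            return candidate
        except FileExistsError:
            index += 1
```

Pass the returned path as the results directory of the next training run.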

Thank you, Morgan. We successfully finished the training.

We appreciate your help.
