Training with TAO Toolkit results in 0.0000% accuracy

oleg.s · January 30, 2024, 5:54pm

Please provide the following information when requesting support.

• Hardware (NVIDIA GeForce GTX 1650)
• Network Type (Detectnet_v2)
• Training spec file (lpd_train_resnet18_kitti.txt (3.4 KB))
Dataset used: License Plate Recognition Object Detection Dataset and Pre-Trained Model by Roboflow Universe Projects

• How to reproduce the issue ?

I wanted to train my own Licence Plate Detection System in NVIDIA TAO.

I downloaded the dataset from the link above and followed the steps from the sample notebook for detectnet_v2.
I managed to create the training data sample. I had to clean it up because some images had no labels, and I was then able to create the TFRecords.
I installed the NGC CLI and could download the pretrained model.
I created my own training specification file (and have already modified quite a few values).
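
For anyone repeating the cleanup step above, here is a minimal sketch of how images without labels could be found before creating the TFRecords. It assumes a KITTI-style layout in which every image has a label file with the same base name and a .txt extension; the function name and the directory arguments are hypothetical and not part of TAO.

```python
from pathlib import Path

# Image extensions considered here; adjust to match the dataset (assumption).
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def find_unlabeled_images(image_dir, label_dir):
    """Return image paths whose KITTI label file is missing or empty."""
    unlabeled = []
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        label_path = Path(label_dir) / (image_path.stem + ".txt")
        # A label file with only whitespace is treated as "no labels".
        if not label_path.is_file() or not label_path.read_text().strip():
            unlabeled.append(image_path)
    return unlabeled
```

Calling find_unlabeled_images("images", "labels") on the dataset folders returns the images to exclude (or to inspect by hand) before running the dataset conversion.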

→ However, I always get 0.0000% accuracy after training…

Validation cost: 0.000010
Mean average_precision (in %): 0.0000

+------------+--------------------------+
| class name | average precision (in %) |
+------------+--------------------------+
|    lpd     |           0.0            |
+------------+--------------------------+

Median Inference Time: 0.025054
2024-01-29 18:32:20,893 [TAO Toolkit] [INFO] root 2102: Evaluation metrics generated.
2024-01-29 18:32:20,893 [TAO Toolkit] [INFO] root 2102: Training loop completed.
2024-01-29 18:32:20,894 [TAO Toolkit] [INFO] root 2102: Saving trained model.
2024-01-29 18:32:21,056 [TAO Toolkit] [INFO] root 2102: Model saved.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

But the training loss goes down from epoch to epoch:


INFO:tensorflow:epoch = 0.00043122035360068997, learning_rate = 5.1002854e-07, loss = 0.08813875, step = 2 (329.480 sec)
2024-01-29 18:08:51,000 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.00043122035360068997, learning_rate = 5.1002854e-07, loss = 0.08813875, step = 2 (329.480 sec)

...

INFO:tensorflow:epoch = 0.9911599827511859, learning_rate = 5.7266834e-07, loss = 0.00061230396, step = 4597 (5.294 sec)
2024-01-29 18:30:12,241 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.9911599827511859, learning_rate = 5.7266834e-07, loss = 0.00061230396, step = 4597 (5.294 sec)

There is probably an issue with the configuration file, but I cannot spot it…

There was already a similar post (Mean average precision of 0.00 for detectnet_v2 using Tao Toolkit). I tried to follow the hints there, but they did not help.

Best regards

Morganh · January 31, 2024, 3:14am

Please refer to the LPD training spec file at deepstream_tao_apps/misc/dev_blog/LPDR/lpd/SPECS_train.txt (release/tlt3.0 branch of NVIDIA-AI-IOT/deepstream_tao_apps on GitHub). It loads the LPD pretrained model; see line 74 of that file.

In addition, please make these two changes:

Add enable_auto_resize: true after line 44 of that file.
Change every dbscan_min_samples: 0.0500000007451 to dbscan_min_samples: 1
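
The dbscan_min_samples change can also be applied with a short script instead of editing each occurrence by hand. This is only a sketch that works on the spec file as plain text; the function name is hypothetical, and it deliberately does not touch enable_auto_resize, which still has to be added manually at the position described above.

```python
import re

# Matches "dbscan_min_samples: <number>" and captures everything up to the value.
DBSCAN_LINE = re.compile(r"(dbscan_min_samples\s*:\s*)[-+0-9.eE]+")


def set_dbscan_min_samples(spec_text, new_value=1):
    """Return (patched_text, number_of_replacements) for a spec given as text."""
    # A function replacement avoids ambiguity between the group reference
    # and the digits of the new value.
    return DBSCAN_LINE.subn(lambda m: f"{m.group(1)}{new_value}", spec_text)
```

Reading the spec with Path(...).read_text(), passing it through set_dbscan_min_samples, and writing the result to a new file keeps the original spec intact for comparison.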

Morganh · January 31, 2024, 3:33am

I have updated the comment above.

oleg.s · February 1, 2024, 12:30pm

Hello, many thanks for the quick response.

New config file:
lpd_train_resnet18_kitti_v3.txt (3.2 KB)

I downloaded the model from here and put it in the respective folder: LPDNet | NVIDIA NGC

Unfortunately, this did not help. The training does not even start:

2024-02-01 13:28:08,504 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2024-02-01 13:28:08,603 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 360: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2024-02-01 13:28:08,719 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2024-02-01 12:28:09.472386: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2024-02-01 12:28:09,519 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2024-02-01 12:28:10,857 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2024-02-01 12:28:10,890 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2024-02-01 12:28:10,894 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2024-02-01 12:28:12,230 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-4bd30_gr because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2024-02-01 12:28:12,456 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2024-02-01 12:28:14,396 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2024-02-01 12:28:14,426 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2024-02-01 12:28:14,430 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2024-02-01 12:28:15,756 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.logging.logging 197: Log file already exists at /workspace/tao-experiments/experiment/experiment_dir_unpruned/status.json
2024-02-01 12:28:15,756 [TAO Toolkit] [INFO] root 2102: Starting DetectNet_v2 Training job
2024-02-01 12:28:15,756 [TAO Toolkit] [INFO] __main__ 817: Loading experiment spec at /workspace/tao-experiments/experiment/specs/lpd_train_resnet18_kitti_v3.txt.
2024-02-01 12:28:15,756 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.spec_handler.spec_loader 113: Merging specification from /workspace/tao-experiments/experiment/specs/lpd_train_resnet18_kitti_v3.txt
2024-02-01 12:28:15,760 [TAO Toolkit] [INFO] root 2102: Training gridbox model.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2024-02-01 12:28:15,760 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2024-02-01 12:28:16,977 [TAO Toolkit] [INFO] root 522: Sampling mode of the dataloader was set to user_defined.
2024-02-01 12:28:16,978 [TAO Toolkit] [INFO] __main__ 99: Cannot iterate over exactly 18551 samples with a batch size of 4; each epoch will therefore take one extra step.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/cost_function/cost_auto_weight_hook.py:122: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

2024-02-01 12:28:16,979 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/cost_function/cost_auto_weight_hook.py:122: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/cost_function/cost_auto_weight_hook.py:125: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

2024-02-01 12:28:16,979 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/cost_function/cost_auto_weight_hook.py:125: The name tf.get_variable is deprecated. Please use tf.compat.v1.get_variable instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/cost_function/cost_auto_weight_hook.py:128: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

2024-02-01 12:28:16,982 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/cost_function/cost_auto_weight_hook.py:128: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

2024-02-01 12:28:16,999 [TAO Toolkit] [INFO] root 2102: Building DetectNet V2 model
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

2024-02-01 12:28:16,999 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

2024-02-01 12:28:17,001 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

2024-02-01 12:28:17,017 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

2024-02-01 12:28:17,671 [TAO Toolkit] [INFO] __main__ 1032: Training was interrupted.
2024-02-01 12:28:17,672 [TAO Toolkit] [INFO] root 2102: Training was interrupted
Time taken to run __main__:main: 0:00:02.234440.
Execution status: PASS
2024-02-01 13:28:24,157 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Morganh · February 2, 2024, 1:19am

Please add -k nvidia_tlt to the command line, since the .tlt model is encrypted with the key “nvidia_tlt”. See LPDNet | NVIDIA NGC.

oleg.s · February 19, 2024, 9:31am

Unfortunately, it did not work…

This is the config: lpd_train_resnet18_kitti_v3.txt (3.2 KB)

This is how it was executed:

!tao model detectnet_v2 train -e $SPECS_DIR/lpd_train_resnet18_kitti_v3.txt \
                        -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                        -n resnet18_detector \
                        -k nvidia_tlt \
                        --gpus $NUM_GPUS

Morganh · February 23, 2024, 3:25am

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

From your spec file,

    key: "License_Plate"
    value: "lpd"

Could you share one of your label files?
Is the class name “License_Plate”?

If your labels use the class name “License_Plate”, you need to set the following in the config file:

    key: "License_Plate"
    value: "License_Plate"

That means, the value should be the same as the actual class name.
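
A quick way to answer the question about the label class name is to compare the names that actually occur in the label files with the target_class_mapping entries in the spec. The sketch below assumes KITTI-format labels (class name as the first field of each line) and a spec written as target_class_mapping { key: "..." value: "..." }; all function names are hypothetical, and the regular expression is a simplification rather than a full parser for the spec format.

```python
import re
from collections import Counter
from pathlib import Path

# Simplified pattern for: target_class_mapping { key: "A" value: "B" }
MAPPING_BLOCK = re.compile(
    r'target_class_mapping\s*\{\s*key\s*:\s*"([^"]+)"\s*value\s*:\s*"([^"]+)"\s*\}'
)


def label_class_counts(label_dir):
    """Count how often each class name appears across all KITTI label files."""
    counts = Counter()
    for label_path in Path(label_dir).glob("*.txt"):
        for line in label_path.read_text().splitlines():
            fields = line.split()
            if fields:
                counts[fields[0]] += 1  # first KITTI field is the class name
    return counts


def spec_class_mapping(spec_text):
    """Extract {key: value} pairs from the target_class_mapping blocks."""
    return dict(MAPPING_BLOCK.findall(spec_text))


def classes_missing_from_mapping(label_dir, spec_text):
    """List label class names that do not appear as a mapping key or value."""
    mapping = spec_class_mapping(spec_text)
    label_classes = set(label_class_counts(label_dir))
    return {
        "not_a_key": sorted(label_classes - set(mapping)),
        "not_a_value": sorted(label_classes - set(mapping.values())),
    }
```

If either list in the result is non-empty, the names in the spec and in the label files differ and are worth checking against the advice above.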

system · March 19, 2024, 2:05am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
