tlt-train using DetectNet_V2 : getting 0 as average precision for each classes

samjith888 · November 28, 2019, 10:39am

INFO:tensorflow:Saving checkpoints for step-19550.
2019-11-28 10:25:49,337 [INFO] tensorflow: Saving checkpoints for step-19550.
2019-11-28 10:25:50,819 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 97, 0.00s/step
2019-11-28 10:25:53,518 [INFO] iva.detectnet_v2.evaluation.evaluation: step 10 / 97, 0.27s/step
2019-11-28 10:25:56,181 [INFO] iva.detectnet_v2.evaluation.evaluation: step 20 / 97, 0.27s/step
2019-11-28 10:25:58,780 [INFO] iva.detectnet_v2.evaluation.evaluation: step 30 / 97, 0.26s/step
2019-11-28 10:26:01,464 [INFO] iva.detectnet_v2.evaluation.evaluation: step 40 / 97, 0.27s/step
2019-11-28 10:26:04,098 [INFO] iva.detectnet_v2.evaluation.evaluation: step 50 / 97, 0.26s/step
2019-11-28 10:26:06,764 [INFO] iva.detectnet_v2.evaluation.evaluation: step 60 / 97, 0.27s/step
2019-11-28 10:26:09,330 [INFO] iva.detectnet_v2.evaluation.evaluation: step 70 / 97, 0.26s/step
2019-11-28 10:26:11,954 [INFO] iva.detectnet_v2.evaluation.evaluation: step 80 / 97, 0.26s/step
2019-11-28 10:26:14,593 [INFO] iva.detectnet_v2.evaluation.evaluation: step 90 / 97, 0.26s/step
Epoch 50/120
=========================

Validation cost: -0.000009
Mean average_precision (in %): 0.0000

class name                    average precision (in %)
--------------------------  --------------------------
Cl                                   0
Fl                                   0
Ladders                              0
Plat                                 0
Stac                                 0
Stalls                               0
Sp                                   0

Median Inference Time: 0.065659

Epoch 55/120
=========================

Validation cost: -0.000009
Mean average_precision (in %): 0.0000

class name                    average precision (in %)
--------------------------  --------------------------
Cl                                  0
Fl                                  0
Ladders                             0
Plat                                0
Stac                                0
Stalls                              0
Sp                                  0

Please check the attached training config file.

The images used for training are high resolution (4096*2160).
Each line in label.txt looks like the following:

Fl 0.00 0.00 0 901.03808 635.71608 3158.048768 2160.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00

augmentation_config {
preprocessing {
output_image_width: 768
output_image_height: 768
min_bbox_width: 2.0
min_bbox_height: 2.0
output_image_channel: 3
}
train_config.txt (9.2 KB)

Morganh · November 28, 2019, 3:19pm

Hi
I find several culprits.

Your label.txt is not in the expected format. See Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation: each object should have 15 fields in total.
Did you generate the tfrecord files successfully with “tlt-dataset-convert”?
Your attached training config file does not exactly match what you mentioned.
In your attachment,

output_image_width: 768
output_image_height: 768

Could you attach the correct config file?
Also, please see Integrating TAO Models into DeepStream — TAO Toolkit 3.22.05 documentation for the setting.

I find that the class names in your config file do not match the class names in your mAP output (Cl, Fl, etc.).

target_classes {
    name: "xxx"

Please double check if the config file is the correct one.
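
The 15-field requirement mentioned above can be checked before running tlt-dataset-convert. The following is a minimal sketch, not part of TLT: the function name `find_bad_label_lines` is hypothetical, and it assumes fields are separated by whitespace and that class names contain no spaces.

```python
from typing import List, Tuple

# KITTI-style label: class name followed by 14 numeric values (15 fields in total).
EXPECTED_FIELDS = 15


def find_bad_label_lines(text: str, expected: int = EXPECTED_FIELDS) -> List[Tuple[int, int]]:
    """Return (line_number, field_count) for each non-empty line with the wrong field count.

    Hypothetical helper for illustration; assumes whitespace-separated fields.
    """
    bad = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if fields and len(fields) != expected:
            bad.append((line_number, len(fields)))
    return bad


# The label line quoted in the question.
sample = "Fl 0.00 0.00 0 901.03808 635.71608 3158.048768 2160.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00\n"
# An empty list means every non-empty line has the expected number of fields.
print(find_bad_label_lines(sample))
```

Running this over each label file and printing any non-empty result is one way to rule out the label format as the cause before looking at the training spec.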

samjith888 · November 28, 2019, 5:52pm

Hello,

Actually, the class names in the mAP output are the same as in the config file (I just shortened the class names while editing this question). Also, output_image_width: 768 and output_image_height: 768 are the same as in the config file. I’m still getting 0 for average precision.

Note: I have edited the question. The label.txt files used for training have 15 fields, and the classes have the same names as in the config file.

Morganh · November 29, 2019, 1:42am

Hi samjith888,
The output_image_width and output_image_height settings in the training config file should exactly match the resolution of your training dataset.
You mentioned that the images used for training have a resolution of 4096*2160.
But your training config file is set as below, which is not expected.

augmentation_config {
  preprocessing {
    output_image_width: 768
    output_image_height: 768

samjith888 · November 29, 2019, 3:37am

I’m getting the following error when I set the augmentation config to my input image resolution.

ResourceExhaustedError : OOM when allocating tensor with shape[4,64,1080,2048] and type float on /j…
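
For scale, the size of the single tensor named in that error can be estimated from its shape. This is a rough sketch: it assumes "float" means 32-bit (4 bytes per element), and `tensor_bytes` is a hypothetical helper, not a TensorFlow function.

```python
from math import prod  # requires Python 3.8+


def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for one dense tensor; 4 bytes per element assumes float32."""
    return prod(shape) * bytes_per_element


# Shape reported in the ResourceExhaustedError message above.
shape = (4, 64, 1080, 2048)
size = tensor_bytes(shape)
print(f"{size} bytes = {size / 2**30:.2f} GiB")
```

Under that assumption this one allocation is about 2.11 GiB, and training needs many such tensors at once, so running out of GPU memory at this resolution is not surprising.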

Morganh · November 29, 2019, 3:47am

OK, please paste your full log, along with the command you ran, into your other topic: https://devtalk.nvidia.com/default/topic/1067405/transfer-learning-toolkit/resourceexhaustederror-oom-when-allocating-tensor-with-shape-4-64-1080-2048-and-type-float-on-j-/

Let’s track the OOM issue in that link. Thanks.

samjith888 · November 29, 2019, 4:23am

I have pasted it there. Please have a look.

samjith888 · November 29, 2019, 6:31am

Morganh:

Hi samjith888,
The output_image_width and output_image_height settings in the training config file should exactly match the resolution of your training dataset.
You mentioned that the images used for training have a resolution of 4096*2160.
But your training config file is set as below, which is not expected.
augmentation_config {
  preprocessing {
    output_image_width: 768
    output_image_height: 768

Are you sure that the output_image_width and output_image_height values have to be replaced with my original training image resolution (4096*2160)?
I have gone through your answer at https://devtalk.nvidia.com/default/topic/1067151/transfer-learning-toolkit/understanding-parameters-of-training-config/post/5405248/#5405248, where you mentioned the same fields with a resized image resolution.

Morganh · November 30, 2019, 12:35am

Yes, if you resize the original dataset to 768*768, it is correct to set 768*768 in the training spec.
But I saw that your label file has not been changed accordingly. Its bounding boxes need to be resized too.
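
Rescaling the boxes does not require relabeling by hand. Below is a minimal sketch, assuming KITTI-style lines where fields 5 to 8 are xmin, ymin, xmax, ymax in pixels; `scale_kitti_line` is a hypothetical helper, not a TLT tool. The images themselves would have to be resized to the same target size separately, and squashing 4096*2160 into 768*768 changes the aspect ratio.

```python
def scale_kitti_line(line, src_w, src_h, dst_w, dst_h):
    """Rescale the bbox of one KITTI-style label line from (src_w, src_h) to (dst_w, dst_h).

    Hypothetical helper for illustration; assumes 15 whitespace-separated fields
    with xmin, ymin, xmax, ymax at zero-based positions 4 to 7.
    """
    fields = line.split()
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    xmin, ymin, xmax, ymax = (float(v) for v in fields[4:8])
    # x coordinates scale with the width ratio, y coordinates with the height ratio.
    fields[4] = f"{xmin * dst_w / src_w:.2f}"
    fields[5] = f"{ymin * dst_h / src_h:.2f}"
    fields[6] = f"{xmax * dst_w / src_w:.2f}"
    fields[7] = f"{ymax * dst_h / src_h:.2f}"
    return " ".join(fields)


# The label line from the question, mapped from 4096*2160 to 768*768.
original = "Fl 0.00 0.00 0 901.03808 635.71608 3158.048768 2160.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00"
print(scale_kitti_line(original, 4096, 2160, 768, 768))
```

Applying this to every line of every label file, and resizing the images to match, keeps the labels consistent with a 768*768 training spec.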

samjith888 · November 30, 2019, 2:27am

I don’t want to resize, because it is too difficult to create new label files for every image again.
Please help me with the following issue:
https://devtalk.nvidia.com/default/topic/1067405/transfer-learning-toolkit/-w-tensorflow-core-framework-op_kernel-cc-1401-op_requires-failed-at-pack_op-cc-88-resource-exhausted-oom-when-allocating-tensor-with-shape-32-3-2160-4096-and-type-float-on-job-localhost-replica-0-task-0-device-gpu-0-by-allocator-gpu_0_bfc/

Morganh · November 30, 2019, 3:20am

OK, if you do not change the labels, you must set 4096*2160 in the training spec before training.

samjith888 · November 30, 2019, 7:38am

When I set 4096*2160, I got a resource exhausted error, as mentioned in the link below:

https://devtalk.nvidia.com/default/topic/1067405/transfer-learning-toolkit/-w-tensorflow-core-framework-op_kernel-cc-1401-op_requires-failed-at-pack_op-cc-88-resource-exhausted-oom-when-allocating-tensor-with-shape-32-3-2160-4096-and-type-float-on-job-localhost-replica-0-task-0-device-gpu-0-by-allocator-gpu_0_bfc/

Morganh · December 5, 2019, 2:56am

Let’s close this topic and check https://devtalk.nvidia.com/default/topic/1067405/transfer-learning-toolkit/-w-tensorflow-core-framework-op_kernel-cc-1401-op_requires-failed-at-pack_op-cc-88-resource-exhausted-oom-when-allocating-tensor-with-shape-32-3-2160-4096-and-type-float-on-job-localhost-replica-0-task-0-device-gpu-0-by-allocator-gpu_0_bfc/ instead.
