tlt-train using DetectNet_V2: getting 0 average precision for every class

INFO:tensorflow:Saving checkpoints for step-19550.
2019-11-28 10:25:49,337 [INFO] tensorflow: Saving checkpoints for step-19550.
2019-11-28 10:25:50,819 [INFO] iva.detectnet_v2.evaluation.evaluation: step 0 / 97, 0.00s/step
2019-11-28 10:25:53,518 [INFO] iva.detectnet_v2.evaluation.evaluation: step 10 / 97, 0.27s/step
2019-11-28 10:25:56,181 [INFO] iva.detectnet_v2.evaluation.evaluation: step 20 / 97, 0.27s/step
2019-11-28 10:25:58,780 [INFO] iva.detectnet_v2.evaluation.evaluation: step 30 / 97, 0.26s/step
2019-11-28 10:26:01,464 [INFO] iva.detectnet_v2.evaluation.evaluation: step 40 / 97, 0.27s/step
2019-11-28 10:26:04,098 [INFO] iva.detectnet_v2.evaluation.evaluation: step 50 / 97, 0.26s/step
2019-11-28 10:26:06,764 [INFO] iva.detectnet_v2.evaluation.evaluation: step 60 / 97, 0.27s/step
2019-11-28 10:26:09,330 [INFO] iva.detectnet_v2.evaluation.evaluation: step 70 / 97, 0.26s/step
2019-11-28 10:26:11,954 [INFO] iva.detectnet_v2.evaluation.evaluation: step 80 / 97, 0.26s/step
2019-11-28 10:26:14,593 [INFO] iva.detectnet_v2.evaluation.evaluation: step 90 / 97, 0.26s/step
Epoch 50/120
=========================

Validation cost: -0.000009
Mean average_precision (in %): 0.0000

class name                    average precision (in %)
--------------------------  --------------------------
Cl                                   0
Fl                                   0
Ladders                              0
Plat                                 0
Stac                                 0
Stalls                               0
Sp                                   0

Median Inference Time: 0.065659
Epoch 55/120
=========================

Validation cost: -0.000009
Mean average_precision (in %): 0.0000

class name                    average precision (in %)
--------------------------  --------------------------
Cl                                  0
Fl                                  0
Ladders                             0
Plat                                0
Stac                                0
Stalls                              0
Sp                                  0

Please check the attached train config file.

The images used for training are high resolution (4096*2160).
Each line in label.txt looks like the following:

Fl 0.00 0.00 0 901.03808 635.71608 3158.048768 2160.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00

augmentation_config {
  preprocessing {
    output_image_width: 768
    output_image_height: 768
    min_bbox_width: 2.0
    min_bbox_height: 2.0
    output_image_channel: 3
  }
train_config.txt (9.2 KB)

Hi
I found several possible culprits.

  1. Your label.txt is not in the expected format. See https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#label_file : each object line must contain 15 elements in total.
    Did you generate the tfrecord files successfully with “tlt-dataset-convert”?

  2. Your attached training config file does not exactly match what you described.
    In your attachment:

output_image_width: 768
output_image_height: 768

Could you attach the correct config file?
Also, please see https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#augmentation_module for how to set these values.

  3. The class names in your config file do not match the class names in your mAP output (Cl, Fl, etc.):
target_classes {
    name: "xxx"

Please double-check that the config file is the correct one.
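
As a quick way to verify the first point, here is a minimal sketch that flags label lines whose field count is not 15. It assumes whitespace-separated fields and class names without spaces; the function name `find_bad_label_lines` is hypothetical and not part of TLT.

```python
KITTI_FIELD_COUNT = 15  # class name plus 14 numeric values per object line


def find_bad_label_lines(label_text):
    """Return (line_number, field_count) for each non-empty line without 15 fields."""
    bad = []
    for number, line in enumerate(label_text.splitlines(), start=1):
        fields = line.split()
        # Skip blank lines; report any object line with the wrong field count.
        if fields and len(fields) != KITTI_FIELD_COUNT:
            bad.append((number, len(fields)))
    return bad


# The sample line from the question has exactly 15 fields.
sample = "Fl 0.00 0.00 0 901.03808 635.71608 3158.048768 2160.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00"
print(find_bad_label_lines(sample))
```

Running this over each label file before tlt-dataset-convert should make malformed lines easy to spot.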

Hello,

Actually, the mAP class names are the same as in the config file (I just shortened the class names while editing this question). Also, output_image_width: 768 and output_image_height: 768 are exactly as in the config file. I’m still getting 0 for average precision.

Note: I have edited the question. The label.txt files used for training do have 15 fields, and the class names are the same as in the config file.

Hi samjith888,
The output_image_width and output_image_height settings in the training config file must exactly match the resolution of your training dataset.
You mentioned that the images used for training have a resolution of 4096*2160,
but your training config file is set as below, which is not expected.

augmentation_config {
  preprocessing {
    output_image_width: 768
    output_image_height: 768

I’m getting the following error when I replace the values in the augmentation config with my input image resolution.

ResourceExhaustedError : OOM when allocating tensor with shape[4,64,1080,2048] and type float on /j…
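
To get a feel for why this allocation fails, here is a rough sketch of the size of that single tensor. It assumes "float" means 32-bit floats (4 bytes per element); the helper name `tensor_bytes` is hypothetical. The shape appears to be batch 4, 64 channels, and a feature map at half the 4096*2160 input resolution, but that reading is a guess from the numbers alone.

```python
from math import prod


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor, assuming 4-byte (float32) elements."""
    return prod(shape) * bytes_per_element


shape = (4, 64, 1080, 2048)  # shape reported in the OOM message
size = tensor_bytes(shape)
print(f"{size} bytes = {size / 2**30:.2f} GiB")  # 2264924160 bytes = 2.11 GiB
```

A network holds many such activations at once during training, so one tensor of roughly 2 GiB already suggests that full-resolution input will not fit on a typical GPU without a smaller batch size or smaller images.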

OK, please paste your full log, along with the command you ran, into your other topic https://devtalk.nvidia.com/default/topic/1067405/transfer-learning-toolkit/resourceexhaustederror-oom-when-allocating-tensor-with-shape-4-64-1080-2048-and-type-float-on-j-/

Let’s track the OOM issue in that link. Thanks.

I have pasted it there. Please take a look.

Are you sure that the output_image_width and output_image_height values have to be replaced with my original training image resolution (4096*2160)?
I have gone through your answer at https://devtalk.nvidia.com/default/topic/1067151/transfer-learning-toolkit/understanding-parameters-of-training-config/post/5405248/#5405248 , where you mentioned setting the same fields to the resized image resolution.

Yes, if you resize the original dataset to 768*768, it is correct to set 768*768 in the training spec.
But I saw that your label file has not been changed accordingly. Its bboxes need to be resized too.
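
Rescaling the boxes does not require relabeling by hand. Below is a minimal sketch that scales the four bbox coordinates of one KITTI-style label line. It assumes fields 5 to 8 are xmin, ymin, xmax, ymax, that the images are resized separately to the same target size, and that a plain (non-aspect-preserving) resize is used; `scale_kitti_line` is a hypothetical helper, not a TLT tool.

```python
def scale_kitti_line(line, src_size, dst_size):
    """Scale the bbox of one KITTI label line from src (w, h) to dst (w, h)."""
    fields = line.split()
    sx = dst_size[0] / src_size[0]  # horizontal scale factor
    sy = dst_size[1] / src_size[1]  # vertical scale factor
    # Fields at index 4..7 hold xmin, ymin, xmax, ymax in pixels.
    xmin, ymin, xmax, ymax = (float(v) for v in fields[4:8])
    scaled = [xmin * sx, ymin * sy, xmax * sx, ymax * sy]
    fields[4:8] = [f"{v:.2f}" for v in scaled]
    return " ".join(fields)


line = "Fl 0.00 0.00 0 901.03808 635.71608 3158.048768 2160.0 0.00 0.00 0.00 0.00 0.00 0.00 0.00"
print(scale_kitti_line(line, src_size=(4096, 2160), dst_size=(768, 768)))
```

Note that going from 4096*2160 to 768*768 changes the aspect ratio, so objects are squeezed horizontally; a target size with the original aspect ratio may be a better choice.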

I didn’t want to resize, because it is too difficult to create new label files for every image again.
Please help me with the following issue:
https://devtalk.nvidia.com/default/topic/1067405/transfer-learning-toolkit/-w-tensorflow-core-framework-op_kernel-cc-1401-op_requires-failed-at-pack_op-cc-88-resource-exhausted-oom-when-allocating-tensor-with-shape-32-3-2160-4096-and-type-float-on-job-localhost-replica-0-task-0-device-gpu-0-by-allocator-gpu_0_bfc/

OK, if you do not change the labels, you must set 4096*2160 in the training spec before training.

When I set 4096*2160, I got a resource exhausted error, as I mentioned in the link below:

https://devtalk.nvidia.com/default/topic/1067405/transfer-learning-toolkit/-w-tensorflow-core-framework-op_kernel-cc-1401-op_requires-failed-at-pack_op-cc-88-resource-exhausted-oom-when-allocating-tensor-with-shape-32-3-2160-4096-and-type-float-on-job-localhost-replica-0-task-0-device-gpu-0-by-allocator-gpu_0_bfc/

Let’s close this topic and check https://devtalk.nvidia.com/default/topic/1067405/transfer-learning-toolkit/-w-tensorflow-core-framework-op_kernel-cc-1401-op_requires-failed-at-pack_op-cc-88-resource-exhausted-oom-when-allocating-tensor-with-shape-32-3-2160-4096-and-type-float-on-job-localhost-replica-0-task-0-device-gpu-0-by-allocator-gpu_0_bfc/ instead.