Getting 0 mAP for detectnet_v2 model over 150 epochs

I am trying to train the FLIR thermal dataset on the detectnet_v2 model, but even after 150 epochs the result is 0 mAP.
These are my system specifications:

Device: Ubuntu 22.04, CUDA 12.4, GPU GeForce RTX 4090, 251.5 GB memory
Image resolution: 640x512

Sample Label file:
person 0 0 0 32.0 229.0 54.0 284.0 0 0 0 0 0 0 0
car 0 0 0 174.0 225.0 213.0 255.0 0 0 0 0 0 0 0
car 0 0 0 218.0 221.0 243.0 243.0 0 0 0 0 0 0 0
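(As a sanity check, labels like these can be validated against the 640x512 frame to rule out malformed or out-of-bounds boxes, which is a common cause of 0 mAP. A minimal sketch, where the label directory path is a placeholder:)

import glob
import os

IMG_W, IMG_H = 640, 512  # FLIR thermal image resolution
LABEL_DIR = "/datasets/infrared/train_80/labels"  # placeholder path

for path in glob.glob(os.path.join(LABEL_DIR, "*.txt")):
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            fields = line.split()
            # KITTI detection labels have 15 whitespace-separated fields
            if len(fields) != 15:
                print(f"{path}:{lineno}: expected 15 KITTI fields, got {len(fields)}")
                continue
            cls = fields[0]
            # fields 4..7 are xmin, ymin, xmax, ymax in pixels
            xmin, ymin, xmax, ymax = map(float, fields[4:8])
            if not (0 <= xmin < xmax <= IMG_W and 0 <= ymin < ymax <= IMG_H):
                print(f"{path}:{lineno}: box ({xmin}, {ymin}, {xmax}, {ymax}) out of bounds for class {cls}")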

These are my tfrecords:
2024-12-17 04:39:25,427 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 1
2024-12-17 04:39:25,504 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 2
2024-12-17 04:39:25,580 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 3
2024-12-17 04:39:25,652 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 4
2024-12-17 04:39:25,730 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 5
2024-12-17 04:39:25,808 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 6
2024-12-17 04:39:25,884 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 7
2024-12-17 04:39:25,964 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 8
2024-12-17 04:39:26,049 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 0, shard 9
2024-12-17 04:39:26,129 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 250:
Wrote the following numbers of objects:
b'car': 15148
b'person': 5002
b'bicycle': 2624
b'dog': 20

2024-12-17 04:39:26,129 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 0
2024-12-17 04:39:26,449 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 1
2024-12-17 04:39:26,766 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 2
2024-12-17 04:39:27,090 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 3
2024-12-17 04:39:27,407 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 4
2024-12-17 04:39:27,720 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 5
2024-12-17 04:39:28,041 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 6
2024-12-17 04:39:28,368 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 7
2024-12-17 04:39:28,688 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 8
2024-12-17 04:39:28,999 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 166: Writing partition 1, shard 9
2024-12-17 04:39:29,310 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 250:
Wrote the following numbers of objects:
b'car': 89038
b'person': 51488
b'bicycle': 7168
b'dog': 522

2024-12-17 04:39:29,310 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 89: Cumulative object statistics
2024-12-17 04:39:29,310 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 250:
Wrote the following numbers of objects:
b'car': 104186
b'person': 56490
b'bicycle': 9792
b'dog': 542

2024-12-17 04:39:29,310 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 105: Class map.
Label in GT: Label in tfrecords file
b'car': b'car'
b'person': b'person'
b'bicycle': b'bicycle'
b'dog': b'dog'
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2024-12-17 04:39:29,310 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.dataio.dataset_converter_lib 114: Tfrecords generation complete.
Execution status: PASS
2024-12-17 10:09:35,834 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

This is my training spec file:
I am training for just one class here; I also tried training for all of the classes, but the results were the same.

random_seed: 42

dataset_config {
data_sources {
tfrecords_path: "/datasets/infrared/train_80/tfrecords/*"
image_directory_path: "/datasets/infrared/train_80"
}
image_extension: "jpg"

target_class_mapping {
key: "person"
value: "person"
}

#validation_fold: 0
validation_data_source {
tfrecords_path: "/datasets/infrared/test_set/tfrecords/*"
image_directory_path: "/datasets/infrared/test_set"
}
}

augmentation_config {
preprocessing {
output_image_width: 640
output_image_height: 512
crop_right: 640
crop_bottom: 512
min_bbox_width: 1.0
min_bbox_height: 1.0
output_image_channel: 3
}

spatial_augmentation {
hflip_probability: 0.5
zoom_min: 1.0
zoom_max: 1.0
translate_max_x: 8.0
translate_max_y: 8.0
}

color_augmentation {
hue_rotation_max: 25.0
saturation_shift_max: 0.20000000298
contrast_scale_max: 0.10000000149
contrast_center: 0.5
}
}

postprocessing_config {
target_class_config {
key: "person"
value {
clustering_config {
coverage_threshold: 0.00499999988824
dbscan_eps: 0.20000000298
dbscan_min_samples: 1
minimum_bounding_box_height: 4
}
}
}
}

model_config {
pretrained_model_file: "/workspace/tao-experiments/detectnet_v2/pretrained_resnet18/pretrained_detectnet_v2_vresnet18/resnet18.hdf5"
num_layers: 18
use_batch_norm: true
objective_set {
bbox {
scale: 35.0
offset: 0.5
}
cov {
}
}
training_precision {
backend_floatx: FLOAT32
}
arch: "resnet"
}

evaluation_config {
validation_period_during_training: 5
first_validation_epoch: 5
minimum_detection_ground_truth_overlap {
key: "person"
value: 0.5
}

evaluation_box_config {
key: "person"
value {
minimum_height: 4
maximum_height: 9999
minimum_width: 4
maximum_width: 9999
}
}

average_precision_mode: INTEGRATE
}

cost_function_config {
target_classes {
name: "person"
class_weight: 1.0
coverage_foreground_weight: 0.05

objectives {
  name: "cov"
  initial_weight: 1.0
  weight_target: 1.0
}

objectives {
  name: "bbox"
  initial_weight: 10.0
  weight_target: 10.0
}

}

enable_autoweighting: true
max_objective_weight: 0.9999
min_objective_weight: 0.0001
}

training_config {
batch_size_per_gpu: 32
num_epochs: 150

learning_rate {
soft_start_annealing_schedule {
min_learning_rate: 1e-05
max_learning_rate: 1e-03
soft_start: 0.0
annealing: 0.2
}
}

regularizer {
type: L1
weight: 1e-5
}

optimizer {
adam {
epsilon: 9.99999993923e-09
beta1: 0.899999976158
beta2: 0.999000012875
}
}

cost_scaling {
initial_exponent: 20.0
increment: 0.005
decrement: 1.0
}

checkpoint_interval: 10
}

bbox_rasterizer_config {
target_class_config {
key: "person"
value {
cov_center_x: 0.5
cov_center_y: 0.5
cov_radius_x: 0.4
cov_radius_y: 0.4
bbox_min_radius: 1.0
}
}
deadzone_radius: 0.4
}

Training log:
Epoch 150/150

Validation cost: 0.000789
Mean average_precision (in %): 0.0000

+------------+--------------------------+
| class name | average precision (in %) |
+------------+--------------------------+
| person     | 0.0                      |
+------------+--------------------------+

Median Inference Time: 0.002862
2024-12-16 16:25:47,580 [TAO Toolkit] [INFO] root 2102: Evaluation metrics generated.
2024-12-16 16:25:47,580 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 39.115
2024-12-16 16:25:47,580 [TAO Toolkit] [INFO] root 2102: Training loop completed.
2024-12-16 16:25:47,580 [TAO Toolkit] [INFO] root 2102: Saving trained model.
2024-12-16 16:25:47,777 [TAO Toolkit] [INFO] root 2102: Model saved.

What is the issue exactly?

Is it AP=0 for all epochs? Please share the full log. You can upload it as a txt file.
You can set a lower batch size and retry.
Also, please check if you can run the notebook successfully: tao_tutorials/notebooks/tao_launcher_starter_kit/detectnet_v2/detectnet_v2.ipynb at main · NVIDIA/tao_tutorials · GitHub.

Currently I don't have the log file, but yes, it was AP=0 for all epochs.
Okay, I will retry with a lower batch size.
Yes, the notebook runs successfully.

Yes, please. And please monitor if the loss keeps decreasing.

I kept all the configurations the same as mentioned above and started training for 300 epochs, and from the 200th epoch I started getting AP results!

But when I tried training on the VisDrone dataset, I faced the same issue even though my batch size is 8.

Device: Ubuntu 22.04, CUDA 12.4, GPU GeForce RTX 4090, 251.5 GB memory
Image resolution: 1920x1080

training_spec.txt (7.5 KB)
training_log.txt (4.1 MB)
tfrecords_log.txt (8.5 KB)

Could you please pull the nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 docker and train again? There is a regression issue for detectnet_v2 in the 5.0.1 docker.
$ docker pull nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5
Then inside the docker, run training as below.
$ docker run --runtime=nvidia -it --rm -d --name 4.0.1-docker -v /localhome/morganh:/localhome/morganh nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash
$ docker exec -it 4.0.1-docker /bin/bash
# detectnet_v2 train xxx

OK, I will try again with this image:
$ docker pull nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5

May I know if it is training well in the 4.0.1 docker now?

For the 5.0 docker, if possible, can you change the training spec to the following and retry?
output_image_width: 1920
output_image_height: 1080
enable_auto_resize: true

Yes, it trained well on the 4.0.1 docker.
I will try your suggestion too.


2024-12-24 09:38:52,044 [TAO Toolkit] [INFO] root 2102: Starting DetectNet_v2 Training job
2024-12-24 09:38:52,044 [TAO Toolkit] [INFO] main 817: Loading experiment spec at /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt.
2024-12-24 09:38:52,045 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.spec_handler.spec_loader 113: Merging specification from /workspace/tao-experiments/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt
2024-12-24 09:38:52,049 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.spec_handler.spec_loader 78: Spec file validation failed.
Experiment Spec Setting Error: output_image_height should % 16. Wrong value: 1080
2024-12-24 09:38:52,049 [TAO Toolkit] [INFO] main 1032: Training was interrupted.
2024-12-24 09:38:52,049 [TAO Toolkit] [INFO] root 2102: Training was interrupted
Time taken to run main:main: 0:00:00.353003.
Execution status: PASS
2024-12-24 15:09:14,677 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

This error occurred.

Could you please set output_image_height: 1088 and retry?
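(For reference, DetectNet_v2 requires output_image_width and output_image_height to be multiples of 16, which is why 1080 is rejected. A quick sketch for deriving the nearest valid size:)

import math

def round_up_to_multiple_of_16(value):
    # DetectNet_v2 expects width/height divisible by 16
    return math.ceil(value / 16) * 16

print(round_up_to_multiple_of_16(1920))  # 1920 (already valid)
print(round_up_to_multiple_of_16(1080))  # 1088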

Okay, I will try.

docker run --runtime=nvidia -it --rm -d --name 4.0.1-docker -v /localhome/morganh:/localhome/morganh nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 /bin/bash

is not working for me; there is no nvidia runtime.
Error: docker: Error response from daemon: unknown or invalid runtime name: nvidia.

Please install nvidia-docker.

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd