My configs for TFRecord conversion and training:
config_train.txt (9.6 KB)
convert_data_val.txt (797 Bytes)
convert_data_train.txt (804 Bytes)
I’m training detectnet_v2 on a custom dataset with 7 classes, and the mAP is very low or even 0.
{"face": 2.5185, "person": 6.9118, "bicycle": 0.0, "car": 1.9527, "motorcycle": 0.0, "bus": 0.0, "truck": 0.0}
I’ve checked and adjusted my config, but it hasn’t worked out. Please help me. Thanks a lot!
The number of labels (7 classes) in the train folder:
b’car’: 621051
b’person’: 703912
b’motorcycle’: 199312
b’bus’: 20613
b’face’: 360910
b’truck’: 47091
b’bicycle’: 13892
Could you share the training log? Thanks.
log_train.txt (22.0 KB)
I converted the data from KITTI format to TFRecords in 10 shards.
data_sources: {
  tfrecords_path: "/detectnet_v2/dataset/tfrecord/train_v1/-fold-000-of-002-shard-00000-of-00010"
  image_directory_path: "/detectnet_v2/dataset"
}
Should I change tfrecords_path: "/detectnet_v2/dataset/tfrecord/train_v1/-fold-000-of-002-shard-00000-of-00010" to "/detectnet_v2/dataset/tfrecord/train_v1/*"?
The log does not contain info about how many images are used during training.
Yes, please include all the tfrecords files. You can find the detailed log when you run training again.
More info can be found in tao_tutorials/notebooks/tao_launcher_starter_kit/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt at main · NVIDIA/tao_tutorials · GitHub.
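For example, after that change the data_sources block would pick up all ten shards (same paths as above):

data_sources: {
  tfrecords_path: "/detectnet_v2/dataset/tfrecord/train_v1/*"
  image_directory_path: "/detectnet_v2/dataset"
}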
I changed the training config, and the problems may be solved (the loss has decreased). I appreciate your help. Thank you so much!
I’m training with more than 300k images; the label counts for each class are:
b’car’: 621051
b’person’: 703912
b’motorcycle’: 199312
b’bus’: 20613
b’face’: 360910
b’truck’: 47091
b’bicycle’: 13892
Can you check my training config and suggest some changes for my training?
Do your training images have different resolutions? If yes, can you estimate the average resolution of your training images?
Yes, my training images have different resolutions. The average resolution is 1156 x 678 (W x H) across 315058 images.
Could you share the latest training spec file? And also can you share the full training log?
experiment_spec.txt (9.5 KB)
training_log_v5.txt (7.1 KB)
My training hasn’t finished yet, but the loss isn’t decreasing. I think with a large dataset, the mAP of each class should not be 0 after the first epoch. Is my training config wrong?
Do you have the log shown in the terminal? It has more detailed info.
2024-01-31 03:16:41,320 [INFO] tensorflow: epoch = 6.538391224862888, learning_rate = 0.00021555979, loss = 8.305657e-05, step = 128754 (6.339 sec)
2024-01-31 03:16:47,633 [INFO] tensorflow: epoch = 6.5387466991671745, learning_rate = 0.00021560378, loss = 8.3158775e-05, step = 128761 (6.313 sec)
2024-01-31 03:16:53,861 [INFO] tensorflow: epoch = 6.539102173471461, learning_rate = 0.000215648, loss = 9.719213e-05, step = 128768 (6.228 sec)
2024-01-31 03:16:59,251 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.825
2024-01-31 03:17:00,143 [INFO] tensorflow: epoch = 6.5394576477757465, learning_rate = 0.00021569223, loss = 0.00013159464, step = 128775 (6.282 sec)
2024-01-31 03:17:06,486 [INFO] tensorflow: epoch = 6.539813122080033, learning_rate = 0.00021573625, loss = 9.598676e-05, step = 128782 (6.343 sec)
2024-01-31 03:17:12,798 [INFO] tensorflow: epoch = 6.540168596384318, learning_rate = 0.0002157805, loss = 0.00014440104, step = 128789 (6.312 sec)
2024-01-31 03:17:19,140 [INFO] tensorflow: epoch = 6.540524070688605, learning_rate = 0.00021582452, loss = 0.00011894806, step = 128796 (6.342 sec)
2024-01-31 03:17:21,820 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.724
2024-01-31 03:17:25,460 [INFO] tensorflow: epoch = 6.54087954499289, learning_rate = 0.00021586879, loss = 0.00010047098, step = 128803 (6.320 sec)
2024-01-31 03:17:31,773 [INFO] tensorflow: epoch = 6.541235019297177, learning_rate = 0.00021591285, loss = 9.457305e-05, step = 128810 (6.314 sec)
2024-01-31 03:17:38,107 [INFO] tensorflow: epoch = 6.541590493601462, learning_rate = 0.00021595712, loss = 8.963511e-05, step = 128817 (6.334 sec)
2024-01-31 03:17:44,750 [INFO] tensorflow: epoch = 6.541945967905749, learning_rate = 0.00021600141, loss = 0.00011299775, step = 128824 (6.643 sec)
2024-01-31 03:17:44,751 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.445
2024-01-31 03:17:51,051 [INFO] tensorflow: epoch = 6.542301442210035, learning_rate = 0.0002160455, loss = 0.00011047623, step = 128831 (6.301 sec)
2024-01-31 03:17:57,394 [INFO] tensorflow: epoch = 6.5426569165143205, learning_rate = 0.00021608958, loss = 9.89228e-05, step = 128838 (6.343 sec)
2024-01-31 03:18:03,691 [INFO] tensorflow: epoch = 6.543012390818607, learning_rate = 0.0002161339, loss = 9.39452e-05, step = 128845 (6.297 sec)
2024-01-31 03:18:07,314 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.729
2024-01-31 03:18:10,014 [INFO] tensorflow: epoch = 6.543367865122892, learning_rate = 0.00021617822, loss = 0.00011528027, step = 128852 (6.323 sec)
2024-01-31 03:18:16,348 [INFO] tensorflow: epoch = 6.543723339427179, learning_rate = 0.00021622235, loss = 0.00011065371, step = 128859 (6.334 sec)
2024-01-31 03:18:22,616 [INFO] tensorflow: epoch = 6.544078813731464, learning_rate = 0.0002162667, loss = 7.3142364e-05, step = 128866 (6.268 sec)
I could only copy a part of the terminal log.
Can you share some of the log from the beginning? You can upload it as a .txt file.
The log will show how many images are being used for training.
I’m training with the nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3 docker. My training log only shows:
training_log_v5.txt (7.1 KB)
Is the problem caused by my docker version?
Please use the latest one for TF1.
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
BTW, to avoid errors, please set dbscan_min_samples: 1. Refer to tao_tutorials/notebooks/tao_launcher_starter_kit/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt at main · NVIDIA/tao_tutorials · GitHub.
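For reference, dbscan_min_samples sits inside each class’s clustering_config under postprocessing_config. A minimal sketch for one class; the other threshold values here are placeholders, not tuned recommendations:

postprocessing_config {
  target_class_config {
    key: "car"
    value {
      clustering_config {
        coverage_threshold: 0.005        # placeholder threshold
        dbscan_eps: 0.15                 # placeholder clustering radius
        dbscan_min_samples: 1            # the setting recommended above
        minimum_bounding_box_height: 20  # placeholder minimum box height
      }
    }
  }
}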
I’m trying this latest docker. Can you suggest some changes to my training config above to improve mAP?
You can try:
- Change to output_image_width: 1152 and output_image_height: 672 (multiples of 16 close to your 1156 x 678 average).
- Delete all_projections: true.
- Delete freeze_blocks: 0.0 and freeze_blocks: 1.0.
- Change class_weight for the different classes. It can be set lower if the corresponding class has a relatively larger dataset, and set larger if the corresponding class has a relatively smaller dataset; see the sketch after this list.
- Also, please check whether the objects are small. If yes, please refer to "In DetectNet_V2, are there any parameters that can help improve AP (average precision) on training small objects?" in Frequently Asked Questions - NVIDIA Docs.
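To make the class_weight suggestion concrete, here is a minimal sketch of a cost_function_config fragment. The weights are illustrative values chosen roughly inversely to the label counts above, not tuned recommendations; the objectives values mirror the reference spec:

cost_function_config {
  target_classes {
    name: "person"             # ~704k labels -> relatively low weight
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives { name: "cov" initial_weight: 1.0 weight_target: 1.0 }
    objectives { name: "bbox" initial_weight: 10.0 weight_target: 10.0 }
  }
  target_classes {
    name: "bicycle"            # ~14k labels -> relatively high weight
    class_weight: 8.0
    coverage_foreground_weight: 0.05
    objectives { name: "cov" initial_weight: 1.0 weight_target: 1.0 }
    objectives { name: "bbox" initial_weight: 10.0 weight_target: 10.0 }
  }
  # ...and likewise for the remaining five classes
}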
The exported checkpoint is .hdf5. How do I convert it to .tlt for inference?
Please use the .hdf5 file as the pretrained model instead.
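For example, assuming a resnet18 backbone as in the reference spec, model_config can point pretrained_model_file at the .hdf5 checkpoint directly (the path below is hypothetical):

model_config {
  arch: "resnet"
  num_layers: 18
  pretrained_model_file: "/workspace/detectnet_v2/weights/model.hdf5"  # hypothetical path to your checkpoint
}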