My configs for TFRecord conversion and training:
config_train.txt (9.6 KB)
convert_data_val.txt (797 Bytes)
convert_data_train.txt (804 Bytes)
I’m training detectnet_v2 on a custom dataset with 7 classes, and the mAP is very low or even 0.
{"face": 2.5185, "person": 6.9118, "bicycle": 0.0, "car": 1.9527, "motorcycle": 0.0, "bus": 0.0, "truck": 0.0}
I’ve checked and adjusted my config, but it hasn’t worked out. Please help me. Thanks a lot!
The number of labels (7 classes) in the train folder:
b’car’: 621051
b’person’: 703912
b’motorcycle’: 199312
b’bus’: 20613
b’face’: 360910
b’truck’: 47091
b’bicycle’: 13892
Could you share the training log? Thanks.
log_train.txt (22.0 KB)
I converted the data from KITTI format to TFRecords in 10 shards.
data_sources: {
  tfrecords_path: "/detectnet_v2/dataset/tfrecord/train_v1/-fold-000-of-002-shard-00000-of-00010"
  image_directory_path: "/detectnet_v2/dataset"
}
Should I change tfrecords_path: "/detectnet_v2/dataset/tfrecord/train_v1/-fold-000-of-002-shard-00000-of-00010" to "/detectnet_v2/dataset/tfrecord/train_v1/*"?
The log does not contain info about how many images are used during training.
Yes, please include all the tfrecords files. You can find the detailed log when you run training again.
More info can be found in tao_tutorials/notebooks/tao_launcher_starter_kit/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt at main · NVIDIA/tao_tutorials · GitHub.
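For example, after that change the data_sources block would pick up all ten shards (same paths as above):

data_sources: {
  tfrecords_path: "/detectnet_v2/dataset/tfrecord/train_v1/*"
  image_directory_path: "/detectnet_v2/dataset"
}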
I changed the training config, and the problems may be solved (the loss has decreased). I appreciate your help. Thank you so much!
I’m training with more than 300k images; the label counts for each class are:
b’car’: 621051
b’person’: 703912
b’motorcycle’: 199312
b’bus’: 20613
b’face’: 360910
b’truck’: 47091
b’bicycle’: 13892
Can you check my training config and suggest some changes for my training?
Do your training images have different resolutions? If yes, can you estimate the average resolution of your training images?
Yes, my training images have different resolutions. The average resolution is 1156 x 678 (W x H) across 315058 images.
Could you share the latest training spec file? And also can you share the full training log?
experiment_spec.txt (9.5 KB)
training_log_v5.txt (7.1 KB)
My training hasn’t finished yet, but the loss isn’t decreasing. I think with a large dataset, the mAP of each class should not be 0 after the first epoch. Is my training config wrong?
Do you have the log shown in the terminal? It has more detailed info.
2024-01-31 03:16:41,320 [INFO] tensorflow: epoch = 6.538391224862888, learning_rate = 0.00021555979, loss = 8.305657e-05, step = 128754 (6.339 sec)
2024-01-31 03:16:47,633 [INFO] tensorflow: epoch = 6.5387466991671745, learning_rate = 0.00021560378, loss = 8.3158775e-05, step = 128761 (6.313 sec)
2024-01-31 03:16:53,861 [INFO] tensorflow: epoch = 6.539102173471461, learning_rate = 0.000215648, loss = 9.719213e-05, step = 128768 (6.228 sec)
2024-01-31 03:16:59,251 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.825
2024-01-31 03:17:00,143 [INFO] tensorflow: epoch = 6.5394576477757465, learning_rate = 0.00021569223, loss = 0.00013159464, step = 128775 (6.282 sec)
2024-01-31 03:17:06,486 [INFO] tensorflow: epoch = 6.539813122080033, learning_rate = 0.00021573625, loss = 9.598676e-05, step = 128782 (6.343 sec)
2024-01-31 03:17:12,798 [INFO] tensorflow: epoch = 6.540168596384318, learning_rate = 0.0002157805, loss = 0.00014440104, step = 128789 (6.312 sec)
2024-01-31 03:17:19,140 [INFO] tensorflow: epoch = 6.540524070688605, learning_rate = 0.00021582452, loss = 0.00011894806, step = 128796 (6.342 sec)
2024-01-31 03:17:21,820 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.724
2024-01-31 03:17:25,460 [INFO] tensorflow: epoch = 6.54087954499289, learning_rate = 0.00021586879, loss = 0.00010047098, step = 128803 (6.320 sec)
2024-01-31 03:17:31,773 [INFO] tensorflow: epoch = 6.541235019297177, learning_rate = 0.00021591285, loss = 9.457305e-05, step = 128810 (6.314 sec)
2024-01-31 03:17:38,107 [INFO] tensorflow: epoch = 6.541590493601462, learning_rate = 0.00021595712, loss = 8.963511e-05, step = 128817 (6.334 sec)
2024-01-31 03:17:44,750 [INFO] tensorflow: epoch = 6.541945967905749, learning_rate = 0.00021600141, loss = 0.00011299775, step = 128824 (6.643 sec)
2024-01-31 03:17:44,751 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.445
2024-01-31 03:17:51,051 [INFO] tensorflow: epoch = 6.542301442210035, learning_rate = 0.0002160455, loss = 0.00011047623, step = 128831 (6.301 sec)
2024-01-31 03:17:57,394 [INFO] tensorflow: epoch = 6.5426569165143205, learning_rate = 0.00021608958, loss = 9.89228e-05, step = 128838 (6.343 sec)
2024-01-31 03:18:03,691 [INFO] tensorflow: epoch = 6.543012390818607, learning_rate = 0.0002161339, loss = 9.39452e-05, step = 128845 (6.297 sec)
2024-01-31 03:18:07,314 [INFO] modulus.hooks.sample_counter_hook: Train Samples / sec: 17.729
2024-01-31 03:18:10,014 [INFO] tensorflow: epoch = 6.543367865122892, learning_rate = 0.00021617822, loss = 0.00011528027, step = 128852 (6.323 sec)
2024-01-31 03:18:16,348 [INFO] tensorflow: epoch = 6.543723339427179, learning_rate = 0.00021622235, loss = 0.00011065371, step = 128859 (6.334 sec)
2024-01-31 03:18:22,616 [INFO] tensorflow: epoch = 6.544078813731464, learning_rate = 0.0002162667, loss = 7.3142364e-05, step = 128866 (6.268 sec)
I could only copy a part of the terminal log.
Can you share some of the log from the beginning? You can upload it as a .txt file.
The log will show how many images are being used for training.
I’m training with the nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3 docker. My training log only shows:
training_log_v5.txt (7.1 KB)
Is the problem caused by my docker version?
Please use the latest one for TF1.
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
BTW, to avoid errors, please set dbscan_min_samples: 1. Refer to tao_tutorials/notebooks/tao_launcher_starter_kit/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt at main · NVIDIA/tao_tutorials · GitHub.
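For reference, dbscan_min_samples sits inside each class’s clustering_config under postprocessing_config. A minimal sketch for one class; the other threshold values here are placeholders, not tuned recommendations:

postprocessing_config {
  target_class_config {
    key: "car"
    value {
      clustering_config {
        coverage_threshold: 0.005        # placeholder threshold
        dbscan_eps: 0.15                 # placeholder clustering radius
        dbscan_min_samples: 1            # the setting recommended above
        minimum_bounding_box_height: 20  # placeholder minimum box height
      }
    }
  }
}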
I’m trying this latest docker. Can you suggest some changes to my training config above to improve mAP?
You can try:
- Change to output_image_width: 1152 and output_image_height: 672 (multiples of 16 close to your 1156 x 678 average).
- Delete all_projections: true.
- Delete freeze_blocks: 0.0 and freeze_blocks: 1.0.
- Change class_weight for the different classes. It can be set lower if the corresponding class has a relatively larger dataset, and set larger if the corresponding class has a relatively smaller dataset; see the sketch after this list.
- Also, please check whether the objects are small. If yes, please refer to "In DetectNet_V2, are there any parameters that can help improve AP (average precision) on training small objects?" in Frequently Asked Questions - NVIDIA Docs.
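To make the class_weight suggestion concrete, here is a minimal sketch of a cost_function_config fragment. The weights are illustrative values chosen roughly inversely to the label counts above, not tuned recommendations; the objectives values mirror the reference spec:

cost_function_config {
  target_classes {
    name: "person"             # ~704k labels -> relatively low weight
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives { name: "cov" initial_weight: 1.0 weight_target: 1.0 }
    objectives { name: "bbox" initial_weight: 10.0 weight_target: 10.0 }
  }
  target_classes {
    name: "bicycle"            # ~14k labels -> relatively high weight
    class_weight: 8.0
    coverage_foreground_weight: 0.05
    objectives { name: "cov" initial_weight: 1.0 weight_target: 1.0 }
    objectives { name: "bbox" initial_weight: 10.0 weight_target: 10.0 }
  }
  # ...and likewise for the remaining five classes
}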
The exported checkpoint is .hdf5. How do I convert it to .tlt for inference?
Please use the .hdf5 file as the pretrained model instead.
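For example, assuming a resnet18 backbone as in the reference spec, model_config can point pretrained_model_file at the .hdf5 checkpoint directly (the path below is hypothetical):

model_config {
  arch: "resnet"
  num_layers: 18
  pretrained_model_file: "/workspace/detectnet_v2/weights/model.hdf5"  # hypothetical path to your checkpoint
}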