Give me some instructions to improve the mAP from 0.0%, which appeared when executing the Notebook of TAO-Toolkit-Whitepaper-use-cases

I tried to follow the Jupyter Notebook included in the GitHub repository for retraining PeopleNet to adapt it to IR images.

But the result showed 0.0% mAP.

Could you give me any instructions to achieve around 80% mAP, as below?

I attached Notebook and spec files which I used.

These are the points I modified when I executed the Notebook:

Changed the tao commands to correspond to TAO version 5.
Changed the repository name from which I downloaded the pretrained model.
Changed the value of "dbscan_min_samples" to "1" in the training spec file, per the information below (see the sketch after this list).

Also, I only executed the training process with 80% of the training dataset.
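For reference, this is the kind of change I mean for the dbscan_min_samples item — a minimal sketch of the postprocessing_config clustering block with dbscan_min_samples set to 1. The surrounding field names and values are only illustrative assumptions from a default DetectNet_v2/PeopleNet spec; my actual spec is in the attached training_spec.txt:

postprocessing_config {
  target_class_config {
    key: "person"
    value {
      clustering_config {
        # illustrative defaults; only dbscan_min_samples was changed
        coverage_threshold: 0.005
        dbscan_eps: 0.265
        dbscan_min_samples: 1
        minimum_bounding_box_height: 4
      }
    }
  }
}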

Best regards,
tao-v5_peoplenet_IR.zip (294.4 KB)
training_spec.txt (3.2 KB)
tfrecord_spec.txt (264 Bytes)

Could you set enable_auto_resize: true and retry? Refer to DetectNet_v2 - NVIDIA Docs.

Thank you for your reply.

Is the modification below correct? I modified training_spec.txt as follows:

augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    crop_right: 960
    crop_bottom: 544
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
  enable_auto_resize: true
}

An error message appeared and interrupted the training process with the modified spec file.

google.protobuf.text_format.ParseError: 42:3 : Message type "AugmentationConfig" has no field named "enable_auto_resize".
Execution status: FAIL

Please move enable_auto_resize into the preprocessing block instead:
preprocessing {
  output_image_width: 960
  output_image_height: 544
  crop_right: 960
  crop_bottom: 544
  min_bbox_width: 1.0
  min_bbox_height: 1.0
  output_image_channel: 3
  enable_auto_resize: true
}

Thank you for your reply.

It worked, but the mAP was 0.0153%.
It's still too low.
Do you have any other suggestions?

Best regards,

tao-v5_peoplenet_IR_20240405.zip (295.1 KB)

Will try on my side.
Could you try the settings below?

  batch_size_per_gpu: 4
  num_epochs: 60
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-07
      max_learning_rate: 5e-05
      soft_start: 0.10000000149
      annealing: 0.699999988079
    }
  }

Thank you for your kind support.

Will try on my side.

Thanks a lot. I look forward to your best solution.

BTW, I tried with the suggested values.
Then I got an improved mAP of 45.56%.

I'm wondering if you could explain to us why those values lead to this improvement?
I want to achieve a higher mAP, closer to 80%.
So I'd like to understand these values and get tips to improve the retraining.

Best regards,
tao-v5_peoplenet_IR_20240405_2.zip (328.9 KB)

I believe the GitHub notebook was trained with an old TAO version. So I suggest you use the TAO 4.0.1 version from TAO Toolkit | NVIDIA NGC instead.
$ docker run --runtime=nvidia -it --rm -v /your/local/folder:/docker/folder nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5
Then run the training.
# detectnet_v2 train xxx
It should work and give a similar result.
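For completeness, a sketch of the full train command inside the 4.0.1 container — the mount point matches the docker run example above, but the spec path, results directory, and model name are only illustrative placeholders (use your own values from the notebook):

# run inside the 4.0.1 container
detectnet_v2 train \
  -e /docker/folder/training_spec.txt \
  -r /docker/folder/results/peoplenet_ir_unpruned \
  -k tlt_encode \
  -n resnet34_peoplenet_ir \
  --gpus 1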

Glad to know it improved in TAO 5. The key is to tune the parameters so that the training loss shown during training decreases. For TAO 5.0, you can tune the batch size (bs), min_learning_rate, max_learning_rate, and the soft_start / annealing points.
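To make those knobs concrete, here is where they live in the spec — a minimal sketch of the training_config section with comments on what each value controls. The numbers repeat the suggestion above; other fields (regularizer, optimizer, cost scaling) are omitted for brevity:

training_config {
  batch_size_per_gpu: 4          # bs: larger batches smooth gradients but need more GPU memory
  num_epochs: 60
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-07   # learning rate at the very start and very end of training
      max_learning_rate: 5e-05   # peak learning rate reached after the warm-up (soft start) phase
      soft_start: 0.1            # fraction of training spent ramping up from min to max learning rate
      annealing: 0.7             # fraction of training after which the rate decays back toward the minimum
    }
  }
}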

Thank you for your kind support.

I tried retraining PeopleNet with the TAO 4.0.1 container, and I achieved 90% mAP.
I used 80% of the FLIR 1.3 dataset, and did pruning and retraining after that (sketched below).
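For reference, the pruning and retraining followed the usual DetectNet_v2 flow, run inside the 4.0.1 container. This is only a rough sketch with illustrative paths and an illustrative pruning threshold, not my exact values:

# prune the trained model (threshold is illustrative)
detectnet_v2 prune \
  -m /docker/folder/results/peoplenet_ir_unpruned/weights/resnet34_peoplenet_ir.tlt \
  -o /docker/folder/results/peoplenet_ir_pruned/resnet34_peoplenet_ir_pruned.tlt \
  -eq union \
  -pth 0.2 \
  -k tlt_encode

# retrain from the pruned model with a retrain spec that points at it
detectnet_v2 train \
  -e /docker/folder/retrain_spec.txt \
  -r /docker/folder/results/peoplenet_ir_retrain \
  -k tlt_encode \
  -n resnet34_peoplenet_ir_pruned \
  --gpus 1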

But what causes such a radical difference?
I think TAO v5 added some useful functions and modified some commands.
I wonder if there are a lot of internal changes between TAO v4 and v5.

Do you have any opinion about that?

Best regards,

Thanks for the info. So it seems the 4.0.1 result can match the blog/GitHub.
For TAO5, could you share the docker name? Is it nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5?

I just followed the "Launcher CLI" section of the "TAO Toolkit Quick Start Guide" when I tried TAO5.
https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_quick_start_guide.html

My TAO5 environment was built by ~/tao-getting-started_v5.3.0$ bash setup/quickstart_launcher.sh.

I attached quickstart_launcher.sh and the log from executing that script.

tao_v5_installed.txt (16.5 KB)
quickstart_launcher.zip (2.1 KB)

It seems that the script built the hierarchy as below.

task_group:
  model:
    dockers:
      nvidia/tao/tao-toolkit:
        5.0.0-tf2.11.0:
        5.0.0-tf1.15.5:
        5.3.0-pyt:
  dataset:
    dockers:
      nvidia/tao/tao-toolkit:
        5.3.0-data-services:
  deploy:
    dockers:
      nvidia/tao/tao-toolkit:
        5.3.0-deploy:
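If it helps, I believe the same docker-to-task mapping can also be printed from the launcher itself with its info command (the output formatting may differ slightly by version):

$ tao info --verbose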

Best regards,

Thanks for the info. Will check the gap between TAO v4 and v5…

Please share your findings. I have similar issues with my own projects when training detectnet_v2 on TAO4 and TAO5. The accuracy on TAO5 is lower, and I cannot explain why.

As a workaround, please use the TAO 4.0.1 docker to train the detectnet_v2 network.
$ docker run --runtime=nvidia -it --rm -v /your/local/folder:/docker/folder nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5

Thank you @Morganh

The reason I changed from TAO4 to TAO5 is that training with 2 GPUs crashes with TAO4, but works with TAO5.

See the error message below from when I start the docker for TAO4. I start the docker for training with:

docker run -it --rm --runtime nvidia \
        -v $CWD:/dli/task \
        nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 \
        detectnet_v2 train \
        -e /dli/task/spec_files/combined_training_config.txt \
        -r /dli/task/tao_project/models/resnet18_detector \
        -k tlt_encode \
        -n resnet18_detector \
        --gpus 2

Error message:

[b8d39df045cf:187  :0:356]      cma_ep.c:81   process_vm_writev(pid=188 {0x72c209a02200,37632}-->{0x77f534b5eb00,37632}) returned -1: Operation not permitted
==== backtrace (tid:    356) ====
 0 0x00000000000039f2 uct_cma_ep_tx_error()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:81
 1 0x0000000000003d66 uct_cma_ep_tx()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:114
 2 0x000000000001e209 uct_scopy_ep_progress_tx()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/base/scopy_ep.c:151
 3 0x00000000000516d6 ucs_arbiter_dispatch_nonempty()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.c:321
 4 0x000000000001dcf1 ucs_arbiter_dispatch()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.h:386
 5 0x0000000000052467 ucs_callbackq_slow_proxy()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.c:404
 6 0x000000000004be9a ucs_callbackq_dispatch()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.h:211
 7 0x000000000004be9a uct_worker_progress()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/api/uct.h:2647
 8 0x000000000004be9a ucp_worker_progress()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucp/core/ucp_worker.c:2804
 9 0x0000000000037144 opal_progress()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/runtime/opal_progress.c:231
10 0x000000000003dc05 ompi_sync_wait_mt()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/threads/wait_sync.c:85
11 0x0000000000055fba ompi_request_default_wait_all()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/request/req_wait.c:234
12 0x00000000000929d3 ompi_coll_base_bcast_intra_generic()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/base/coll_base_bcast.c:98
13 0x0000000000092cc2 ompi_coll_base_bcast_intra_bintree()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/base/coll_base_bcast.c:272
14 0x0000000000006840 ompi_coll_tuned_bcast_intra_dec_fixed()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:649
15 0x000000000006cc11 PMPI_Bcast()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:114
16 0x000000000006cc11 PMPI_Bcast()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:41
17 0x00000000001055c2 horovod::common::MPIBroadcast::Execute()  /opt/horovod/horovod/common/ops/mpi_operations.cc:395
18 0x00000000001055c2 horovod::common::TensorShape::~TensorShape()  /opt/horovod/horovod/common/ops/../common.h:234
19 0x00000000001055c2 horovod::common::MPIBroadcast::Execute()  /opt/horovod/horovod/common/ops/mpi_operations.cc:396
20 0x00000000000da52d horovod::common::OperationManager::ExecuteBroadcast()  /opt/horovod/horovod/common/ops/operation_manager.cc:66
21 0x00000000000da901 horovod::common::OperationManager::ExecuteOperation()  /opt/horovod/horovod/common/ops/operation_manager.cc:116
22 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:297
23 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=()  /usr/include/c++/9/bits/shared_ptr_base.h:1265
24 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=()  /usr/include/c++/9/bits/shared_ptr.h:335
25 0x00000000000a902d horovod::common::Event::operator=()  /opt/horovod/horovod/common/common.h:185
26 0x00000000000a902d horovod::common::Status::operator=()  /opt/horovod/horovod/common/common.h:197
27 0x00000000000a902d PerformOperation()  /opt/horovod/horovod/common/operations.cc:297
28 0x00000000000a902d RunLoopOnce()  /opt/horovod/horovod/common/operations.cc:787
29 0x00000000000a902d BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:651
30 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
31 0x0000000000008609 start_thread()  ???:0
32 0x000000000011f133 clone()  ???:0
=================================
[b8d39df045cf:00187] *** Process received signal ***
[b8d39df045cf:00187] Signal: Aborted (6)
[b8d39df045cf:00187] Signal code:  (-6)
[b8d39df045cf:00187] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x72c747dc2090]
[b8d39df045cf:00187] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x72c747dc200b]
[b8d39df045cf:00187] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x72c747da1859]
[b8d39df045cf:00187] [ 3] /opt/hpcx/ucx/lib/libucs.so.0(+0x5a7dd)[0x72c713fb57dd]
[b8d39df045cf:00187] [ 4] /opt/hpcx/ucx/lib/libucs.so.0(+0x5fdc2)[0x72c713fbadc2]
[b8d39df045cf:00187] [ 5] /opt/hpcx/ucx/lib/libucs.so.0(ucs_log_dispatch+0xe4)[0x72c713fbb194]
[b8d39df045cf:00187] [ 6] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(+0x39f2)[0x72c7141c59f2]
[b8d39df045cf:00187] [ 7] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(uct_cma_ep_tx+0x186)[0x72c7141c5d66]
[b8d39df045cf:00187] [ 8] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_ep_progress_tx+0x69)[0x72c713f3a209]
[b8d39df045cf:00187] [ 9] /opt/hpcx/ucx/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0xb6)[0x72c713fac6d6]
[b8d39df045cf:00187] [10] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_iface_progress+0x81)[0x72c713f39cf1]
[b8d39df045cf:00187] [11] /opt/hpcx/ucx/lib/libucs.so.0(+0x52467)[0x72c713fad467]
[b8d39df045cf:00187] [12] /opt/hpcx/ucx/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x72c71412de9a]
[b8d39df045cf:00187] [13] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_progress+0x34)[0x72c716dcb144]
[b8d39df045cf:00187] [14] /opt/hpcx/ompi/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x72c716dd1c05]
[b8d39df045cf:00187] [15] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_wait_all+0x3ca)[0x72c716fb8fba]
[b8d39df045cf:00187] [16] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x503)[0x72c716ff59d3]
[b8d39df045cf:00187] [17] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_bintree+0xc2)[0x72c716ff5cc2]
[b8d39df045cf:00187] [18] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x72c713c54840]
[b8d39df045cf:00187] [19] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x72c716fcfc11]
[b8d39df045cf:00187] [20] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common12MPIBroadcast7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x3e2)[0x72c70e1df5c2]
[b8d39df045cf:00187] [21] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteBroadcastERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x72c70e1b452d]
[b8d39df045cf:00187] [22] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x151)[0x72c70e1b4901]
[b8d39df045cf:00187] [23] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x72c70e18302d]
[b8d39df045cf:00187] [24] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x72c74712cde4]
[b8d39df045cf:00187] [25] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x72c747d64609]
[b8d39df045cf:00187] [26] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x72c747e9e133]
[b8d39df045cf:00187] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node b8d39df045cf exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Execution status: FAIL

The output of nvidia-smi:

$ nvidia-smi
Tue Apr 16 20:37:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5000               Off | 00000000:17:00.0 Off |                  Off |
| 66%   84C    P2             212W / 230W |   6743MiB / 24564MiB |     83%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000               Off | 00000000:65:00.0 Off |                  Off |
| 30%   50C    P8              18W / 230W |      3MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

(It is currently computing on the first GPU.)

@blakec,
For TAO5, the parameters need to be tuned. For example, there is an improvement mentioned earlier in this thread (post #8 by Morganh). Could you try that?

@Morganh I did that, but the above-mentioned crash happens during startup in TAO4 with 2 GPUs, before it starts computing. I don't think it has to do with hyperparameters. But don't worry, I'll keep it on a single GPU for now.

Hi, @Morganh

Thanks for the info. Will check the gap between TAO v4 and v5…

Do you have any update on this?

Best regards,

Masaki Yamagishi / Ryoyo

Hi,
DetectNet_v2 will be ported to TensorFlow 2.0. The TAO team is working on that and it may be available in a future release.