Give me some instructions to improve the mAP from 0.0%, which appeared when executing the Notebook of TAO-Toolkit-Whitepaper-use-cases

I tried to follow the Jupyter Notebook included in the GitHub repository for retraining PeopleNet to adapt it to IR images.

But the result showed 0.0% mAP.

Could you give me any instructions to achieve around 80% mAP, as below?

I attached Notebook and spec files which I used.

These are the points I modified when I executed the Notebook:

Changed the tao commands to correspond to TAO version 5.
Changed the repository name from which I downloaded the pretrained model.
Changed the value of "dbscan_min_samples" to "1" in the training spec file, per the information below (see the sketch after this list).

Also, I only executed the training process with 80% of the training dataset.
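For reference, this is the kind of change I mean for the dbscan_min_samples item — a minimal sketch of the postprocessing_config clustering block with dbscan_min_samples set to 1. The surrounding field names and values are only illustrative assumptions from a default DetectNet_v2/PeopleNet spec; my actual spec is in the attached training_spec.txt:

postprocessing_config {
  target_class_config {
    key: "person"
    value {
      clustering_config {
        # illustrative defaults; only dbscan_min_samples was changed
        coverage_threshold: 0.005
        dbscan_eps: 0.265
        dbscan_min_samples: 1
        minimum_bounding_box_height: 4
      }
    }
  }
}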

Best regards,
tao-v5_peoplenet_IR.zip (294.4 KB)
training_spec.txt (3.2 KB)
tfrecord_spec.txt (264 Bytes)

Could you set enable_auto_resize: true and retry? Refer to DetectNet_v2 - NVIDIA Docs.

Thank you for your reply.

Is the modification below correct? I modified training_spec.txt as follows:

augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    crop_right: 960
    crop_bottom: 544
    min_bbox_width: 1.0
    min_bbox_height: 1.0
    output_image_channel: 3
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
  enable_auto_resize: true
}

An error message appeared and interrupted the training process with the modified spec file.

google.protobuf.text_format.ParseError: 42:3 : Message type "AugmentationConfig" has no field named "enable_auto_resize".
Execution status: FAIL

Please move enable_auto_resize into the preprocessing block instead:
preprocessing {
  output_image_width: 960
  output_image_height: 544
  crop_right: 960
  crop_bottom: 544
  min_bbox_width: 1.0
  min_bbox_height: 1.0
  output_image_channel: 3
  enable_auto_resize: true
}

Thank you for your reply.

It worked, but the mAP was 0.0153%.
It's still too low.
Do you have any other suggestions?

Best regards,

tao-v5_peoplenet_IR_20240405.zip (295.1 KB)

Will try on my side.
Could you try the settings below?

  batch_size_per_gpu: 4
  num_epochs: 60
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-07
      max_learning_rate: 5e-05
      soft_start: 0.10000000149
      annealing: 0.699999988079
    }
  }

Thank you for your kind support.

Will try on my side.

Thanks a lot. I look forward to your best solution.

BTW, I tried with the suggested values.
Then I got an improved mAP of 45.56%.

I'm wondering if you could explain to us why those values lead to this improvement?
I want to achieve a higher mAP, closer to 80%.
So I'd like to understand these values and get tips to improve the retraining.

Best regards,
tao-v5_peoplenet_IR_20240405_2.zip (328.9 KB)

I believe the GitHub notebook was trained with an old TAO version. So I suggest you use the TAO 4.0.1 version from TAO Toolkit | NVIDIA NGC instead.
$ docker run --runtime=nvidia -it --rm -v /your/local/folder:/docker/folder nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5
Then run the training.
# detectnet_v2 train xxx
It should work and give a similar result.
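For completeness, a sketch of the full train command inside the 4.0.1 container — the mount point matches the docker run example above, but the spec path, results directory, and model name are only illustrative placeholders (use your own values from the notebook):

# run inside the 4.0.1 container
detectnet_v2 train \
  -e /docker/folder/training_spec.txt \
  -r /docker/folder/results/peoplenet_ir_unpruned \
  -k tlt_encode \
  -n resnet34_peoplenet_ir \
  --gpus 1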

Glad to know it improved in TAO 5. The key is to tune the parameters so that the training loss shown during training decreases. For TAO 5.0, you can tune the batch size (bs), min_learning_rate, max_learning_rate, and the soft_start / annealing points.
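To make those knobs concrete, here is where they live in the spec — a minimal sketch of the training_config section with comments on what each value controls. The numbers repeat the suggestion above; other fields (regularizer, optimizer, cost scaling) are omitted for brevity:

training_config {
  batch_size_per_gpu: 4          # bs: larger batches smooth gradients but need more GPU memory
  num_epochs: 60
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-07   # learning rate at the very start and very end of training
      max_learning_rate: 5e-05   # peak learning rate reached after the warm-up (soft start) phase
      soft_start: 0.1            # fraction of training spent ramping up from min to max learning rate
      annealing: 0.7             # fraction of training after which the rate decays back toward the minimum
    }
  }
}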

Thank you for your kind support.

I tried retraining PeopleNet with the TAO 4.0.1 container, and I achieved 90% mAP.
I used 80% of the FLIR 1.3 dataset, and did pruning and retraining after that (sketched below).
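For reference, the pruning and retraining followed the usual DetectNet_v2 flow, run inside the 4.0.1 container. This is only a rough sketch with illustrative paths and an illustrative pruning threshold, not my exact values:

# prune the trained model (threshold is illustrative)
detectnet_v2 prune \
  -m /docker/folder/results/peoplenet_ir_unpruned/weights/resnet34_peoplenet_ir.tlt \
  -o /docker/folder/results/peoplenet_ir_pruned/resnet34_peoplenet_ir_pruned.tlt \
  -eq union \
  -pth 0.2 \
  -k tlt_encode

# retrain from the pruned model with a retrain spec that points at it
detectnet_v2 train \
  -e /docker/folder/retrain_spec.txt \
  -r /docker/folder/results/peoplenet_ir_retrain \
  -k tlt_encode \
  -n resnet34_peoplenet_ir_pruned \
  --gpus 1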

But what causes such a radical difference?
I think TAO v5 added some useful functions and modified some commands.
I wonder if there are a lot of internal changes between TAO v4 and v5.

Do you have any opinion about that?

Best regards,

Thanks for the info. So it seems the 4.0.1 result can match the blog/GitHub.
For TAO5, could you share the docker name? Is it nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5?

I just followed the "Launcher CLI" section of the "TAO Toolkit Quick Start Guide" when I tried TAO5.
https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_quick_start_guide.html

My TAO5 environment was built by ~/tao-getting-started_v5.3.0$ bash setup/quickstart_launcher.sh.

I attached quickstart_launcher.sh and the log from executing that script.

tao_v5_installed.txt (16.5 KB)
quickstart_launcher.zip (2.1 KB)

It seems that the script built the hierarchy as below.

task_group:
  model:
    dockers:
      nvidia/tao/tao-toolkit:
        5.0.0-tf2.11.0:
        5.0.0-tf1.15.5:
        5.3.0-pyt:
  dataset:
    dockers:
      nvidia/tao/tao-toolkit:
        5.3.0-data-services:
  deploy:
    dockers:
      nvidia/tao/tao-toolkit:
        5.3.0-deploy:
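If it helps, I believe the same docker-to-task mapping can also be printed from the launcher itself with its info command (the output formatting may differ slightly by version):

$ tao info --verbose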

Best regards,

Thanks for the info. Will check the gap between TAO v4 and v5…

Please share your findings. I have similar issues with my own projects when training detectnet_v2 on TAO4 and TAO5. The accuracy on TAO5 is lower, and I cannot explain why.

As a workaround, please use the TAO 4.0.1 docker to train the detectnet_v2 network.
$ docker run --runtime=nvidia -it --rm -v /your/local/folder:/docker/folder nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5

Thank you @Morganh

The reason I changed from TAO4 to TAO5 is that training with 2 GPUs crashes with TAO4, but works with TAO5.

See the error message below from when I start the docker for TAO4. I start the docker for training with:

docker run -it --rm --runtime nvidia \
        -v $CWD:/dli/task \
        nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 \
        detectnet_v2 train \
        -e /dli/task/spec_files/combined_training_config.txt \
        -r /dli/task/tao_project/models/resnet18_detector \
        -k tlt_encode \
        -n resnet18_detector \
        --gpus 2

Error message:

[b8d39df045cf:187  :0:356]      cma_ep.c:81   process_vm_writev(pid=188 {0x72c209a02200,37632}-->{0x77f534b5eb00,37632}) returned -1: Operation not permitted
==== backtrace (tid:    356) ====
 0 0x00000000000039f2 uct_cma_ep_tx_error()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:81
 1 0x0000000000003d66 uct_cma_ep_tx()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/cma/cma_ep.c:114
 2 0x000000000001e209 uct_scopy_ep_progress_tx()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/sm/scopy/base/scopy_ep.c:151
 3 0x00000000000516d6 ucs_arbiter_dispatch_nonempty()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.c:321
 4 0x000000000001dcf1 ucs_arbiter_dispatch()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/arbiter.h:386
 5 0x0000000000052467 ucs_callbackq_slow_proxy()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.c:404
 6 0x000000000004be9a ucs_callbackq_dispatch()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucs/datastruct/callbackq.h:211
 7 0x000000000004be9a uct_worker_progress()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/uct/api/uct.h:2647
 8 0x000000000004be9a ucp_worker_progress()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ucx-779da13/src/ucp/core/ucp_worker.c:2804
 9 0x0000000000037144 opal_progress()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/runtime/opal_progress.c:231
10 0x000000000003dc05 ompi_sync_wait_mt()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/opal/threads/wait_sync.c:85
11 0x0000000000055fba ompi_request_default_wait_all()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/request/req_wait.c:234
12 0x00000000000929d3 ompi_coll_base_bcast_intra_generic()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/base/coll_base_bcast.c:98
13 0x0000000000092cc2 ompi_coll_base_bcast_intra_bintree()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/base/coll_base_bcast.c:272
14 0x0000000000006840 ompi_coll_tuned_bcast_intra_dec_fixed()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:649
15 0x000000000006cc11 PMPI_Bcast()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:114
16 0x000000000006cc11 PMPI_Bcast()  /build-result/src/hpcx-v2.12-gcc-inbox-ubuntu20.04-cuda11-gdrcopy2-nccl2.12-x86_64/ompi-1c67bf1c6a156f1ae693f86a38f9d859e99eeb1f/ompi/mpi/c/profile/pbcast.c:41
17 0x00000000001055c2 horovod::common::MPIBroadcast::Execute()  /opt/horovod/horovod/common/ops/mpi_operations.cc:395
18 0x00000000001055c2 horovod::common::TensorShape::~TensorShape()  /opt/horovod/horovod/common/ops/../common.h:234
19 0x00000000001055c2 horovod::common::MPIBroadcast::Execute()  /opt/horovod/horovod/common/ops/mpi_operations.cc:396
20 0x00000000000da52d horovod::common::OperationManager::ExecuteBroadcast()  /opt/horovod/horovod/common/ops/operation_manager.cc:66
21 0x00000000000da901 horovod::common::OperationManager::ExecuteOperation()  /opt/horovod/horovod/common/ops/operation_manager.cc:116
22 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:297
23 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=()  /usr/include/c++/9/bits/shared_ptr_base.h:1265
24 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=()  /usr/include/c++/9/bits/shared_ptr.h:335
25 0x00000000000a902d horovod::common::Event::operator=()  /opt/horovod/horovod/common/common.h:185
26 0x00000000000a902d horovod::common::Status::operator=()  /opt/horovod/horovod/common/common.h:197
27 0x00000000000a902d PerformOperation()  /opt/horovod/horovod/common/operations.cc:297
28 0x00000000000a902d RunLoopOnce()  /opt/horovod/horovod/common/operations.cc:787
29 0x00000000000a902d BackgroundThreadLoop()  /opt/horovod/horovod/common/operations.cc:651
30 0x00000000000d6de4 std::error_code::default_error_condition()  ???:0
31 0x0000000000008609 start_thread()  ???:0
32 0x000000000011f133 clone()  ???:0
=================================
[b8d39df045cf:00187] *** Process received signal ***
[b8d39df045cf:00187] Signal: Aborted (6)
[b8d39df045cf:00187] Signal code:  (-6)
[b8d39df045cf:00187] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x72c747dc2090]
[b8d39df045cf:00187] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x72c747dc200b]
[b8d39df045cf:00187] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x72c747da1859]
[b8d39df045cf:00187] [ 3] /opt/hpcx/ucx/lib/libucs.so.0(+0x5a7dd)[0x72c713fb57dd]
[b8d39df045cf:00187] [ 4] /opt/hpcx/ucx/lib/libucs.so.0(+0x5fdc2)[0x72c713fbadc2]
[b8d39df045cf:00187] [ 5] /opt/hpcx/ucx/lib/libucs.so.0(ucs_log_dispatch+0xe4)[0x72c713fbb194]
[b8d39df045cf:00187] [ 6] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(+0x39f2)[0x72c7141c59f2]
[b8d39df045cf:00187] [ 7] /opt/hpcx/ucx/lib/ucx/libuct_cma.so.0(uct_cma_ep_tx+0x186)[0x72c7141c5d66]
[b8d39df045cf:00187] [ 8] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_ep_progress_tx+0x69)[0x72c713f3a209]
[b8d39df045cf:00187] [ 9] /opt/hpcx/ucx/lib/libucs.so.0(ucs_arbiter_dispatch_nonempty+0xb6)[0x72c713fac6d6]
[b8d39df045cf:00187] [10] /opt/hpcx/ucx/lib/libuct.so.0(uct_scopy_iface_progress+0x81)[0x72c713f39cf1]
[b8d39df045cf:00187] [11] /opt/hpcx/ucx/lib/libucs.so.0(+0x52467)[0x72c713fad467]
[b8d39df045cf:00187] [12] /opt/hpcx/ucx/lib/libucp.so.0(ucp_worker_progress+0x6a)[0x72c71412de9a]
[b8d39df045cf:00187] [13] /opt/hpcx/ompi/lib/libopen-pal.so.40(opal_progress+0x34)[0x72c716dcb144]
[b8d39df045cf:00187] [14] /opt/hpcx/ompi/lib/libopen-pal.so.40(ompi_sync_wait_mt+0xb5)[0x72c716dd1c05]
[b8d39df045cf:00187] [15] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_request_default_wait_all+0x3ca)[0x72c716fb8fba]
[b8d39df045cf:00187] [16] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_generic+0x503)[0x72c716ff59d3]
[b8d39df045cf:00187] [17] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_coll_base_bcast_intra_bintree+0xc2)[0x72c716ff5cc2]
[b8d39df045cf:00187] [18] /opt/hpcx/ompi/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x40)[0x72c713c54840]
[b8d39df045cf:00187] [19] /opt/hpcx/ompi/lib/libmpi.so.40(MPI_Bcast+0x41)[0x72c716fcfc11]
[b8d39df045cf:00187] [20] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common12MPIBroadcast7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x3e2)[0x72c70e1df5c2]
[b8d39df045cf:00187] [21] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteBroadcastERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x72c70e1b452d]
[b8d39df045cf:00187] [22] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x151)[0x72c70e1b4901]
[b8d39df045cf:00187] [23] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x72c70e18302d]
[b8d39df045cf:00187] [24] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x72c74712cde4]
[b8d39df045cf:00187] [25] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x72c747d64609]
[b8d39df045cf:00187] [26] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x72c747e9e133]
[b8d39df045cf:00187] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node b8d39df045cf exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Execution status: FAIL

The output of nvidia-smi:

$ nvidia-smi
Tue Apr 16 20:37:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5000               Off | 00000000:17:00.0 Off |                  Off |
| 66%   84C    P2             212W / 230W |   6743MiB / 24564MiB |     83%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000               Off | 00000000:65:00.0 Off |                  Off |
| 30%   50C    P8              18W / 230W |      3MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

(It is currently computing on the first GPU.)

@blakec,
For TAO5, the parameters need to be tuned. For example, there is an improvement mentioned earlier in this thread (post #8 by Morganh). Could you try that?

@Morganh I did that, but the above-mentioned crash happens during startup in TAO4 with 2 GPUs, before it starts computing. I don't think it has to do with hyperparameters. But don't worry, I'll keep it on a single GPU for now.

Hi, @Morganh

Thanks for the info. Will check the gap between TAO v4 and v5…

Do you have any update on this?

Best regards,

Masaki Yamagishi / Ryoyo

Hi,
DetectNet_v2 will be ported to TensorFlow 2.0. The TAO team is working on that and it may be available in a future release.