Unable to successfully execute tao command in cv_samples_v1.4.0

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc): GeForce RTX 3090
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc): Detectnet_v2
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here)
tao info --verbose:

Configuration of the TAO Toolkit Instance

dockers:
        nvidia/tao/tao-toolkit-tf:
                v3.22.05-tf1.15.5-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. augment
                                2. bpnet
                                3. classification
                                4. dssd
                                5. faster_rcnn
                                6. emotionnet
                                7. efficientdet
                                8. fpenet
                                9. gazenet
                                10. gesturenet
                                11. heartratenet
                                12. lprnet
                                13. mask_rcnn
                                14. multitask_classification
                                15. retinanet
                                16. ssd
                                17. unet
                                18. yolo_v3
                                19. yolo_v4
                                20. yolo_v4_tiny
                                21. converter
                v3.22.05-tf1.15.4-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. detectnet_v2
        nvidia/tao/tao-toolkit-pyt:
                v3.22.05-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. speech_to_text
                                2. speech_to_text_citrinet
                                3. speech_to_text_conformer
                                4. action_recognition
                                5. pointpillars
                                6. pose_classification
                                7. spectro_gen
                                8. vocoder
                v3.21.11-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. text_classification
                                2. question_answering
                                3. token_classification
                                4. intent_slot_classification
                                5. punctuation_and_capitalization
        nvidia/tao/tao-toolkit-lm:
                v3.22.05-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. n_gram
format_version: 2.0
toolkit_version: 3.22.05
published_date: 05/25/2022

• Training spec file(If have, please share here)
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I’m following the instructions in the TAO Quick Start Guide, and I was able to train detectnet_v2 in cv_samples_v1.4.0 on my computer with a GeForce GTX 1660 SUPER. However, when I use another computer with a GeForce RTX 3090, it gets stuck executing the command !tao detectnet_v2 dataset_convert. Here is the log:

Converting Tfrecords for kitti trainval dataset
2022-08-15 03:36:57,503 [INFO] root: Registry: ['nvcr.io']
2022-08-15 03:36:57,581 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-rueeond1 because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.

I don’t know what happened, since the log gives no other useful information to go on.
Here is the output of nvidia-smi while !tao detectnet_v2 dataset_convert is running:
(I’m pretty sure that the Docker container is able to use the GPU)

Mon Aug 15 04:05:18 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:00:04.0 Off |                  N/A |
|  0%   58C    P8    18W / 350W |      1MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

PS. I also tried Triton Server, and again it runs on the 1660 but fails to run on the 3090.
Is there anything else I should check?

To narrow this down, could you run the command inside the docker?
Please open a terminal, then run
$ tao detectnet_v2 run /bin/bash

then, in the docker,
# detectnet_v2 dataset_convert xxx

Hi Morganh
I decided to use my 1660 to train my model, sorry for bothering you (I still tried to run the command as you suggested, and it failed again).
But I have another question: is it possible to convert the .etlt file to a framework other than TensorRT? TensorRT only supports GPU devices.

No. The .etlt file can only be converted to a TensorRT engine.

Please share the full log.

This is the result. It gets stuck at “Using TensorFlow backend.”
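One way to capture a complete log from a command that hangs is to run it under a time limit and redirect all of its output to a file. The sketch below assumes GNU coreutils `timeout` is available; `run_with_log` is a hypothetical helper name, not part of TAO Toolkit.

```shell
# Hypothetical helper: run a command with a time limit and keep its full output.
# Usage: run_with_log LOGFILE SECONDS COMMAND [ARGS...]
# Example (inside the container):
#   run_with_log /tmp/detectnet_v2_help.log 300 detectnet_v2 -h
run_with_log() {
    log_file=$1
    limit=$2
    shift 2
    # timeout (GNU coreutils) exits with status 124 when the limit is reached
    timeout "$limit" "$@" >"$log_file" 2>&1
    status=$?
    # Show whatever the command printed before it finished or was stopped
    cat "$log_file"
    if [ "$status" -eq 124 ]; then
        echo "command timed out after ${limit}s; log saved to $log_file" >&2
    fi
    return "$status"
}
```

An exit status of 124 means the command was still running when the limit expired, which distinguishes a hang from a crash. The saved log file can then be attached to the thread.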

So, do you mean that a model trained by TAO Toolkit cannot be converted to another framework (from either the .tlt file or the .etlt file)? But TAO Toolkit is actually using a TensorFlow backend, right?

Which framework?

How about other commands?
# detectnet_v2 -h

  1. Such as TensorFlow, ONNX, or some other framework that can be deployed to Triton. What I really want to do is deploy a TAO model to Triton and run it on CPU.

  2. # detectnet_v2 -h also gets stuck at “Using TensorFlow backend”.

The .etlt file can be converted to a TensorRT engine. Then users can deploy it in Triton.
You can also deploy the .etlt file with tao-toolkit-triton-apps directly. See the NVIDIA-AI-IOT/tao-toolkit-triton-apps repository on GitHub (sample app code for deploying TAO Toolkit trained models to Triton).

As for the hang, it is not expected. Other users can run TAO on a GeForce RTX 3090.
Can you check the software requirements?
https://docs.nvidia.com/tao/tao-toolkit/text/tao_toolkit_quick_start_guide.html#software-requirements

  1. I have tried the NVIDIA-AI-IOT apps, but I’m curious whether there is any way to run a TAO model on CPU (maybe converted from the .tlt file rather than the .etlt file).
  2. Perhaps the reason is my software setup:
    OS: Ubuntu 18.04
    docker-API: 1.41
    nvidia-docker2: 2.11.0-1
    (I don’t know how to check the version of nvidia-container-runtime)
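For the version check mentioned above, a minimal sketch on a Debian-based system such as Ubuntu could query dpkg for the relevant packages. The package names below are assumptions about a typical install and may differ on a given machine; `pkg_version` is a hypothetical helper name.

```shell
# Print the installed version of each Debian package named on the command line,
# or "not installed" when dpkg has no version recorded for it.
pkg_version() {
    for pkg in "$@"; do
        version=$(dpkg-query -W -f='${Version}' "$pkg" 2>/dev/null)
        echo "$pkg: ${version:-not installed}"
    done
}

# Package names are assumptions; adjust them to what is installed locally.
pkg_version nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime

# Docker Engine API version, only if docker is on PATH and the daemon answers
if command -v docker >/dev/null 2>&1; then
    docker version --format 'docker API: {{.Server.APIVersion}}' 2>/dev/null \
        || echo "docker API: daemon not reachable"
fi
```

The output can be compared line by line against the software requirements page linked earlier in the thread.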

Please downgrade the driver from 515.65.01 to 510 and retry.
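After changing the driver, the installed version can be confirmed before retrying. The following is a small sketch that reads the version from nvidia-smi and extracts its major number; `driver_major` is a hypothetical helper name introduced here for illustration.

```shell
# Extract the major version from a driver string such as 515.65.01
driver_major() {
    echo "${1%%.*}"
}

if command -v nvidia-smi >/dev/null 2>&1; then
    # Query only the driver version; take the first GPU if there are several
    installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    echo "installed driver: $installed (major $(driver_major "$installed"))"
else
    echo "nvidia-smi not found on PATH"
fi
```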
