TAO Toolkit Container (for Conversational AI) Setup Issue

Continuing the discussion from Chmod: cannot access '/opt/ngccli/ngc': No such file or directory:

• Hardware : A100/V100
• Network Type: NA
• TLT Version: v3.22.05-py3
• How to reproduce the issue? Running the following command:
sudo docker run --runtime=nvidia -it -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 --shm-size=40g nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3
results in:

chmod: cannot access '/opt/ngccli/ngc': No such file or directory

Hi,

I am aware that this issue has been reported earlier and that @Morganh has suggested solutions, as quoted.

However, I am new to Docker, so I have a few possibly basic questions:

  1. How do I use the information in the quoted update? Does it mean I should abandon the container approach and follow the installation instructions at https://pypi.org/project/nvidia-tao/, where this issue has been resolved?

  2. Or, if it means that I need to update the nvidia-tao version within the container, how do I enter the container?

  3. Would the first workaround suggested (just add --entrypoint "") still work?

Yes, it still works.

I suggest using this solution.
Then log in to the TAO container with one of the commands below, for example:
$ tao ssd run /bin/bash
or
$ tao detectnet_v2 run /bin/bash
and likewise for other tasks.

I used the --entrypoint "" approach (sudo docker run --runtime=nvidia -it --entrypoint "" -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 --shm-size=40g --name tao3 nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3 /bin/bash), landed inside the container, updated the nvidia-tao version, and ran the tao ssd run /bin/bash command. However, I am getting the error shown in the screenshot.

I also tried the docker login command as suggested, but it reports "command not found", as seen in the screenshot.

As you suggested in "Is there some spacial things about bpnet? A question about "tlt bpnet dataset_convert " for bpnet" (#5 by Morganh), I also tried docker run hello-world, but that does not work either.

There are two ways of running inside the TAO container.

  1. Use "docker run", just as you did.
    $ sudo docker run --runtime=nvidia -it --entrypoint "" -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 --shm-size=40g --name tao3 nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3 /bin/bash

    You are then already inside the TAO 22.05 container, so there is no need to run "tao ssd" again inside it.
    Just run the task directly, with something similar to the line below (arguments omitted).
    # ssd train ...

  2. Use the TAO launcher from the host.
    $ tao ssd run /bin/bash
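
A quick way to see which of these commands exist in the shell you are currently in is to test each name with `command -v`. This is only a sketch: `check_cmd` is a hypothetical helper, the names checked are simply the ones mentioned in this thread, and whether a task entry point is on the PATH depends on which image was started. On the host you would expect docker (and tao, if the launcher is installed) to be found; inside a container started with docker run you would expect only that image's task entry points.

```shell
#!/bin/sh
# Report whether a command name resolves in the current shell.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found"
  fi
}

# Names taken from this thread; adjust to the task you care about.
for name in docker tao ssd speech_to_text_conformer; do
  check_cmd "$name"
done
```

If a task name is reported as not found inside a container, the likely explanation is that the container was started from an image that does not ship that task.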

Neither the “docker” nor the “ssd” command is recognised; both result in “command not found”. Please refer to the screenshots below.

I found the reason.
Please switch to the nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 image if you want to run the ssd network; the tao-toolkit-pyt image does not include it.

See the info below.

$ tao info --verbose
Configuration of the TAO Toolkit Instance

dockers:
        nvidia/tao/tao-toolkit-tf:
                v3.22.05-tf1.15.5-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. augment
                                2. bpnet
                                3. classification
                                4. dssd
                                5. faster_rcnn
                                6. emotionnet
                                7. efficientdet
                                8. fpenet
                                9. gazenet
                                10. gesturenet
                                11. heartratenet
                                12. lprnet
                                13. mask_rcnn
                                14. multitask_classification
                                15. retinanet
                                16. ssd
                                17. unet
                                18. yolo_v3
                                19. yolo_v4
                                20. yolo_v4_tiny
                                21. converter
                v3.22.05-tf1.15.4-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. detectnet_v2
        nvidia/tao/tao-toolkit-pyt:
                v3.22.05-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. speech_to_text
                                2. speech_to_text_citrinet
                                3. speech_to_text_conformer
                                4. action_recognition
                                5. pointpillars
                                6. pose_classification
                                7. spectro_gen
                                8. vocoder
                v3.21.11-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. text_classification
                                2. question_answering
                                3. token_classification
                                4. intent_slot_classification
                                5. punctuation_and_capitalization
        nvidia/tao/tao-toolkit-lm:
                v3.22.05-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. n_gram
format_version: 2.0
toolkit_version: 3.22.05
published_date: 05/25/2022
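
The listing above can be read as a lookup table from task name to image, where the full image reference is docker_registry/name:tag. A minimal sketch of that lookup follows; `image_for_task` is a hypothetical helper written for illustration, not part of the TAO launcher, and it covers only a handful of the tasks listed for toolkit version 3.22.05.

```shell
#!/bin/sh
# Map a task name to the image listed for it in the
# `tao info --verbose` output (toolkit_version 3.22.05).
# Only a few tasks are included; extend the patterns as needed.
image_for_task() {
  case "$1" in
    ssd|dssd|retinanet|yolo_v4)
      echo "nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3" ;;
    detectnet_v2)
      echo "nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3" ;;
    speech_to_text|speech_to_text_citrinet|speech_to_text_conformer)
      echo "nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3" ;;
    n_gram)
      echo "nvcr.io/nvidia/tao/tao-toolkit-lm:v3.22.05-py3" ;;
    *)
      echo "unknown task: $1" >&2
      return 1 ;;
  esac
}

image_for_task speech_to_text_conformer
```

Running `tao info --verbose` on your own installation remains the authoritative source, since the mapping changes between toolkit versions.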

I am sorry if I wasn’t clear before. I am actually trying to train the speech_to_text_conformer network; I only used the ssd command as an example. I didn’t know there were separate images for different networks. So I need to continue using nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3, right?

I ran the following command again:
sudo docker run --runtime=nvidia -it --entrypoint "" -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 --shm-size=40g --name tao6 nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3 /bin/bash

Now I am able to download the spec files and will hopefully start training soon. Thank you for your patience and help so far.

A clarification: the data we use will remain on the local machine, is that correct?

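Yes, for speech_to_text_conformer, keep using that image.

No, you need to add -v /yourlocalfolder:/dockerfolder to the docker run command to make a local folder available inside the container.

Please look up the usage of the docker “-v” option.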
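
For reference, a minimal sketch of the earlier docker run command with a bind mount added. `run_tao_shell` is a hypothetical wrapper, and the paths in the usage line are placeholders, not paths TAO requires. The NVIDIA_VISIBLE_DEVICES and --name options from the original command are left out for brevity and can be added back as needed.

```shell
#!/bin/sh
# Start an interactive shell in the TAO PyTorch image with one host folder mounted.
#   $1: folder on the host
#   $2: path where that folder appears inside the container
# Prefix docker with sudo if your setup needs it, as in the commands in this thread.
run_tao_shell() {
  docker run --runtime=nvidia -it --entrypoint "" \
    -v "$1:$2" \
    --shm-size=40g \
    nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3 /bin/bash
}

# Example (hypothetical paths):
# run_tao_shell "$HOME/asr_data" /data
```

With a bind mount, anything written under the container path is written to the host folder, so it is still there after the container exits.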

I understand the usage of “-v”. My doubt stems from my understanding that TAO Toolkit needs internet access the first time training is run; is that correct? My reference for this is the document you wrote on offline training with TAO. Can you please explain why exactly internet access is required, apart from downloading the TAO image? Is there a risk of exposing our data to cloud servers at any point?

Yes, internet access is needed in order to “docker pull” the TAO container.

Nothing else is required.

When you run TAO on your local machine, your data stays on your local machine.
Do you mean that you are going to run TAO training on a cloud server?

The TAO training will happen on a local machine.

Thank you so much for the clarifications. I have a much better understanding now.
