• Hardware : A100/V100
• Network Type: NA
• TLT Version: v3.22.05-py3
• How to reproduce the issue? Running the following command: sudo docker run --runtime=nvidia -it -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 --shm-size=40g nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3 results in:
chmod: cannot access '/opt/ngccli/ngc': No such file or directory
Hi,
I am aware that this issue has been reported before and that @Morganh has suggested solutions, as quoted.
However, I am new to Docker, so I have a few possibly silly questions:
How do I use the information in the quoted update? Does it mean I should abandon the container approach and follow the installation instructions at https://pypi.org/project/nvidia-tao/, where this issue has been resolved?
Or, if it means that I need to update the nvidia-tao version within the container, how do I enter the container?
Would the first suggested workaround (just add --entrypoint "") still work?
I suggest using this solution.
Then log in to the TAO container with a command like the ones below.
For example,
$ tao ssd run /bin/bash
or
$ tao detectnet_v2 run /bin/bash
and so on for other networks.
I used the --entrypoint "" approach (sudo docker run --runtime=nvidia -it --entrypoint "" -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 --shm-size=40g --name tao3 nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3 /bin/bash), landed inside the container, updated the nvidia-tao version, and ran the tao ssd run /bin/bash command. However, I am getting the error shown in the screenshot.
There are two ways of running inside the TAO container.
One is to use "docker run", just as you did:
$ sudo docker run --runtime=nvidia -it --entrypoint "" -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 --shm-size=40g --name tao3 nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3 /bin/bash
With this command you are already inside the TAO 22.05 container, so there is no need to run "tao ssd" again inside it.
You just need to run something similar to the following, calling the task directly:
# ssd train balabala
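To make the "docker run" route easier to reuse, here is a minimal sketch of it as a small script. The image tag, GPU list, and flags are the ones already used in this thread; tao_shell_cmd and the RUN variable are hypothetical names introduced only for this example. By default the script just prints the command so you can check it first.

```shell
#!/bin/sh
# Sketch: open an interactive shell in the TAO PyTorch image while bypassing
# the image entrypoint (the step that fails on /opt/ngccli/ngc).
IMAGE="nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3"
GPUS="0,1,2,3,4,5,6,7"

# tao_shell_cmd is a hypothetical helper: it only prints the command line.
tao_shell_cmd() {
    printf 'sudo docker run --runtime=nvidia -it --entrypoint "" -e '
    printf 'NVIDIA_VISIBLE_DEVICES=%s --shm-size=40g %s /bin/bash\n' "$GPUS" "$IMAGE"
}

# Dry run by default; set RUN=1 to actually execute the printed command.
if [ "${RUN:-0}" = "1" ]; then
    eval "$(tao_shell_cmd)"
else
    tao_shell_cmd
fi
```

Note the straight quotes in --entrypoint "". If the command is copied with curly quotes, the shell passes the quote characters themselves as the entrypoint instead of an empty string.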
I am sorry if I wasn't clear before. I am trying to train the speech_to_text_conformer network; I used the ssd command only as an example. I didn't know there were separate images for different networks. So I need to continue using nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3, right?
I ran the following command again: sudo docker run --runtime=nvidia -it --entrypoint "" -e NVIDIA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 --shm-size=40g --name tao6 nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3 /bin/bash
Now I am able to download the spec files and will hopefully start training soon. Thank you for your patience and help so far.
A clarification: the data we use will remain on the local machine, is that correct?
I understand the usage of "-v". My doubt stems from my understanding that the TAO Toolkit needs internet access to carry out training the first time, is that correct? My reference for this is the document you wrote on offline training with TAO, here. Can you please explain why exactly the internet is required, apart from downloading the TAO image? Is there a risk of exposing our data to cloud servers at any point?
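For reference, the "-v" bind mount mentioned above can be sketched as follows. The host and container paths are hypothetical placeholders, and bind_mount is a helper name made up for this example. A bind mount makes a host directory visible inside the container without copying it, so the files stay on the local disk. This only shows where the files live; it does not by itself answer whether any TAO step contacts the internet.

```shell
#!/bin/sh
# Sketch: expose a local dataset directory inside the TAO container.
IMAGE="nvcr.io/nvidia/tao/tao-toolkit-pyt:v3.22.05-py3"
HOST_DATA="${HOST_DATA:-$HOME/asr_data}"   # hypothetical host directory
CONTAINER_DATA="/data"                     # hypothetical path inside the container

# bind_mount is a hypothetical helper: prints the -v argument for one directory.
# It assumes the paths contain no spaces.
bind_mount() {
    printf '%s %s:%s' '-v' "$1" "$2"
}

# Dry run: print the full command so it can be reviewed before running it.
printf 'sudo docker run --runtime=nvidia -it --entrypoint "" %s %s /bin/bash\n' \
    "$(bind_mount "$HOST_DATA" "$CONTAINER_DATA")" "$IMAGE"
```

Inside the container, the dataset would then appear under /data, and anything written there lands in the host directory.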