TAO Toolkit container not installing

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc) : x86_64 GPU machine
• Network Type (Detectnet_v2/Faster_rcnn/Yolo_v4/LPRnet/Mask_rcnn/Classification/etc) : detectnet_v2
• TLT Version (Please run “tlt info --verbose” and share “docker_tag” here): Command ‘tlt’ not found
• Training spec file(If have, please share here): NA
• How to reproduce the issue ? (This is for errors. Please share the command line and the detailed log here.)

I’m trying a hands-on with tao-toolkit, following the steps provided for detectnet_v2.

Command used: tao detectnet_v2 dataset-convert [-h]
The command keeps running, with the output pasted below:

tao detectnet_v2 --help
~/.tao_mounts.json wasn't found. Falling back to obtain mount points and docker configs from ~/.tlt_mounts.json.
Please note that this will be deprecated going forward.
2022-06-02 09:46:31,944 [INFO] root: Registry: ['nvcr.io']
2022-06-02 09:46:32,029 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
2022-06-02 09:46:32,364 [INFO] tlt.components.docker_handler.docker_handler: The required docker doesn't exist locally/the manifest has changed. Pulling a new docker.
2022-06-02 09:46:32,364 [INFO] tlt.components.docker_handler.docker_handler: Pulling the required container. This may take several minutes if you're doing this for the first time. Please wait here.
...
Pulling from repository: nvcr.io/nvidia/tao/tao-toolkit-tf

Kindly help me fix the issue. I have no idea what I am missing in the setup.

“tlt” has been renamed to “tao”. If you have installed tao-launcher, you can run “$ tao info --verbose”.

The first time you run a tao command, it pulls the corresponding tao docker image.
See the log line “Pulling from repository: nvcr.io/nvidia/tao/tao-toolkit-tf”.

Can you check whether the tao container has been pulled?
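
One way to check this from the host is to list the local images and look for the exact repository:tag. The sketch below only parses text with one `repository:tag` entry per line, which is what `docker images --format '{{.Repository}}:{{.Tag}}'` prints. The helper name `image_is_local` and the sample listing are made up for illustration.

```python
def image_is_local(listing: str, image_ref: str) -> bool:
    """Return True if image_ref appears as a whole line in the listing.

    `listing` is expected to hold one `repository:tag` entry per line, e.g.
    the output of: docker images --format '{{.Repository}}:{{.Tag}}'
    """
    wanted = image_ref.strip()
    return any(line.strip() == wanted for line in listing.splitlines())


# Hypothetical listing from a machine where only one TAO image is present.
sample_listing = """\
nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
ubuntu:20.04
"""

# Image that the launcher log says it is trying to pull.
required = "nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3"
print(image_is_local(sample_listing, required))
```

If the required image is missing from the listing, the launcher will try to pull it again on the next tao command.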

Hi @Morganh ,

Output for command: tao info --verbose

Configuration of the TAO Toolkit Instance

dockers:
        nvidia/tao/tao-toolkit-tf:
                v3.21.11-tf1.15.5-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. augment
                                2. bpnet
                                3. classification
                                4. dssd
                                5. emotionnet
                                6. efficientdet
                                7. fpenet
                                8. gazenet
                                9. gesturenet
                                10. heartratenet
                                11. lprnet
                                12. mask_rcnn
                                13. multitask_classification
                                14. retinanet
                                15. ssd
                                16. unet
                                17. yolo_v3
                                18. yolo_v4
                                19. yolo_v4_tiny
                                20. converter
                v3.21.11-tf1.15.4-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. detectnet_v2
                                2. faster_rcnn
        nvidia/tao/tao-toolkit-pyt:
                v3.21.11-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. speech_to_text
                                2. speech_to_text_citrinet
                                3. text_classification
                                4. question_answering
                                5. token_classification
                                6. intent_slot_classification
                                7. punctuation_and_capitalization
                                8. action_recognition
                v3.22.02-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. spectro_gen
                                2. vocoder
        nvidia/tao/tao-toolkit-lm:
                v3.21.08-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. n_gram
format_version: 2.0
toolkit_version: 3.22.02
published_date: 02/28/2022

Output for command: tao list

Traceback (most recent call last):
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 426, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1377, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 281, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.6/socket.py", line 586, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 428, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/soundarrajan/Envs/tao_exp_venv/bin/tao", line 8, in <module>
    sys.exit(main())
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/tlt/entrypoint/entrypoint.py", line 110, in main
    parsed_args
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/tlt/components/instance_handler/local_instance.py", line 329, in launch_command
    self.list_running_jobs()
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/tlt/components/instance_handler/local_instance.py", line 188, in list_running_jobs
    container_list = self._get_running_containers()
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/tlt/components/instance_handler/local_instance.py", line 167, in _get_running_containers
    return [container for container in self._docker_client.containers.list()
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/docker/models/containers.py", line 951, in list
    containers.append(self.get(r['Id']))
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/docker/models/containers.py", line 887, in get
    resp = self.client.api.inspect_container(container_id)
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/docker/api/container.py", line 771, in inspect_container
    self._get(self._url("/containers/{0}/json", container)), True
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/docker/utils/decorators.py", line 46, in inner
    return f(self, *args, **kwargs)
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/docker/api/client.py", line 228, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/home/soundarrajan/Envs/tao_exp_venv/lib/python3.6/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)

Output for command: tao detectnet_v2 --help
It has been running for more than 1 hour, showing the same output as below.

2022-06-02 11:54:51,708 [INFO] root: Registry: ['nvcr.io']
2022-06-02 11:54:51,794 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
2022-06-02 11:54:52,122 [INFO] tlt.components.docker_handler.docker_handler: The required docker doesn't exist locally/the manifest has changed. Pulling a new docker.
2022-06-02 11:54:52,123 [INFO] tlt.components.docker_handler.docker_handler: Pulling the required container. This may take several minutes if you're doing this for the first time. Please wait here.
...
Pulling from repository: nvcr.io/nvidia/tao/tao-toolkit-tf

I checked the nvcr.io login using the command: docker login
Login Succeeded.

Is there anything I’m missing? Kindly help me identify and fix it.

I have the same problem. Will stay here and listen.


Alex from Starburst

Could you pull the docker images manually and then retry?
docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
and
docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3

Also, please run the command below.
$ docker login nvcr.io

I tried pulling the docker images, but the command seems to get stuck somewhere. It has been running for a long time.

docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
Output:

docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
v3.21.11-tf1.15.5-py3: Pulling from nvidia/tao/tao-toolkit-tf
e4ca327ec0e7: Pulling fs layer
b99d76492afe: Pulling fs layer
b52f6fb756a5: Pulling fs layer
5a09baa528d6: Waiting
2df930949a05: Waiting
934eb401e46c: Waiting
3244eb9db036: Waiting
e2e27029eb8e: Waiting
bb65579dd223: Pulling fs layer
bb65579dd223: Waiting
19be41539f88: Waiting
cefdfdffffa4: Waiting
c492d17ef893: Waiting
c6a37e1a8568: Waiting
244a64bffce5: Waiting
936011990b9b: Pulling fs layer
a84d68ccf4da: Waiting
3b9afd93de94: Waiting
fecad3989e11: Waiting
934eb401e46c: Downloading [===============================================>   ]  959.8MB/1.021GB
ad5b50ee8663: Pulling fs layer
f85d966d579f: Waiting
80b7ce251537: Pulling fs layer
e6701f6773d4: Waiting
75e85ce3fde9: Waiting
b424c8ff2471: Waiting
c5ad1732190d: Pulling fs layer
f36e72a8fc08: Waiting
d1dba32b7409: Pulling fs layer
e1e8259b0476: Waiting
13c550a1976c: Waiting
26f5884ea1a0: Waiting
df4b1a529062: Waiting
701ee2460b7b: Waiting
eb1a068e65ec: Waiting
6e0d9e5ad798: Waiting
eb1b9ba93282: Waiting
1d285c8437a7: Waiting
dc9b98c807d3: Pulling fs layer
7ff4524f132b: Waiting
a8f553ea0f6d: Waiting
afcdfd18bdbf: Waiting
68ca29e1bcdf: Waiting
dba24d3c715e: Waiting
6a91ff246ae9: Waiting
2de9d8b1eb38: Pulling fs layer
d67132678720: Waiting
5d4b97c13890: Waiting
0ed778473a88: Waiting
23f022373558: Waiting
2aee1f60734a: Waiting
186fe7ecf6ac: Pulling fs layer
b345bbc26e3a: Waiting
18f3d7afd72b: Waiting
fe8df937055c: Waiting
eb1c692023fa: Waiting
bd5e16956889: Pulling fs layer

docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
Output:

docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
v3.21.11-tf1.15.4-py3: Pulling from nvidia/tao/tao-toolkit-tf
e4ca327ec0e7: Already exists
b99d76492afe: Already exists
b52f6fb756a5: Already exists
5a09baa528d6: Already exists
2df930949a05: Already exists
934eb401e46c: Downloading [===============================================>   ]  959.8MB/1.021GB
3244eb9db036: Download complete
e2e27029eb8e: Downloading [=========================================>         ]  1.002GB/1.22GB
bb65579dd223: Download complete
f286d9ed18b6: Download complete
19be41539f88: Download complete
cefdfdffffa4: Downloading [======================>                            ]  847.4MB/1.9GB
c492d17ef893: Waiting
c6a37e1a8568: Waiting
244a64bffce5: Waiting
936011990b9b: Waiting
a84d68ccf4da: Waiting
3b9afd93de94: Waiting
fecad3989e11: Waiting
80727a1dd7d9: Waiting
ad5b50ee8663: Waiting
f85d966d579f: Waiting
80b7ce251537: Waiting
e6701f6773d4: Waiting
75e85ce3fde9: Waiting
d5204e30c651: Waiting
a5e96cc7d486: Pulling fs layer
2b4a743d384e: Waiting
2966a9405a32: Waiting
a6ee9d853f8b: Waiting
8555f3a80202: Waiting
fa03b5157f19: Waiting
53833e10ff45: Waiting
6eafdc015b75: Waiting
af9f43f64fe3: Waiting
27d6fa02bcfa: Waiting
fdaaa26bc895: Waiting
f76c248c9240: Waiting
a80d890b935f: Waiting
2e6bf771f34a: Waiting
4432665a6ac2: Waiting
91b860113bdd: Pulling fs layer
0c19e315f626: Pulling fs layer
ef3b7019500d: Waiting
08148f98f859: Waiting
3d7433a08679: Waiting
f7e8c22bc1cb: Waiting
606df40bb670: Waiting
a49d4b169dc5: Waiting
3c434869df88: Waiting
505c9a42229c: Pulling fs layer
ba767af43f7a: Waiting
f25dca5730a3: Waiting
a30ffc53d907: Waiting
61f789d0a78e: Waiting
91dc4b49789d: Waiting
c44ac0fe4f98: Waiting
92c240489b34: Waiting

  1. Is there an alternative way to pull only specific tasks?
  2. I’m using detectnet_v2 as an example, so please share a way to pull at least detectnet_v2 alone.

docker login nvcr.io
Output:

docker login nvcr.io
Authenticating with existing credentials...
WARNING! Your password will be stored unencrypted in /home/soundarrajan/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded

According to “tao info --verbose”, the detectnet_v2 network runs in nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3, so that is the image you need.

You can retry the original command
$ tao detectnet_v2 dataset-convert -h
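
The task-to-image lookup described above can be sketched as a small parser over the “tao info --verbose” text. This is only an illustration: the function name `images_by_task` is hypothetical, the sample is an excerpt copied from the output posted earlier in this thread, and the parsing assumes the layout shown there (repository, then tag, then docker_registry and a numbered task list).

```python
import re


def images_by_task(info_text: str) -> dict:
    """Map each task name to the `registry/repository:tag` image listed for it."""
    mapping = {}
    repo = tag = registry = None
    for raw in info_text.splitlines():
        line = raw.strip()
        task = re.match(r"^\d+\.\s+(\S+)$", line)
        if task and repo and tag and registry:
            # A numbered entry such as "1. detectnet_v2" under the current tag.
            mapping[task.group(1)] = f"{registry}/{repo}:{tag}"
        elif line.startswith("docker_registry:"):
            registry = line.split(":", 1)[1].strip()
        elif line.endswith(":") and line not in ("dockers:", "tasks:"):
            name = line[:-1]
            if "/" in name:
                # Repository header, e.g. "nvidia/tao/tao-toolkit-tf:".
                repo, tag, registry = name, None, None
            else:
                # Tag header, e.g. "v3.21.11-tf1.15.4-py3:".
                tag, registry = name, None
    return mapping


# Excerpt of the output posted in this thread (task lists shortened).
sample = """
dockers:
        nvidia/tao/tao-toolkit-tf:
                v3.21.11-tf1.15.5-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. augment
                                2. bpnet
                v3.21.11-tf1.15.4-py3:
                        docker_registry: nvcr.io
                        tasks:
                                1. detectnet_v2
                                2. faster_rcnn
format_version: 2.0
"""

print(images_by_task(sample)["detectnet_v2"])
```

The printed reference is the string to pass to “docker pull” when pulling the image for a single task by hand.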

I tried everything, with no luck.

Nothing is working.

First of all, the tao list command itself does not show the docker list. Instead it shows a timeout error. Kindly refer to the logs shared above.

Can you run the command below and share the result?
$ tao detectnet_v2 -h

:~/detectnet_v2$ tao detectnet_v2 --help
2022-06-02 18:19:45,560 [INFO] root: Registry: ['nvcr.io']
2022-06-02 18:19:45,651 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
2022-06-02 18:19:45,990 [INFO] tlt.components.docker_handler.docker_handler: The required docker doesn't exist locally/the manifest has changed. Pulling a new docker.
2022-06-02 18:19:45,990 [INFO] tlt.components.docker_handler.docker_handler: Pulling the required container. This may take several minutes if you're doing this for the first time. Please wait here.
...
Pulling from repository: nvcr.io/nvidia/tao/tao-toolkit-tf

It has been running since 18:19:45, more than 30 minutes now. It has neither completed nor failed; it just keeps running…

  1. Is there any dependency I’m missing?
  2. I am able to log in to the nvcr.io registry, so why is the container not being pulled?

Output for: docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3

docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
v3.21.11-tf1.15.4-py3: Pulling from nvidia/tao/tao-toolkit-tf
e4ca327ec0e7: Already exists
b99d76492afe: Already exists
b52f6fb756a5: Already exists
5a09baa528d6: Already exists
2df930949a05: Already exists
934eb401e46c: Downloading [===========================================>       ]  882.6MB/1.021GB
3244eb9db036: Download complete
e2e27029eb8e: Downloading [================================>                  ]  794.6MB/1.22GB
bb65579dd223: Download complete
f286d9ed18b6: Download complete
19be41539f88: Download complete
cefdfdffffa4: Downloading [======================>                            ]  858.6MB/1.9GB
c492d17ef893: Waiting
c6a37e1a8568: Waiting
244a64bffce5: Waiting
936011990b9b: Waiting
a84d68ccf4da: Waiting
3b9afd93de94: Waiting
fecad3989e11: Waiting
80727a1dd7d9: Waiting
ad5b50ee8663: Waiting
f85d966d579f: Waiting
80b7ce251537: Waiting
e6701f6773d4: Waiting
75e85ce3fde9: Waiting
d5204e30c651: Waiting
a5e96cc7d486: Waiting
2b4a743d384e: Waiting
2966a9405a32: Waiting
a6ee9d853f8b: Waiting
8555f3a80202: Waiting
fa03b5157f19: Waiting
53833e10ff45: Waiting
6eafdc015b75: Waiting
af9f43f64fe3: Waiting
27d6fa02bcfa: Waiting
fdaaa26bc895: Waiting
f76c248c9240: Waiting
a80d890b935f: Waiting
2e6bf771f34a: Waiting
4432665a6ac2: Waiting
91b860113bdd: Waiting
0c19e315f626: Waiting
ef3b7019500d: Waiting
08148f98f859: Waiting
3d7433a08679: Waiting
f7e8c22bc1cb: Waiting
606df40bb670: Waiting
a49d4b169dc5: Waiting
3c434869df88: Waiting
505c9a42229c: Waiting
ba767af43f7a: Waiting
f25dca5730a3: Waiting
a30ffc53d907: Waiting
61f789d0a78e: Waiting
91dc4b49789d: Waiting
c44ac0fe4f98: Waiting
92c240489b34: Waiting


Please stop it. It is not normal.

Can you pull it successfully?

No, it is not pulling the containers successfully.
The command keeps running without completing or failing.

Latest output: docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3

docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.4-py3
v3.21.11-tf1.15.4-py3: Pulling from nvidia/tao/tao-toolkit-tf
e4ca327ec0e7: Already exists
b99d76492afe: Already exists
b52f6fb756a5: Already exists
5a09baa528d6: Already exists
2df930949a05: Already exists
934eb401e46c: Downloading [===========================================>       ]  882.6MB/1.021GB
3244eb9db036: Download complete
e2e27029eb8e: Downloading [================================>                  ]  794.6MB/1.22GB
bb65579dd223: Download complete
f286d9ed18b6: Download complete
19be41539f88: Download complete
cefdfdffffa4: Downloading [======================>                            ]  858.6MB/1.9GB
c492d17ef893: Waiting
c6a37e1a8568: Waiting
244a64bffce5: Waiting
936011990b9b: Waiting
a84d68ccf4da: Waiting
3b9afd93de94: Waiting
fecad3989e11: Waiting
80727a1dd7d9: Waiting
cefdfdffffa4: Downloading [=>                                                 ]  22.54MB/1.037GB
f85d966d579f: Waiting
80b7ce251537: Waiting
e6701f6773d4: Waiting
75e85ce3fde9: Waiting
d5204e30c651: Waiting
a5e96cc7d486: Waiting
2b4a743d384e: Waiting
2966a9405a32: Waiting
a6ee9d853f8b: Waiting
8555f3a80202: Waiting
fa03b5157f19: Waiting
53833e10ff45: Waiting
6eafdc015b75: Waiting
af9f43f64fe3: Waiting
27d6fa02bcfa: Waiting
fdaaa26bc895: Waiting
f76c248c9240: Waiting
a80d890b935f: Waiting
2e6bf771f34a: Waiting
4432665a6ac2: Waiting
91b860113bdd: Waiting
0c19e315f626: Waiting
ef3b7019500d: Waiting
08148f98f859: Waiting
3d7433a08679: Waiting
f7e8c22bc1cb: Waiting
606df40bb670: Waiting
a49d4b169dc5: Waiting
3c434869df88: Waiting
505c9a42229c: Waiting
ba767af43f7a: Waiting
f25dca5730a3: Waiting
a30ffc53d907: Waiting
61f789d0a78e: Waiting
91dc4b49789d: Waiting
c44ac0fe4f98: Waiting
92c240489b34: Waiting

This is not expected. Please check your disk, network, etc. If possible, try another machine.

Or maybe the network speed is a little slow.
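
For the disk part of that check, a minimal sketch is to look at the free space on the filesystem that holds Docker's data directory, since the layers shown in the pull logs are each around 1 to 2 GB. Two assumptions here: `/var/lib/docker` is Docker's usual default data directory on Linux (“docker info” reports the actual “Docker Root Dir”), and the 30 GiB threshold is an arbitrary placeholder, not a documented TAO requirement.

```python
import os
import shutil

DOCKER_ROOT = "/var/lib/docker"  # assumed default; confirm with `docker info`
NEEDED_GIB = 30.0                # placeholder threshold, not an official figure


def free_gib(path: str) -> float:
    """Free space, in GiB, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path: str, needed_gib: float) -> bool:
    """True if the filesystem holding `path` has at least `needed_gib` free."""
    return free_gib(path) >= needed_gib


# Fall back to the root filesystem if the assumed directory is absent.
check_path = DOCKER_ROOT if os.path.isdir(DOCKER_ROOT) else "/"
print(f"{check_path}: {free_gib(check_path):.1f} GiB free, "
      f"enough for {NEEDED_GIB:.0f} GiB: {has_room(check_path, NEEDED_GIB)}")
```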

If the network were bad, the download should not have started at all, but right now something is being pulled and then gets stuck midway. I am using an x86_64 GPU machine with 32GB RAM.

Is there any other way to download the container manually, instead of using the docker pull command?

Could you try to use another machine to run “docker pull” ?

Hi @Morganh
Yes, I tried on another machine and it worked. I am able to run the tao command now.

I tried the detectnet_v2 dataset_convert sample. It succeeded, but I couldn’t find the converted .tfrecord file.

dataset used: PASCAL VOC 2012 dataset → converted to KITTI format → dataset_convert_config file
data_convert_config_spec.txt (409 Bytes)

command: tao detectnet_v2 dataset_convert -v -d /home/soundarrajan/detectnet_v2/config/data_convert_config_spec.txt -o /home/soundarrajan/detectnet_v2/result --log_file /home/soundarrajan/detectnet_v2/result/dataset_convert_log.txt

Output:
dataset_convert_log.txt (4.3 KB)

.tao_mounts.json file used:
.tao_mounts.json (491 Bytes)

Kindly check whether anything is missing in the config and help me fix it.

For debugging, please log in to the docker container and run the command again there.
Steps:
$ tao detectnet_v2 run /bin/bash

then

# dataset_convert -v -d /home/soundarrajan/detectnet_v2/config/data_convert_config_spec.txt -o /home/soundarrajan/detectnet_v2/result --log_file /home/soundarrajan/detectnet_v2/result/dataset_convert_log.txt

Hi @Morganh ,

I have tried both ways, and the command executed successfully, as you can see in the attached logs. But I am not able to find the converted .tfrecord file in the given output folder -o /home/soundarrajan/detectnet_v2/result

In which path will the converted .tfrecord be saved?
Kindly refer to the config file and the .tao_mounts.json file I attached previously.
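
One thing worth checking, offered as an assumption rather than a confirmed diagnosis: the TAO documentation describes the -o argument of dataset_convert as an output file name, so the converter may treat /home/soundarrajan/detectnet_v2/result as a file-name prefix and write the shards next to the “result” folder instead of inside it, possibly without a .tfrecord extension. The sketch below lists files under both interpretations. The helper `find_shards` is hypothetical, and the shard name in the demo is only illustrative.

```python
import glob
import os
import tempfile


def find_shards(output_arg: str) -> list:
    """List files that dataset_convert may have written for a given -o value."""
    found = set()
    # Interpretation 1: -o is a file-name prefix, so shards sit beside it.
    for path in glob.glob(glob.escape(output_arg.rstrip("/")) + "-*"):
        if os.path.isfile(path):
            found.add(path)
    # Interpretation 2: -o is a directory, so files sit inside it
    # (this also picks up unrelated files such as a log).
    if os.path.isdir(output_arg):
        for name in os.listdir(output_arg):
            path = os.path.join(output_arg, name)
            if os.path.isfile(path):
                found.add(path)
    return sorted(found)


# Demo on a throwaway directory that mimics "an empty result folder".
with tempfile.TemporaryDirectory() as root:
    out = os.path.join(root, "result")
    os.mkdir(out)
    open(out + "-fold-000-of-002-shard-00000-of-00010", "w").close()
    demo = [os.path.relpath(p, root) for p in find_shards(out)]
print(demo)

# On the real machine, pass the same value that was given to -o.
for path in find_shards("/home/soundarrajan/detectnet_v2/result"):
    print(path)
```

The paths are those of the machine where the script runs. If the source and destination entries in ~/.tao_mounts.json differ, the path used inside the container has to be mapped back to the host path first.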