TLT: how to make sure tlt talks to Docker through the correct host IP


• Hardware RTX 2080ti
• Network Type MaskRCNN
• TLT Version docker_tag: v3.0-py3
• Training spec file - the example resnet50 spec file
• How to reproduce the issue ?

I’m trying

!tlt mask_rcnn train -e $SPECS_DIR/maskrcnn_train_resnet50.txt -d $USER_EXPERIMENT_DIR/experiment_dir_unpruned -k $KEY --gpus 1

and other tlt commands from within the maskrcnn.ipynb that comes with the samples. But I get

docker.errors.DockerException: Error while fetching server API version: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)

My Docker daemon is set up with TLS verification and must be accessed through the host IP address, not localhost or 0.0.0.0; otherwise the certificate check fails. I think that is the problem, since I see host='localhost' in the error above.

My user bashrc sets the environment variables
DOCKER_TLS_VERIFY=1
DOCKER_HOST=tcp://<>:2376
and I have the requisite ca.pem and cert.pem in my ~/.docker directory.
Plain docker commands work fine.
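A minimal, stdlib-only sketch for checking that the notebook kernel sees the same settings as the shell. The helper name `describe_docker_env` is hypothetical, and it assumes the certificates live in `DOCKER_CERT_PATH` or, failing that, in `~/.docker`, with a `key.pem` next to `ca.pem` and `cert.pem`. It only reads the environment and the file system; it does not contact the daemon.

```python
import os

CERT_FILES = ("ca.pem", "cert.pem", "key.pem")


def describe_docker_env(environ=None, home=None):
    """Summarise the Docker client settings visible to this process.

    Hypothetical helper for illustration: it reads environment variables
    and checks that the certificate files exist, nothing more.
    """
    environ = os.environ if environ is None else environ
    home = os.path.expanduser("~") if home is None else home
    # Assumption: certificates live in DOCKER_CERT_PATH, else in ~/.docker.
    cert_dir = environ.get("DOCKER_CERT_PATH") or os.path.join(home, ".docker")
    missing = [name for name in CERT_FILES
               if not os.path.isfile(os.path.join(cert_dir, name))]
    return {
        "DOCKER_HOST": environ.get("DOCKER_HOST"),
        "DOCKER_TLS_VERIFY": environ.get("DOCKER_TLS_VERIFY"),
        "cert_dir": cert_dir,
        "missing_cert_files": missing,
    }


if __name__ == "__main__":
    for key, value in describe_docker_env().items():
        print(f"{key}: {value}")
```

Running this once in a terminal and once in a notebook cell should show whether Jupyter was started with the same variables as the shell where docker commands work.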

How should I fix up the tlt config?

Thanks

Can you run the command below successfully?
$ tlt info --verbose

Yes - I get
$ tlt info --verbose
Configuration of the TLT Instance

dockers:
    nvidia/tlt-streamanalytics:
        docker_registry: nvcr.io
        docker_tag: v3.0-py3
        tasks:
            1. augment
            2. bpnet
            3. classification
            4. detectnet_v2
            5. dssd
            6. emotionnet
            7. faster_rcnn
            8. fpenet
            9. gazenet
            10. gesturenet
            11. heartratenet
            12. lprnet
            13. mask_rcnn
            14. multitask_classification
            15. retinanet
            16. ssd
            17. unet
            18. yolo_v3
            19. yolo_v4
            20. tlt-converter
    nvidia/tlt-pytorch:
        docker_registry: nvcr.io
        docker_tag: v3.0-py3
        tasks:
            1. speech_to_text
            2. speech_to_text_citrinet
            3. text_classification
            4. question_answering
            5. token_classification
            6. intent_slot_classification
            7. punctuation_and_capitalization
format_version: 1.0
tlt_version: 3.0
published_date: 04/16/2021

To double check, I reran the line in the jupyter notebook:

!tlt mask_rcnn run bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR

and it also dies with:

2021-08-04 10:54:01,150 [INFO] root: Registry: ['nvcr.io']
Traceback (most recent call last):
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/urllib3/connectionpool.py", line 445, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/urllib3/connectionpool.py", line 440, in _make_request
    httplib_response = conn.getresponse()
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/http/client.py", line 1349, in getresponse
    response.begin()
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/http/client.py", line 316, in begin
    version, status, reason = self._read_status()
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/http/client.py", line 277, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/urllib3/util/retry.py", line 532, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/urllib3/connectionpool.py", line 447, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/docker/api/client.py", line 205, in _retrieve_server_version
    return self.version(api_version=False)["ApiVersion"]
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/docker/api/daemon.py", line 181, in version
    return self._result(self._get(url), json=True)
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/docker/utils/decorators.py", line 46, in inner
    return f(self, *args, **kwargs)
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/docker/api/client.py", line 228, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/requests/sessions.py", line 555, in get
    return self.request('GET', url, **kwargs)
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/aritomo/anaconda3/envs/tlt/bin/tlt", line 8, in <module>
    sys.exit(main())
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/tlt/entrypoint/entrypoint.py", line 112, in main
    local_instance.launch_command(
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/tlt/components/instance_handler/local_instance.py", line 259, in launch_command
    docker_handler = self.handler_map[
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/tlt/components/instance_handler/local_instance.py", line 109, in handler_map
    handler_map[map_val.docker_image] = DockerHandler(
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/tlt/components/docker_handler/docker_handler.py", line 48, in __init__
    self._api_client = docker.APIClient(base_url=docker_env_path)
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/docker/api/client.py", line 188, in __init__
    self._version = self._retrieve_server_version()
  File "/home/aritomo/anaconda3/envs/tlt/lib/python3.9/site-packages/docker/api/client.py", line 212, in _retrieve_server_version
    raise DockerException(
docker.errors.DockerException: Error while fetching server API version: UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=60)

I ran the script directly in a native shell:

bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR

and the download executed and created the tfrecord files correctly.

It appears it’s not the command itself but something in tlt that misbehaves when talking to Docker.

Thanks

I looked at docker_handler.py, and it clearly uses unix://var/run/docker.sock, which I believe my current security setup blocks: my docker commands go through a fixed host IP with an associated set of certificates (ca.pem etc.), even on my local machine.

I double-checked local_instance.py: it creates a DockerHandler but does not seem to pass any special Docker path, so the default in docker_handler.py is used.

I believe DEFAULT_DOCKER_PATH might need to be changed to tcp://hostip:port?

The Low-level API page of the Docker SDK for Python 5.0.3 documentation appears to indicate a URL of that kind instead of the unix socket device.
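A rough sketch of that reasoning, using only the standard library. The `UnixHTTPConnectionPool(host='localhost', port=None)` text in the error suggests the client is on the unix socket transport rather than attempting TCP to localhost; as far as I can tell, 'localhost' is just the placeholder host name the SDK reports for socket connections. The helper `transport_for` is hypothetical, and 192.0.2.10 is a documentation placeholder address, not my host.

```python
from urllib.parse import urlparse


def transport_for(base_url):
    """Return a rough description of the transport a Docker base_url implies.

    Hypothetical helper for illustration; it is not part of docker-py or TLT.
    """
    scheme = urlparse(base_url).scheme
    if scheme in ("unix", "http+unix"):
        return "unix socket"
    if scheme in ("tcp", "http", "https"):
        return "tcp"
    return "unknown"


if __name__ == "__main__":
    # The default path mentioned above versus a TLS endpoint of the form I need.
    print(transport_for("unix://var/run/docker.sock"))
    print(transport_for("tcp://192.0.2.10:2376"))
```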

Please try the following.
As mentioned in Error when trying to run gazenet notebook - #18 by Morganh, please add
-v /var/run/docker.sock:/var/run/docker.sock

No change

Can you run the command below in a terminal instead of the Jupyter notebook?
$ tlt mask_rcnn run ls

or
$ tlt mask_rcnn run /bin/bash

The tlt command hangs on every run command on my system with TLS-secured Docker. Apparently the way tlt uses the Docker client via docker.APIClient() is not fully ready for TLS: it waits for the daemon on the unix socket, which cannot work in my setup with DOCKER_TLS_VERIFY.

tlt has a bug when one uses a TLS-secured system with the environment variables DOCKER_HOST, DOCKER_TLS_VERIFY=1, and DOCKER_CERT_PATH set.

In the file:
tlt/components/docker_handler/docker_handler.py

The DockerHandler class creates two different clients.

The first is self._docker_client = docker.from_env()

This one works fine: on inspection of the Docker SDK code, from_env() reads the environment variables and sets up TLSConfig correctly.

The second is self._api_client = docker.APIClient(). Right now, nothing sets up a TLSConfig for it via the tls argument.

APIClient() is low level and does not have the fetch/test/configure logic for the environment variables that is built into docker.from_env().
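To make the difference concrete, here is a simplified, stdlib-only emulation of the environment lookup that from_env() relies on, based on my reading of docker/utils/utils.py. It is a sketch, not the SDK's code: the function name `docker_params_from_env` is hypothetical, the real SDK builds a TLSConfig object where this returns a plain dict, and details may differ between SDK versions.

```python
import os


def docker_params_from_env(environ=None):
    """Simplified emulation of the environment lookup behind docker.from_env().

    Returns 'base_url' when DOCKER_HOST is set and a plain-dict 'tls'
    description when TLS is enabled (the real SDK uses a TLSConfig object).
    """
    environ = os.environ if environ is None else environ
    params = {}
    host = environ.get("DOCKER_HOST")
    if host:
        params["base_url"] = host
    cert_path = environ.get("DOCKER_CERT_PATH") or None
    # An empty DOCKER_TLS_VERIFY counts as off; any other set value as on.
    tls_verify = environ.get("DOCKER_TLS_VERIFY") not in (None, "")
    if cert_path or tls_verify:
        # Assumption: fall back to ~/.docker when no cert path is given.
        cert_path = cert_path or os.path.join(os.path.expanduser("~"), ".docker")
        params["tls"] = {
            "client_cert": (os.path.join(cert_path, "cert.pem"),
                            os.path.join(cert_path, "key.pem")),
            "ca_cert": os.path.join(cert_path, "ca.pem"),
            "verify": tls_verify,
        }
    return params


if __name__ == "__main__":
    print(docker_params_from_env())
```

A bare APIClient(base_url=...) call skips all of this, which would explain why only the second client ignores the TLS variables.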

So I rewrote this portion of code:

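    # Read DOCKER_HOST / DOCKER_TLS_VERIFY / DOCKER_CERT_PATH, as docker.from_env() does.
    params = docker.utils.utils.kwargs_from_env()
    # Prefer DOCKER_HOST over the default socket path when it is set.
    if 'base_url' in params:
        docker_env_path = params['base_url']
    # 'tls' is only present when TLS is enabled in the environment.
    self._api_client = docker.APIClient(base_url=docker_env_path, tls=params.get('tls', False))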

and now TLT does not hang. This patch is a hack: it has had no testing beyond my own system, and little protection against less-than-happy paths.
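A slightly more defensive variant could look like the sketch below, which also covers systems where DOCKER_HOST or the TLS variables are not set. The helper names `api_client_kwargs` and `make_api_client` are hypothetical, the default socket path is the one I saw in docker_handler.py, and I have not tested this inside TLT itself.

```python
def api_client_kwargs(env_params, default_base_url="unix://var/run/docker.sock"):
    """Build keyword arguments for docker.APIClient from environment params.

    env_params is the dict returned by docker.utils.kwargs_from_env();
    'base_url' and 'tls' may each be absent from it.
    """
    kwargs = {"base_url": env_params.get("base_url", default_base_url)}
    if "tls" in env_params:
        # Only pass tls when the environment actually enabled it.
        kwargs["tls"] = env_params["tls"]
    return kwargs


def make_api_client(default_base_url="unix://var/run/docker.sock"):
    """Create an APIClient that honours DOCKER_HOST and the TLS variables."""
    import docker  # imported lazily so the helper above works without the SDK
    env_params = docker.utils.kwargs_from_env()
    return docker.APIClient(**api_client_kwargs(env_params, default_base_url))
```

Keeping the dictionary logic in its own function means the fallback behaviour can be checked without a running daemon.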

The documentation around this is less than complete, with only hints on correct usage, so I read the actual Docker Python SDK code for usage pointers, namely docker/utils/utils.py.

Some hints are in Low-level API — Docker SDK for Python 5.0.3 documentation and Using TLS - docker-py Documentation
