(help needed) running tlt training with singularity

Hello all,
I’m new to TLT, and facing some difficulties to run training within Singularity, hope you might be able to help

my setup info:
• Hardware: GPU cluster contains 4 GPUs Tesla V100
• Network Type: Yolo_v4
• TLT Version: tlt_version: 3.0 and docker_tag: v3.0-py3

I was able to run the pre-flight check with no problem:
singularity run --nv docker://nvcr.io/hpc/preflightcheck:20.11
INFO: Creating SIF file…
INFO: The NVIDIA Driver was detected.
INFO: NVRM version: NVIDIA UNIX x86_64 Kernel Module 460.73.01 Thu Apr 1 21:40:36 UTC 2021
INFO: Found CUDA driver library: /.singularity.d/libs/libcuda.so.1
INFO: Latest CUDA supported version: 11020
INFO: Number of GPUs detected: 4
INFO: Detected Mellanox OFED version 5.0-0
WARNING: Unable to determine nv_peer_mem version
INFO: Number of InfiniBand devices detected: 1

• my spec file is almost the same as the sample file
yolo_v4_train_resnet18_kitti.txt (2.1 KB)

• i’m also able to build the singularity image from the following def-file:
tlt_image_test.def (1.4 KB)

• How to reproduce the issue:
the issue is when running the singularity image i get the following error:

singularity run --nv --bind ${LOCAL_DATA}/datasets/Kitti_Vision_OD2012/data:/workspace/tlt-experiments/data ./tlt_image_test.sif 
Running from within singularity now
2021-06-23 15:22:49,726 [INFO] root: Registry: ['nvcr.io']
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 205, in _retrieve_server_version
    return self.version(api_version=False)["ApiVersion"]
  File "/usr/local/lib/python3.6/dist-packages/docker/api/daemon.py", line 181, in version
    return self._result(self._get(url), json=True)
  File "/usr/local/lib/python3.6/dist-packages/docker/utils/decorators.py", line 46, in inner
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 228, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/tlt", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/tlt/entrypoint/entrypoint.py", line 114, in main
    args[1:]
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 259, in launch_command
    docker_handler = self.handler_map[
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 114, in handler_map
    docker_mount_file=os.getenv("LAUNCHER_MOUNTS", DOCKER_MOUNT_FILE)
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/docker_handler/docker_handler.py", line 47, in __init__
    self._docker_client = docker.from_env()
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 85, in from_env
    timeout=timeout, version=version, **kwargs_from_env(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 40, in __init__
    self.api = APIClient(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 188, in __init__
    self._version = self._retrieve_server_version()
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 213, in _retrieve_server_version
    'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
[dawoud@vca-gpu-211-01 tlt-gpuc]$ singularity run --nv --bind ${LOCAL_DATA}/datasets/Kitti_Vision_OD2012/data:/workspace/tlt-experiments/data ./tlt_image_test.sif 
WARNING! Using --password via the CLI is insecure. Use --password-stdin.
WARNING! Your password will be stored unencrypted in /home/fe/dawoud/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
Running from within singularity now
2021-06-23 15:24:07,389 [INFO] root: Registry: ['nvcr.io']
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 205, in _retrieve_server_version
    return self.version(api_version=False)["ApiVersion"]
  File "/usr/local/lib/python3.6/dist-packages/docker/api/daemon.py", line 181, in version
    return self._result(self._get(url), json=True)
  File "/usr/local/lib/python3.6/dist-packages/docker/utils/decorators.py", line 46, in inner
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 228, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/tlt", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/tlt/entrypoint/entrypoint.py", line 114, in main
    args[1:]
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 259, in launch_command
    docker_handler = self.handler_map[
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 114, in handler_map
    docker_mount_file=os.getenv("LAUNCHER_MOUNTS", DOCKER_MOUNT_FILE)
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/docker_handler/docker_handler.py", line 47, in __init__
    self._docker_client = docker.from_env()
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 85, in from_env
    timeout=timeout, version=version, **kwargs_from_env(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 40, in __init__
    self.api = APIClient(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 188, in __init__
    self._version = self._retrieve_server_version()
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 213, in _retrieve_server_version
    'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

i know that there is no docker daemon running inside my container nor on the cluster (if there is a workaround it that would be great?) but even without the docker login i get the same error but earlier :

Error while fetching server API version: {0}'.format(e) docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

Please refer to chapter 7 of https://docs.nvidia.com/ngc/pdf/NGC-User-Guide.pdf
Then, when Interactive Shell with Singularity is ready, try to run following command.
yolo_v4 train -e xxx.txt -k yourkey

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.