Run TLT inside docker

I want to use the TLT pre-trained models.
I do not have root access to the server; I only have Docker access.
We use a 3090 GPU.
Can I pull the Docker image and train models inside the container?
Do I need the NVIDIA runtime?
Should I use TLT 2 or TLT 3?
Can I install the TLT launcher inside the container, or do I have to use a host machine?

Yes. If you want to run the TLT docker inside another docker, there is an example for reference:
Error when trying to run gazenet notebook - #18 by Morganh

Thanks.
We do not have the NVIDIA runtime. Can we use all GPUs instead?

What do you mean by “We do not have NVIDIA runtime”?
You can install the NVIDIA runtime packages (and their dependencies) after updating the package listing.

$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
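
To check whether the Docker daemon already has the NVIDIA runtime registered, one option is to look at the runtimes that docker info reports. The sketch below is only an illustration: the helper names has_nvidia_runtime and query_docker_runtimes are made up here, and the sample string is a hand-written example of what docker info --format '{{json .Runtimes}}' might print, not output from your server.

```python
import json
import subprocess


def has_nvidia_runtime(runtimes_json):
    """Return True if the JSON object of Docker runtimes has an "nvidia" entry."""
    # Treat an empty or null answer as "no runtimes listed".
    runtimes = json.loads(runtimes_json) or {}
    return "nvidia" in runtimes


def query_docker_runtimes():
    """Ask the local Docker daemon for its registered runtimes (needs docker access)."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Hand-written sample for illustration only (assumed shape, not real output).
sample = '{"nvidia": {"path": "nvidia-container-runtime"}, "runc": {"path": "runc"}}'
print(has_nvidia_runtime(sample))
```

On a machine with docker access, has_nvidia_runtime(query_docker_runtimes()) would run the same check against the real daemon.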

Thank you.
I have two more questions.
If I run TLT v3 inside Docker, should I start the container with this command?

docker run --runtime=nvidia -it -v /workspace/tlt/tlt-experiments:/workspace/tlt-experiments -p 8888:8888 -v /var/run/docker.sock:/var/run/docker.sock nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 /bin/bash

And how do I define ~/.tlt_mounts.json? Is the following correct?

drive_map ={
    "Mounts": [
        {
            "source": "/workspace/tlt/tlt-experiments/yolo_v3/dataset_1_temp",
            "destination": "/workspace/tlt-experiments/data"
        },
        {
            "source": "/workspace/tlt/local_dir",
            "destination": "/workspace/tlt-experiments/results"
        },
        {
            "source": "/workspace/tlt/tlt-experiments/yolo_v3/specs",
            "destination": "/workspace/tlt-experiments/yolo_v3/specs"
        }
    ],
    "Envs": [
        {
            "variable": "CUDA_DEVICE_ORDER",
            "value": "PCI_BUS_ID"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "ports": {
            "8888": 8888
        }
    }
}
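
One detail worth noting about the snippet above: drive_map = is a Python assignment, so it cannot appear in the JSON file itself; the file should contain only the object between the outer braces. A minimal sketch of writing that object out from Python follows. The function name write_mounts_file is hypothetical, and the paths are simply the ones from the post, not verified against any server.

```python
import json
import os

# Same settings as in the post above. "drive_map" is only a Python variable
# name; it is not written into the file.
drive_map = {
    "Mounts": [
        {"source": "/workspace/tlt/tlt-experiments/yolo_v3/dataset_1_temp",
         "destination": "/workspace/tlt-experiments/data"},
        {"source": "/workspace/tlt/local_dir",
         "destination": "/workspace/tlt-experiments/results"},
        {"source": "/workspace/tlt/tlt-experiments/yolo_v3/specs",
         "destination": "/workspace/tlt-experiments/yolo_v3/specs"},
    ],
    "Envs": [{"variable": "CUDA_DEVICE_ORDER", "value": "PCI_BUS_ID"}],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {"memlock": -1, "stack": 67108864},
        "user": "1000:1000",
        "ports": {"8888": 8888},
    },
}


def write_mounts_file(config, path):
    """Serialize the launcher configuration as plain JSON at the given path."""
    with open(path, "w") as f:
        json.dump(config, f, indent=4)


# Typical call (left commented out so that nothing is written by accident):
# write_mounts_file(drive_map, os.path.expanduser("~/.tlt_mounts.json"))
```

The resulting file starts with { and parses with any JSON reader, which is a quick way to confirm that the format is valid.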

The approach above is similar to TLT 2.0. You can run an interactive session this way.

In TLT 3.0-dp or TLT 3.0, however, you normally install the tlt-launcher. See the TLT Launcher section of the Transfer Learning Toolkit 3.0 documentation. You can then run tlt detectnet_v2 train xxx without an interactive session.

Alternatively, use the approach described in that user guide:

Once you are inside the interactive session, you may run the command task and its associated subtask by calling the <task> <subtask> <cli_args> commands without the tlt prefix.

For example, to train a detectnet_v2 model in the interactive session, run the following command after invoking an interactive session using tlt detectnet_v2

Yes, I want to use TLT 3.0 the same way as TLT 2.0. I used the docker run command above, then installed the launcher inside the container. But when I run

!tlt yolo_v3 train -e $SPECS_DIR/yolo_v3_train_resnet18_kitti.txt \
                   -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                   -k $KEY \
                   --gpus 1

I get this error:

2021-06-28 08:44:47,368 [INFO] root: Registry: ['nvcr.io']
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 734, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 980, in send
    self.connect()
  File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
    sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 205, in _retrieve_server_version
    return self.version(api_version=False)["ApiVersion"]
  File "/usr/local/lib/python3.6/dist-packages/docker/api/daemon.py", line 181, in version
    return self._result(self._get(url), json=True)
  File "/usr/local/lib/python3.6/dist-packages/docker/utils/decorators.py", line 46, in inner
    return f(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 228, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 543, in get
    return self.request('GET', url, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 498, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/tlt", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/tlt/entrypoint/entrypoint.py", line 114, in main
    args[1:]
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 259, in launch_command
    docker_handler = self.handler_map[
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 114, in handler_map
    docker_mount_file=os.getenv("LAUNCHER_MOUNTS", DOCKER_MOUNT_FILE)
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/docker_handler/docker_handler.py", line 47, in __init__
    self._docker_client = docker.from_env()
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 85, in from_env
    timeout=timeout, version=version, **kwargs_from_env(**kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 40, in __init__
    self.api = APIClient(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 188, in __init__
    self._version = self._retrieve_server_version()
  File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 213, in _retrieve_server_version
    'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))

When you run the command below,
docker run --runtime=nvidia -it -v /workspace/tlt/tlt-experiments:/workspace/tlt-experiments -p 8888:8888 -v /var/run/docker.sock:/var/run/docker.sock nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 /bin/bash

You are already inside the TLT 3.0 container, so you do not need the tlt launcher there.
You can run the task directly, without the tlt prefix:
# yolo_v3 train xxx
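
The traceback above ends in docker.from_env() failing on sock.connect with "Connection refused", which suggests the launcher could not reach a Docker daemon through the Unix socket from inside the container. A minimal sketch of the same connectivity check, using only the Python standard library, is shown here. The function name docker_socket_reachable is hypothetical, and /var/run/docker.sock is the path taken from the docker run command in this thread.

```python
import socket


def docker_socket_reachable(path="/var/run/docker.sock", timeout=2.0):
    """Return True if something accepts connections on the Unix socket at path."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, a refused connection and a timeout.
        return False
    finally:
        sock.close()


print(docker_socket_reachable())
```

If this prints False inside the container, the tlt launcher will likely fail in the same way, and calling yolo_v3 train directly, as suggested, avoids the launcher altogether.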

Thank you.
My problem was solved.
