I want to use tlt pre-trained models.
I do not have root access to the server and only have docker access.
We use a 3090 GPU.
Can I pull the docker and train models inside docker?
Should I have NVIDIA runtime?
Should I use tlt2 or tlt3?
Can I install the tlt launcher inside the docker, or do I have to use a host machine?
Yes. If you want to run the tlt docker inside another docker, there is an example for reference:
Error when trying to run gazenet notebook - #18 by Morganh
thanks.
We do not have the NVIDIA runtime. Can we use all GPUs instead?
What do you mean by “We do not have NVIDIA runtime”?
You can install the NVIDIA runtime packages (and their dependencies) after updating the package listing.
$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
thank you.
I have two more questions.
If I run tlt v3 inside docker, should I run docker using this command?
docker run --runtime=nvidia -it -v /workspace/tlt/tlt-experiments:/workspace/tlt-experiments -p 8888:8888 -v /var/run/docker.sock:/var/run/docker.sock nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 /bin/bash
And how do I define ~/.tlt_mounts.json? Is the following correct?
drive_map = {
    "Mounts": [
        {
            "source": "/workspace/tlt/tlt-experiments/yolo_v3/dataset_1_temp",
            "destination": "/workspace/tlt-experiments/data"
        },
        {
            "source": "/workspace/tlt/local_dir",
            "destination": "/workspace/tlt-experiments/results"
        },
        {
            "source": "/workspace/tlt/tlt-experiments/yolo_v3/specs",
            "destination": "/workspace/tlt-experiments/yolo_v3/specs"
        }
    ],
    "Envs": [
        {
            "variable": "CUDA_DEVICE_ORDER",
            "value": "PCI_BUS_ID"
        }
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {
            "memlock": -1,
            "stack": 67108864
        },
        "user": "1000:1000",
        "ports": {
            "8888": 8888
        }
    }
}
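As a rough sketch of how a mapping like the one above could end up in ~/.tlt_mounts.json: the file itself is plain JSON, so the `drive_map =` assignment belongs in Python (for example a notebook cell), and only the serialized dictionary is written to disk. The helper name `write_mounts_file` is hypothetical and not part of the TLT launcher; the dictionary is a shortened copy of the values posted in this thread.

```python
import json
import os


def write_mounts_file(drive_map, path=None):
    """Write the launcher mount configuration as plain JSON.

    `write_mounts_file` is a hypothetical helper for illustration only.
    If no path is given, the file goes to ~/.tlt_mounts.json.
    """
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".tlt_mounts.json")
    with open(path, "w") as mounts_file:
        json.dump(drive_map, mounts_file, indent=4)
    return path


# Shortened copy of the mapping from this thread (paths are the poster's own).
drive_map = {
    "Mounts": [
        {
            "source": "/workspace/tlt/tlt-experiments/yolo_v3/dataset_1_temp",
            "destination": "/workspace/tlt-experiments/data",
        },
        {
            "source": "/workspace/tlt/local_dir",
            "destination": "/workspace/tlt-experiments/results",
        },
    ],
    "DockerOptions": {
        "shm_size": "16G",
        "ulimits": {"memlock": -1, "stack": 67108864},
        "user": "1000:1000",
        "ports": {"8888": 8888},
    },
}

# Usage (writes to the home directory, so it is left commented out here):
# write_mounts_file(drive_map)
```

Note that the "source" paths are resolved by the Docker daemon that the launcher talks to, so they need to exist on the machine where that daemon runs.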
The approach above is similar to TLT 2.0. It is possible to run an interactive session this way.
But in TLT 3.0-dp or TLT 3.0, you normally install the tlt-launcher. See https://docs.nvidia.com/tlt/tlt-user-guide/text/tlt_launcher.html. Then you can usually run tlt detectnet_v2 train xxx without an interactive session.
Or use the approach below, as described in the user guide linked above:
Once you are inside the interactive session, you may run the command task and its associated subtask by calling the <task> <subtask> <cli_args> commands without the tlt prefix. For example, to train a detectnet_v2 model in the interactive session, run the following command after invoking an interactive session using tlt detectnet_v2
Yes, I want to use TLT 3.0 the same way as TLT 2.0. I used the docker run command above, then installed the launcher inside the container. But when I run
!tlt yolo_v3 train -e $SPECS_DIR/yolo_v3_train_resnet18_kitti.txt \
-r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
-k $KEY \
--gpus 1
I get this error:
2021-06-28 08:44:47,368 [INFO] root: Registry: ['nvcr.io']
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3.6/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
self.send(msg)
File "/usr/lib/python3.6/http/client.py", line 980, in send
self.connect()
File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
sock.connect(self.unix_socket)
ConnectionRefusedError: [Errno 111] Connection refused
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 449, in send
timeout=timeout
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 727, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py", line 403, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py", line 734, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 677, in urlopen
chunked=chunked,
File "/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py", line 392, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/lib/python3.6/http/client.py", line 1281, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
self.send(msg)
File "/usr/lib/python3.6/http/client.py", line 980, in send
self.connect()
File "/usr/local/lib/python3.6/dist-packages/docker/transport/unixconn.py", line 43, in connect
sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 205, in _retrieve_server_version
return self.version(api_version=False)["ApiVersion"]
File "/usr/local/lib/python3.6/dist-packages/docker/api/daemon.py", line 181, in version
return self._result(self._get(url), json=True)
File "/usr/local/lib/python3.6/dist-packages/docker/utils/decorators.py", line 46, in inner
return f(self, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 228, in _get
return self.get(url, **self._set_request_timeout(kwargs))
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 543, in get
return self.request('GET', url, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/bin/tlt", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.6/dist-packages/tlt/entrypoint/entrypoint.py", line 114, in main
args[1:]
File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 259, in launch_command
docker_handler = self.handler_map[
File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 114, in handler_map
docker_mount_file=os.getenv("LAUNCHER_MOUNTS", DOCKER_MOUNT_FILE)
File "/usr/local/lib/python3.6/dist-packages/tlt/components/docker_handler/docker_handler.py", line 47, in __init__
self._docker_client = docker.from_env()
File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 85, in from_env
timeout=timeout, version=version, **kwargs_from_env(**kwargs)
File "/usr/local/lib/python3.6/dist-packages/docker/client.py", line 40, in __init__
self.api = APIClient(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 188, in __init__
self._version = self._retrieve_server_version()
File "/usr/local/lib/python3.6/dist-packages/docker/api/client.py", line 213, in _retrieve_server_version
'Error while fetching server API version: {0}'.format(e)
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', ConnectionRefusedError(111, 'Connection refused'))
When you run the command below,
docker run --runtime=nvidia -it -v /workspace/tlt/tlt-experiments:/workspace/tlt-experiments -p 8888:8888 -v /var/run/docker.sock:/var/run/docker.sock nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 /bin/bash
you are already inside the TLT 3.0 docker.
You can directly run the task without the tlt prefix:
# yolo_v3 train xxx
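For context, the traceback above ends in docker.from_env() failing with "Connection refused" on a Unix socket, which suggests the launcher could not reach a Docker daemon from inside the container. A minimal sketch for checking this with only the standard library is below. The function name `docker_socket_reachable` is hypothetical, and the default path is simply the socket mounted in the docker run command from this thread.

```python
import socket


def docker_socket_reachable(path="/var/run/docker.sock", timeout=2.0):
    """Return True if something accepts connections on the Unix socket at `path`.

    Hypothetical helper for illustration; it only tests connectivity and
    does not speak the Docker API.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return True
    except OSError:
        # Covers a missing socket file, permission problems,
        # and ConnectionRefusedError (errno 111) as seen in the traceback.
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    if docker_socket_reachable():
        print("Docker socket is reachable; the tlt launcher may work here.")
    else:
        print("Docker socket is not reachable; run the task directly, "
              "for example: yolo_v3 train ...")
```

If the check fails inside the TLT container, calling the task entrypoint directly (as suggested above) avoids the launcher and its Docker dependency altogether.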
Thank you.
My problem was solved.