Question about running NGC for TLT3.0

@Morganh

Following the release of TLT 3.0, I tried to pull the latest container as shown here: NVIDIA NGC - tlt-streamanalytics

Since I don’t have any NVIDIA-powered hardware, I am running it on GCP. I ran the same command that I used for TLT 2.0 on GCP:
docker run --runtime=nvidia -it -v /home/<username>/tlt-experiments:/workspace/tlt-experiments nvcr.io/nvidia/tlt-streamanalytics:<version> /bin/bash

It repeatedly failed with an error saying that CUDA 11.1 or later is required. The GCP machine had CUDA 11.0, and when I tried updating it to 11.2 per this doc (Installation Guide Linux :: CUDA Toolkit Documentation), it asked me to update the cuda-driver to 460 or later.

The GCP instance I am running had cuda-driver-450.xx, and when I tried updating it, it asked for nvidia-driver to be updated as well. This is what I used to deploy the image: Google Cloud Platform
Long story short, I could not complete the update, and I could not find an image on the GCP Marketplace that ships the right driver for this container (NVIDIA NGC - tlt-streamanalytics).

If I don’t have NVIDIA powered hardware, do you have any recommendation or tutorial on how to run it on the cloud? It doesn’t have to be GCP. It could also be AWS.

I also tried following the TLT Launcher steps, as recommended in the doc (the TLT Launcher page of the Transfer Learning Toolkit 3.0 documentation), but I could not get that to work either. I couldn’t install virtualenv or virtualenvwrapper as a regular user. When I finally managed to install them with sudo, tlt --help worked, but tlt detectnet_v2 --help threw errors like this:

(launcher) root@nvidia-gpu-cloud-image-3-vm:/home/a428tm# tlt detectnet_v2 --help
Traceback (most recent call last):
  File "/root/.virtualenvs/launcher/bin/tlt", line 8, in
    sys.exit(main())
  File "/root/.virtualenvs/launcher/lib/python3.6/site-packages/tlt/entrypoint/entrypoint.py", line 114, in main
    args[1:]
  File "/root/.virtualenvs/launcher/lib/python3.6/site-packages/tlt/components/instance_handler/local_instance.py", line 262, in launch_command
    docker_logged_in()
  File "/root/.virtualenvs/launcher/lib/python3.6/site-packages/tlt/components/instance_handler/utils.py", line 129, in docker_logged_in
    data = load_config_file(docker_config)
  File "/root/.virtualenvs/launcher/lib/python3.6/site-packages/tlt/components/instance_handler/utils.py", line 66, in load_config_file
    "No file found at: {}".format(config_path)
AssertionError: Config path must be a valid unix path. No file found at: /root/.docker/config.json

I am stuck on the first few steps of this, and I would appreciate any pointers.

Thank you

Actually, it is not necessary to be a sudo user. Do you mean you installed virtualenv or virtualenvwrapper as a sudo user?

Also, can you double-check the Requirements and Installation page of the Transfer Learning Toolkit 3.0 documentation? It says:

If you have followed the default installation instructions for docker-ce you may need to have sudo access to run docker commands. In order to circumvent this, TLT recommends you to follow these post-installation steps to make sure that the docker commands can be run without sudo

TLT 3.0 needs NVIDIA GPU driver v455.xx or above. Please check whether the GCP image provides it.
See Requirements and Installation — Transfer Learning Toolkit 3.0 documentation

Install NVIDIA GPU driver v455.xx or above.
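
For anyone who wants to check this before pulling the container, here is a minimal Python sketch of the driver check. It is not part of TLT: the helper names (`parse_major`, `meets_minimum`, `installed_driver_version`) are made up for illustration, the 455 threshold comes from the requirement above, and the last helper assumes `nvidia-smi` is on the PATH.

```python
import subprocess

MIN_DRIVER_MAJOR = 455  # minimum stated in the TLT 3.0 requirement quoted above


def parse_major(version: str) -> int:
    """Return the major part of a driver version string such as '450.51.06'."""
    return int(version.strip().split(".")[0])


def meets_minimum(version: str, minimum: int = MIN_DRIVER_MAJOR) -> bool:
    """True if the driver's major version is at least `minimum`."""
    return parse_major(version) >= minimum


def installed_driver_version() -> str:
    """Ask nvidia-smi for the driver version of the first GPU.

    Assumption: nvidia-smi is installed and on the PATH.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # works on Python 3.6, as used in this thread
    )
    return result.stdout.splitlines()[0].strip()
```

Calling `meets_minimum(installed_driver_version())` on the VM tells you whether the driver is new enough; a 450.xx driver fails this check and a 460.xx driver passes it.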

@Morganh
Thanks for the quick response.

The GCP instance I am using has an NVIDIA Tesla P100 GPU.
And no, it had driver v450 (see first image), so I went ahead and updated to 460 by downloading the run file from here (Download NVIDIA, GeForce, Quadro, and Tesla Drivers) and then following these steps to install it (NVIDIA Driver Installation Quickstart Guide :: NVIDIA Tesla Documentation). Please see the second image for verification of the 460 installation.

Even then, when I run
tlt detectnet_v2 --help

it throws this error:

(launcher) root@nvidia-gpu-cloud-image-3-vm:/home/jae# tlt detectnet_v2 --help

Traceback (most recent call last):
  File "/root/.virtualenvs/launcher/bin/tlt", line 8, in
    sys.exit(main())
  File "/root/.virtualenvs/launcher/lib/python3.6/site-packages/tlt/entrypoint/entrypoint.py", line 114, in main
    args[1:]
  File "/root/.virtualenvs/launcher/lib/python3.6/site-packages/tlt/components/instance_handler/local_instance.py", line 262, in launch_command
    docker_logged_in()
  File "/root/.virtualenvs/launcher/lib/python3.6/site-packages/tlt/components/instance_handler/utils.py", line 129, in docker_logged_in
    data = load_config_file(docker_config)
  File "/root/.virtualenvs/launcher/lib/python3.6/site-packages/tlt/components/instance_handler/utils.py", line 66, in load_config_file
    "No file found at: {}".format(config_path)
AssertionError: Config path must be a valid unix path. No file found at: /root/.docker/config.json

I tried the exact same thing in a Jupyter notebook, but the result is the same. Based on the third image, the installation seems to be fine, but the error shows up when I run the command. Please see the fourth image.

@Morganh
Do you happen to know if TLT 3.0 has been tested in any cloud environment yet?

Can you try a non-root user? Apparently the file above is not available.

For the Docker setup, please note the part below in the 3.0 user guide:
https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/requirements_and_installation.html#installation-prerequisites

If you have followed the default installation instructions for docker-ce you may need to have sudo access to run docker commands. In order to circumvent this, TLT recommends you to follow these post-installation steps to make sure that the docker commands can be run without sudo.

@Morganh

I actually had to do a couple of different things.

  1. Carefully read through the doc. I saw that P-series GPUs are not compatible, so I spun up a new instance with a V-series GPU.
  2. It may just be me, but I couldn’t install virtualenv on GCP without sudo, so I went ahead and used sudo.
  3. I also had to do the NVIDIA docker login as sudo, which got rid of the issue I was facing (no file found).

For now, this is solved. Thank you for the quick response.

@a428tm
One extra question about your last comment: do you mean that on GCP you must use sudo to install virtualenv?

@Morganh

Correct. When I tried running it as a regular user, it had trouble finding the right path for python3, which prevented me from installing virtualenvwrapper.

I tried the official tutorial and other people’s recommendations, but without success. The only way to make it work was to install as the root user.

Do you have any logs?

For AssertionError: Config path must be a valid unix path. No file found at: /root/.docker/config.json, please consider the solution below.
You need to run docker login nvcr.io

I reproduced this issue when I tried to run TLT 3.0 inside a Docker container (see Error when trying to run gazenet notebook - #18 by Morganh), then fixed it as described above. See the log below.

root@5b8d6f41d8c8:/workspace# tlt ssd run ls
Traceback (most recent call last):
  File "/usr/local/bin/tlt", line 8, in
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/tlt/entrypoint/entrypoint.py", line 114, in main
    args[1:]
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/local_instance.py", line 262, in launch_command
    docker_logged_in()
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/utils.py", line 129, in docker_logged_in
    data = load_config_file(docker_config)
  File "/usr/local/lib/python3.6/dist-packages/tlt/components/instance_handler/utils.py", line 66, in load_config_file
    "No file found at: {}".format(config_path)
AssertionError: Config path must be a valid unix path. No file found at: /root/.docker/config.json
root@5b8d6f41d8c8:/workspace#

root@5b8d6f41d8c8:/workspace# docker login nvcr.io
Username: $oauthtoken
Password:
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
docker login | Docker Documentation

Login Succeeded

root@5b8d6f41d8c8:/workspace# tlt ssd run ls
2021-03-18 11:39:08,023 [INFO] root: No mount points were found in the /root/.tlt_mounts.json file.
2021-03-18 11:39:08,023 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the ~/.tlt_mounts.json file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
EULA.pdf README.md examples
2021-03-18 11:39:10,142 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
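
The warning in the log above mentions adding "user":"UID:GID" to the DockerOptions section of ~/.tlt_mounts.json. Here is a minimal sketch of generating such a file from Python. The `Mounts` layout with `source`/`destination` keys is my assumption of what the launcher expects, so please check it against the launcher documentation; the function names are hypothetical.

```python
import json
import os
from pathlib import Path


def build_mounts_config(source: str, destination: str) -> dict:
    """Build a launcher config that maps one directory and runs as the current user."""
    return {
        # Assumption: the launcher reads a "Mounts" list of source/destination pairs.
        "Mounts": [{"source": source, "destination": destination}],
        # "user": "UID:GID" is the option named in the warning above;
        # os.getuid()/os.getgid() return the same values as `id -u` and `id -g`.
        "DockerOptions": {"user": "{}:{}".format(os.getuid(), os.getgid())},
    }


def write_mounts_config(config: dict, path: Path) -> None:
    """Write the config as indented JSON to `path`."""
    path.write_text(json.dumps(config, indent=4))
```

For example, `write_mounts_config(build_mounts_config("/home/<username>/tlt-experiments", "/workspace/tlt-experiments"), Path.home() / ".tlt_mounts.json")` would reuse the directory mapping from the docker run command at the top of this thread, with `<username>` replaced by your own.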

@Morganh

So that’s the problem. I ran into the docker login issue because I tried installing virtualenv as a regular user and it failed (please see the log below).

jae@nvidia-gpu-cloud-image-4-vm:~$ pip3 install virtualenv
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pip/_vendor/__init__.py", line 33, in vendored
    __import__(vendored_name, globals(), locals(), level=0)
ModuleNotFoundError: No module named 'pip._vendor.pkg_resources'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/bin/pip3", line 9, in
    from pip import main
  File "/usr/lib/python3/dist-packages/pip/__init__.py", line 22, in
    from pip._vendor.requests.packages.urllib3.exceptions import DependencyWarning
  File "/usr/lib/python3/dist-packages/pip/_vendor/__init__.py", line 76, in
    vendored("pkg_resources")
  File "/usr/lib/python3/dist-packages/pip/_vendor/__init__.py", line 36, in vendored
    __import__(modulename, globals(), locals(), level=0)
  File "", line 971, in _find_and_load
  File "", line 955, in _find_and_load_unlocked
  File "", line 656, in _load_unlocked
  File "", line 626, in _load_backward_compatible
  File "/usr/share/python-wheels/pkg_resources-0.0.0-py2.py3-none-any.whl/pkg_resources/__init__.py", line 3088, in
  File "/usr/share/python-wheels/pkg_resources-0.0.0-py2.py3-none-any.whl/pkg_resources/__init__.py", line 3072, in _call_aside
  File "/usr/share/python-wheels/pkg_resources-0.0.0-py2.py3-none-any.whl/pkg_resources/__init__.py", line 3101, in _initialize_master_working_set
  File "/usr/share/python-wheels/pkg_resources-0.0.0-py2.py3-none-any.whl/pkg_resources/__init__.py", line 565, in _build_master
  File "/usr/share/python-wheels/pkg_resources-0.0.0-py2.py3-none-any.whl/pkg_resources/__init__.py", line 558, in __init__
  File "/usr/share/python-wheels/pkg_resources-0.0.0-py2.py3-none-any.whl/pkg_resources/__init__.py", line 614, in add_entry
  File "/usr/share/python-wheels/pkg_resources-0.0.0-py2.py3-none-any.whl/pkg_resources/__init__.py", line 1964, in find_on_path
  File "/usr/share/python-wheels/pkg_resources-0.0.0-py2.py3-none-any.whl/pkg_resources/__init__.py", line 2026, in distributions_from_metadata
PermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.6/dist-packages/zipp-3.4.1.dist-info'

Because it was a permission error, I tried the installation as root user. When I did, I was able to install virtualenv and virtualenvwrapper.
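
The PermissionError is raised while pip scans a dist-packages directory, which suggests that some entry there is not readable by the regular user. As a rough diagnostic (a sketch only; the function names are hypothetical and not part of pip), the following lists entries on `sys.path` that the current user cannot read:

```python
import os
import sys
from typing import Dict, List


def unreadable_entries(directory: str) -> List[str]:
    """List entries directly under `directory` that the current user cannot read."""
    if not os.path.isdir(directory):
        return []
    try:
        names = sorted(os.listdir(directory))
    except PermissionError:
        # The directory itself cannot be listed.
        return [directory]
    blocked = []
    for name in names:
        full = os.path.join(directory, name)
        # Directories need both read and execute permission to be scanned.
        mode = os.R_OK | os.X_OK if os.path.isdir(full) else os.R_OK
        if not os.access(full, mode):
            blocked.append(full)
    return blocked


def scan_sys_path() -> Dict[str, List[str]]:
    """Map each sys.path directory to its unreadable entries, skipping clean ones."""
    report = {}
    for directory in sys.path:
        blocked = unreadable_entries(directory) if directory else []
        if blocked:
            report[directory] = blocked
    return report
```

Running `scan_sys_path()` with the system python3 as the regular user should, if my reading of the traceback is right, report the zipp-3.4.1.dist-info entry. As root it will normally report nothing, since root passes these access checks.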

Then, I had to do docker login again as root user. If I didn’t do that, it would fail.
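
The reason the login has to be repeated is most likely that Docker stores credentials per user: docker login run as a regular user writes to that user's ~/.docker/config.json, while the launcher running as root looks in /root/.docker/config.json, as the AssertionError shows. Below is a minimal sketch of checking this from Python. It assumes the default config location (the `DOCKER_CONFIG` environment variable is ignored), and the helper names are hypothetical.

```python
import json
import os
from pathlib import Path
from typing import Optional


def docker_config_path(home: Optional[str] = None) -> Path:
    """Default Docker client config location for a given home directory."""
    base = Path(home) if home else Path(os.path.expanduser("~"))
    return base / ".docker" / "config.json"


def has_registry_login(config_file: Path, registry: str = "nvcr.io") -> bool:
    """True if `config_file` exists and lists `registry` under "auths".

    This is not the launcher's own check; it only looks for an "auths" entry,
    which `docker login` creates for the registry.
    """
    if not config_file.is_file():
        return False
    data = json.loads(config_file.read_text())
    return registry in data.get("auths", {})
```

`has_registry_login(docker_config_path())` answers the question for whichever user runs the script, so running it once as the regular user and once as root shows which account still needs docker login nvcr.io.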

I was able to run TLT as the root user. I think the error I ran into may be related to package installation rather than anything NVIDIA-specific. If there were an NVIDIA pre-made image on the GCP Marketplace (with everything installed), I think it would be easier.

I hope this is what you were looking for.