Tlt.components.docker_handler.docker_handler: Stopping container

ngccli does not work.
How can I work around this error?

This is happening in all of the EC2 containers.

Converting Tfrecords for kitti trainval dataset
2022-06-15 07:52:49,833 [INFO] root: Registry: ['nvcr.io']
2022-06-15 07:52:49,943 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3
2022-06-15 07:52:49,954 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ubuntu/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2022-06-15 07:52:51,939 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
(launcher) ubuntu@ip-172-31-0-229:~$ docker images
REPOSITORY                          TAG                     IMAGE ID       CREATED       SIZE
nvcr.io/nvidia/tao/tao-toolkit-tf   v3.22.05-tf1.15.4-py3   ca92a571a959   3 weeks ago   16.1GB
↓
docker run -it ca92a571a959 bash
↓
  inflating: /opt/ngccli/ngc-cli/frozenlist/_frozenlist.cpython-39-x86_64-linux-gnu.so
 extracting: /opt/ngccli/ngc-cli.md5
chmod: cannot access '/opt/ngccli/ngc': No such file or directory
(launcher) ubuntu@ip-172-31-0-229:~$ ll /opt/ngccli/ngc
-rwxr-xr-x 1 ubuntu ubuntu 7247104 Jun 13 20:21 /opt/ngccli/ngc*

Refer to the 2nd workaround in "Chmod: cannot access '/opt/ngccli/ngc': No such file or directory" (post #2 by Morganh).

Sorry, this error occurs:

(launcher) ubuntu@ip-172-31-0-229:/opt$ docker images
REPOSITORY                          TAG                     IMAGE ID       CREATED       SIZE
nvcr.io/nvidia/tao/tao-toolkit-tf   v3.22.05-tf1.15.4-py3   ca92a571a959   3 weeks ago   16.1GB
(launcher) ubuntu@ip-172-31-0-229:/opt$ docker run --runtime=nvidia -it --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3 bin/bash
docker: Error response from daemon: Unknown runtime specified nvidia.
See 'docker run --help'.
(launcher) ubuntu@ip-172-31-0-229:/opt$

This command works:

docker run -it --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3 bash

Please install nvidia-docker2.
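
One way to confirm whether the `nvidia` runtime is registered with Docker before retrying `--runtime=nvidia` is to inspect the output of `docker info`. The following is a minimal sketch, not part of TAO: the helper names `has_nvidia_runtime` and `docker_runtimes` are made up for illustration, and it assumes that `docker info --format '{{json .Runtimes}}'` prints a JSON object keyed by runtime name.

```python
import json
import subprocess


def has_nvidia_runtime(runtimes):
    """Return True if a runtime named "nvidia" is present in the mapping."""
    return "nvidia" in runtimes


def docker_runtimes():
    """Ask the local Docker daemon which runtimes it has registered."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)
```

On the host, `has_nvidia_runtime(docker_runtimes())` would be expected to return `False` as long as Docker reports "Unknown runtime specified nvidia".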

Could you please tell me where docker_handler.py is?

(launcher) ubuntu@ip-172-31-0-229:~$ which tao
/home/ubuntu/.virtualenvs/launcher/bin/tao
(launcher) ubuntu@ip-172-31-0-229:~$ pip3 --version
pip 22.1.2 from /home/ubuntu/.virtualenvs/launcher/lib/python3.8/site-packages/pip (python 3.8)

(launcher) ubuntu@ip-172-31-0-229:~/.virtualenvs/launcher/lib/python3.8/site-packages$ ll | grep tao
drwx------   2 ubuntu ubuntu   4096 Jun  8 07:28 nvidia_tao-0.1.23.dist-info/
Modify lib/python3.6/site-packages/tao/components/docker_handler/docker_handler.py. This file should be available when you install nvidia-tao. Change
VALID_DOCKER_ARGS = ["user", "ports", "shm_size", "ulimits", "privileged", "network"]

to

VALID_DOCKER_ARGS = ["user", "ports", "shm_size", "ulimits", "privileged", "network", "entrypoint"]

Can you find either of these directories?
$ ls ~/.virtualenvs/launcher/lib/python3.8/site-packages/tlt

or

$ ls ~/.virtualenvs/launcher/lib/python3.8/site-packages/tao

(launcher) ubuntu@ip-172-31-0-229:~/.virtualenvs/launcher/lib/python3.8/site-packages$  ls ~/.virtualenvs/launcher/lib/python3.8/site-packages/tlt
__init__.py  __pycache__  components  config  entrypoint  license  version.py
(launcher) ubuntu@ip-172-31-0-229:~/.virtualenvs/launcher/lib/python3.8/site-packages$ ls ~/.virtualenvs/launcher/lib/python3.8/site-packages/tao
ls: cannot access '/home/ubuntu/.virtualenvs/launcher/lib/python3.8/site-packages/tao': No such file or directory
(launcher) ubuntu@ip-172-31-0-229:~/.virtualenvs/launcher/lib/python3.8/site-packages$

That’s it. You installed nvidia-tlt instead of nvidia-tao.
So please go ahead and modify docker_handler.py, which is located in the tlt folder.
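
If more than one Python environment is present, it can help to ask the interpreter itself which copy of docker_handler.py it would import. The snippet below is a small sketch using the standard `importlib.util.find_spec`; the helper name `locate_module` is hypothetical, and the two module paths are the ones discussed in this thread.

```python
import importlib.util


def locate_module(name):
    """Return the file path Python would import `name` from, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        # Raised when a parent package (e.g. "tao") cannot be imported.
        return None
    return spec.origin if spec is not None else None


for candidate in (
    "tlt.components.docker_handler.docker_handler",
    "tao.components.docker_handler.docker_handler",
):
    print(candidate, "->", locate_module(candidate))
```

Run it with the same interpreter that launches `tao` (for example, from the same shell or the same Jupyter kernel), since each environment may resolve the module to a different file.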

I fixed it, but Jupyter docker has the same error.

■~/.tao_mounts.json

{
    "DockerOptions":{
        "entrypoint": "",
        "shm_size": "16G",
    "Mounts": [
        {
            "source": "/home/ubuntu/ubuntu/tmp/tlt-experiments/data",
            "destination": "/home/ubuntu/ubuntu/tmp/tlt-experiments/data"
        },
        {
            "source": "/home/ubuntu/ubuntu/tmp/tlt-experiments/detectnet_v2/specs",
            "destination": "/home/ubuntu/ubuntu/tmp/tlt-experiments/detectnet_v2/specs"
        },
        {
            "source": "/home/ubuntu/ubuntu/tmp/tlt-experiments/detectnet_v2/",
            "destination": "/home/ubuntu/ubuntu/tmp/tlt-experiments/detectnet_v2/"
        }
    ]
}
DEFAULT_DOCKER_PATH = "unix://var/run/docker.sock"
VALID_PORT_PROTOCOLS = ["tcp", "udp", "sctp"]
VALID_DOCKER_ARGS = ["user", "ports", "shm_size", "ulimits", "privileged", "network", "entrypoint"]
# Creating a new directory for the output tfrecords dump.
print("Converting Tfrecords for kitti trainval dataset")
!mkdir -p $LOCAL_DATA_DIR/tfrecords && rm -rf $LOCAL_DATA_DIR/tfrecords/*
!tao detectnet_v2 dataset_convert \
                  -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt \
                  -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval
Converting Tfrecords for kitti trainval dataset
2022-06-15 08:59:14,460 [INFO] root: Registry: ['nvcr.io']
2022-06-15 08:59:14,557 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/tao", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.8/site-packages/tlt/entrypoint/entrypoint.py", line 113, in main
    local_instance.launch_command(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/tlt/components/instance_handler/local_instance.py", line 319, in launch_command
    docker_handler.run_container(command)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/tlt/components/docker_handler/docker_handler.py", line 284, in run_container
    mount_data, env_vars, docker_options = self._get_mount_env_data()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/tlt/components/docker_handler/docker_handler.py", line 92, in _get_mount_env_data
    data = self._load_mounts_file(self._docker_mount_file)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/tlt/components/docker_handler/docker_handler.py", line 77, in _load_mounts_file
    data = json.load(mfile)
  File "/usr/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 20 column 1 (char 641)

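This is a new error.
The JSON in ~/.tao_mounts.json is malformed: the comma after "shm_size": "16G" is misplaced. The "DockerOptions" block must be closed with "}," before "Mounts".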
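
A quick way to catch this kind of mistake before running `tao` is to parse the file with Python's `json` module and print where parsing stops. This is only a sketch: `check_json` is a hypothetical helper, and the two strings are reduced stand-ins for the mounts file, not its real contents.

```python
import json


def check_json(text):
    """Return None if `text` is valid JSON, else a message with the position."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"{err.msg} at line {err.lineno}, column {err.colno}"
    return None


# Reduced stand-in for the broken file: "DockerOptions" is never closed,
# so "Mounts" is read as one of its keys and the outer object stays open.
broken = '{"DockerOptions": {"shm_size": "16G", "Mounts": []}'
# Same content with "DockerOptions" closed before "Mounts".
fixed = '{"DockerOptions": {"shm_size": "16G"}, "Mounts": []}'

print(check_json(broken))
print(check_json(fixed))
```

To check the real file, pass its contents instead, for example `check_json(pathlib.Path.home().joinpath(".tao_mounts.json").read_text())` after importing `pathlib`.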

I got an error.

(launcher) ubuntu@ip-172-31-0-229:~$ cat ~/.tao_mounts.json
{
    "DockerOptions": {
        "entrypoint": "",
        "shm_size": "16G"
    },
    "Mounts": [
        {
            "source": "/home/ubuntu/ubuntu/tmp/tlt-experiments/data",
            "destination": "/home/ubuntu/ubuntu/tmp/tlt-experiments/data"
        },
        {
            "source": "/home/ubuntu/ubuntu/tmp/tlt-experiments/detectnet_v2/specs",
            "destination": "/home/ubuntu/ubuntu/tmp/tlt-experiments/detectnet_v2/specs"
        },
        {
            "source": "/home/ubuntu/ubuntu/tmp/tlt-experiments/detectnet_v2/",
            "destination": "/home/ubuntu/ubuntu/tmp/tlt-experiments/detectnet_v2/"
        }
    ]
}
# Creating a new directory for the output tfrecords dump.
print("Converting Tfrecords for kitti trainval dataset")
!mkdir -p $LOCAL_DATA_DIR/tfrecords && rm -rf $LOCAL_DATA_DIR/tfrecords/*
!tao detectnet_v2 dataset_convert \
                  -d $SPECS_DIR/detectnet_v2_tfrecords_kitti_trainval.txt \
                  -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval

Converting Tfrecords for kitti trainval dataset
2022-06-15 09:14:11,986 [INFO] root: Registry: ['nvcr.io']
2022-06-15 09:14:12,103 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.4-py3
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/tao", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.8/site-packages/tlt/entrypoint/entrypoint.py", line 113, in main
    local_instance.launch_command(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/tlt/components/instance_handler/local_instance.py", line 319, in launch_command
    docker_handler.run_container(command)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/tlt/components/docker_handler/docker_handler.py", line 289, in run_container
    self.start_container(volumes, env_variables, docker_options)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/tlt/components/docker_handler/docker_handler.py", line 248, in start_container
    docker_args = self.get_docker_option_args(docker_options)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/tlt/components/docker_handler/docker_handler.py", line 229, in get_docker_option_args
    assert key in VALID_DOCKER_ARGS, (
AssertionError: The parameter "entrypoint" mentioned in the config file isn't a valid option.
Please choose one of the following: ['user', 'ports', 'shm_size', 'ulimits', 'privileged', 'network']

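Please modify /home/ubuntu/.local/lib/python3.8/site-packages/tlt/components/docker_handler/docker_handler.py. The traceback shows that Jupyter loads the copy installed under ~/.local, not the one in the launcher virtualenv, so that copy needs the same change.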
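
The assertion in the traceback compares each key under "DockerOptions" against the module-level `VALID_DOCKER_ARGS` list. The check can be reproduced outside the launcher with a rough sketch like the one below; `unsupported_options` is a hypothetical helper, not the launcher's own code, and the list is copied from the error message.

```python
# Allowed keys as printed in the AssertionError (before any modification).
VALID_DOCKER_ARGS = ["user", "ports", "shm_size", "ulimits", "privileged", "network"]


def unsupported_options(config, valid=VALID_DOCKER_ARGS):
    """List the "DockerOptions" keys that are not in the allowed list."""
    return [key for key in config.get("DockerOptions", {}) if key not in valid]


config = {"DockerOptions": {"entrypoint": "", "shm_size": "16G"}, "Mounts": []}

# With the unmodified list, "entrypoint" is reported as unsupported.
print(unsupported_options(config))
# With "entrypoint" appended to the list, nothing is reported.
print(unsupported_options(config, VALID_DOCKER_ARGS + ["entrypoint"]))
```

Any key this reports would trip the same assertion in whichever copy of docker_handler.py has not been edited.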

It works.
But I have a question: this error occurs whenever I create a new EC2 instance.
Can NVIDIA fix this officially?

What do you mean by “a new EC2 is created”?

When I create an EC2 instance, I now have to install the ngc command.
Previously, this was not necessary.

No modifications to .tao_mounts.json or docker_handler.py were necessary either.

Can you attach the full log for better understanding?

Please start from
(launcher) ubuntu@ip-172-31-0-229:~$

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks
