Issue with tlt.components.docker_handler.docker_handler: Stopping container

Hello, I’m relatively new to NVIDIA so I apologize in advance for any mistakes/inaccuracies!

I’m currently trying to convert a COCO dataset to TFRecords so I can run NVIDIA’s Mask-RCNN implementation with TAO. To do this, I’m running the download_and_preprocess_coco.sh script. However, I noticed that the script attempts to clone the tensorflow/models GitHub repository into tf-models, and it always stops suddenly midway through the cloning process (usually at around 20%) without returning an error. I’ve checked all of the hardware requirements for NVIDIA TAO’s Mask-RCNN and they all seem to be satisfied. Sorry in advance for the miscellaneous debugging commands; I’ve been trying to find the root of the issue for the past few days.

2022-06-17 18:23:13,056 [INFO] root: Registry: ['nvcr.io']
2022-06-17 18:23:13,138 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
2022-06-17 18:23:13,160 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/jeffreyzyh/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
+ '[' -z /workspace/tao-experiments/data ']'
+ echo 'Cloning Tensorflow models directory (for conversion utilities)'
Cloning Tensorflow models directory (for conversion utilities)
+ ls
EULA.pdf				    README.md
NVIDIA_Deep_Learning_Container_License.pdf  tao-experiments
+ '[' '!' -e tf-models ']'
+ git clone http://github.com/tensorflow/models tf-models
Cloning into 'tf-models'...
warning: redirecting to https://github.com/tensorflow/models/
remote: Enumerating objects: 74566, done.
remote: Counting objects: 100% (54/54), done.
remote: Compressing objects: 100% (41/41), done.
2022-06-17 18:23:17,041 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

If it matters, I’m running this on a virtual machine, not locally.
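As a side note, the DockerOptions entry that the warning in the log above refers to can be looked up like this; the UID/GID values in the comment are only illustrative placeholders:

```shell
# Look up the values the warning asks for
id -u    # your UID, e.g. 1000
id -g    # your GID, e.g. 1000

# The matching entry in ~/.tao_mounts.json would then contain something like:
#   "DockerOptions": { "user": "1000:1000" }
```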

Please check if this thread can help you: Chmod: cannot access ‘/opt/ngccli/ngc’: No such file or directory - #2 by Morganh

Hello, thanks for the quick reply!

Since I was limited to one link in my earlier post I couldn’t note it, but I did look over that thread: unfortunately, running the docker command does nothing for me, and the error persists when I run the TAO command again. However, when I try the second workaround, I can’t seem to find lib/python3.6/site-packages/tao/components/docker_handler/docker_handler.py. Could I have some guidance on how to locate it?

Any log for this?

Try to search the docker_handler.py.

Here is what my terminal says. I initially tried running it in Jupyter and running the TAO command afterward (which did not fix anything).

(base) jeffreyzyh@maskrcnn-tao-test:~/train_mask_rcnn$ docker run -it --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 bash
root@ed1614d2fd39:/workspace# 

The TAO command I’m trying to run is tao mask_rcnn run bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR. I’ve replaced $SPECS_DIR and $DATA_DOWNLOAD_DIR with the respective directories, but I get the following message when running the command in the docker container:

bash: tao: command not found

Also, trying to search for docker_handler.py yields find: ‘docker_handler.py’: No such file or directory when run from the base of the VM (after exiting the docker container).
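The `No such file or directory` from find above happens because a bare filename is interpreted as a starting directory, not a search pattern; find expects a start path followed by `-name`. A quick illustration using a throwaway temp directory:

```shell
# A bare filename argument is treated as a start path, so find fails
# if no such path exists. The usual form is: find <start-dir> -name <pattern>
tmp=$(mktemp -d)
touch "$tmp/docker_handler.py"
find "$tmp" -name docker_handler.py   # prints the file's full path
rm -rf "$tmp"
```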

According to the above log, you are already using the 1st workaround to log in to the docker. It is not necessary to use the tao-launcher again. Please run any task without the tao prefix.
For example,
# mask_rcnn train xxx

Thanks for the info. However, when I try to run the commands consecutively on Jupyter, I get the following error:

root@65354baaecc6:/workspace# /bin/bash: mask_rcnn: command not found

While when I run the commands on the terminal, I get the following error messages.

(base) jeffreyzyh@maskrcnn-tao-test:~/train_mask_rcnn$ docker run -it --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 bash
root@b41745751a4a:/workspace# mask_rcnn run bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/mask_rcnn", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 14, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 263, in launch_job
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/export.py", line 8, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/export/exporter.py", line 26, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/base_exporter.py", line 22, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/tensorfile_calibrator.py", line 14, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 2, in <module>
    import pycuda.driver as cuda
  File "/usr/local/lib/python3.6/dist-packages/pycuda/driver.py", line 62, in <module>
    from pycuda._driver import *  # noqa
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

There either seems to be a continued issue with accessing certain files, or something missing that I should download. Could I have some further guidance?
My Python version is 3.6.15, which satisfies the requirements outlined in the Mask-RCNN page.

Addendum: I replaced $SPECS_DIR and $DATA_DOWNLOAD_DIR with the precise paths and reran the command, but I get the same error.

Please use
docker run --runtime=nvidia
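Combined with the docker command used earlier in the thread (image tag copied from the logs above), the full invocation would presumably be:

```shell
# The earlier docker run command plus --runtime=nvidia, which exposes the
# host's GPU driver (and thus libcuda.so.1) inside the container
docker run -it --rm --runtime=nvidia --entrypoint "" \
    nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 bash
```

The missing `--runtime=nvidia` would explain the earlier `ImportError: libcuda.so.1` traceback, since the driver library is mounted in from the host at container start.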

I see. I now get the following output: does it mean that I shouldn’t run the conversion script and should instead use the built-in dataset_convert command?

root@b2070f85dee7:/workspace# mask_rcnn run bash /workspace/tao-experiments/mask_rcnn/specs/download_and_preprocess_coco.sh /workspace/tao-experiments/data
Using TensorFlow backend.
usage: mask_rcnn [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS]
                 [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp]
                 [--log_file LOG_FILE]
                 {dataset_convert,evaluate,export,inference,inference_trt,prune,train}
                 ...
mask_rcnn: error: invalid choice: 'run' (choose from 'dataset_convert', 'evaluate', 'export', 'inference', 'inference_trt', 'prune', 'train')

There is no “mask_rcnn run bash”.

You can use

mask_rcnn dataset_convert

I see. So I would not be able to run the script outlined in cell 6 of this article? I will probably attempt to circumvent the script later using dataset_convert, but if it does not work, is there any way that I could still run the download_and_preprocess_coco.sh script in the workaround?

Cell 6 should work.

Sorry I may have misunderstood, but you mentioned in an earlier post that it’s not necessary to use the TAO launcher again. So how would I replicate Cell 6? Or did you mean that manually running the script’s contents with dataset_convert should work?

Can you try running the following in a terminal?

$ tao mask_rcnn run /bin/bash

then

# bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR

I ran tao mask_rcnn run /bin/bash, and it lets me into the docker workspace, but the container closes very quickly (after around 5 seconds), yielding the following messages.

(base) jeffreyzyh@maskrcnn-tao-test:~/train_mask_rcnn$ tao mask_rcnn run /bin/bash
2022-06-20 01:28:37,925 [INFO] root: Registry: ['nvcr.io']
2022-06-20 01:28:38,006 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
2022-06-20 01:28:38,023 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/jeffreyzyh/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
root@1754805a316e:/workspace# 2022-06-20 01:28:44,297 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

However, if I manage to quickly copy-paste bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR into the command line afterwards, it says:

root@611d9a462490:/workspace# bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR
bash: /download_and_preprocess_coco.sh: No such file or directory

Then the container kicks me out again after around two seconds.

OK, if you use “tao”, i.e., the tao-launcher, then as mentioned earlier for workaround 2, I suggest you search for docker_handler.py again.

$ sudo find / -name docker_handler.py

Please set an explicit path for $SPECS_DIR.
Also make sure your ~/.tao_mounts.json is set correctly.
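For example, using the explicit container-side paths that worked earlier in the thread, the variables and the call would look like this:

```shell
# Explicit container-side paths (taken from the working invocation above).
# If these variables are left unset, "$SPECS_DIR/..." expands to "/...",
# which explains the earlier "bash: /download_and_preprocess_coco.sh" error.
export SPECS_DIR=/workspace/tao-experiments/mask_rcnn/specs
export DATA_DOWNLOAD_DIR=/workspace/tao-experiments/data

# Confirm the expansion before running the script:
echo "$SPECS_DIR/download_and_preprocess_coco.sh"
# then: bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR
```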


This seems to have fixed it. However, only a portion of the COCO dataset is being converted. Nonetheless, the main issue of this post has been resolved, so I’ll create another post if I’m unable to solve that problem. Thanks a lot for your time!

EDIT: Never mind, I found out why only part of the COCO dataset was converted. The issue has been resolved; thanks again, Morganh, for your time.