Issue with tlt.components.docker_handler.docker_handler: Stopping container

Hello, I’m relatively new to NVIDIA so I apologize in advance for any mistakes/inaccuracies!

I’m currently trying to convert a COCO dataset to TFRecords so I can run NVIDIA’s Mask-RCNN implementation with TAO. To do this, I’m running the download_and_preprocess_coco.sh script. However, I noticed that the script attempts to clone the tensorflow/models GitHub repository into tf-models, and it always stops suddenly midway through the cloning process (usually at around 20%) without returning an error. I’ve checked all of the hardware requirements for NVIDIA TAO’s Mask-RCNN and they all seem to be satisfied. Sorry in advance for the miscellaneous debugging commands; I’ve been trying to find the root of the issue for the past few days.

2022-06-17 18:23:13,056 [INFO] root: Registry: ['nvcr.io']
2022-06-17 18:23:13,138 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
2022-06-17 18:23:13,160 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/jeffreyzyh/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
+ '[' -z /workspace/tao-experiments/data ']'
+ echo 'Cloning Tensorflow models directory (for conversion utilities)'
Cloning Tensorflow models directory (for conversion utilities)
+ ls
EULA.pdf				    README.md
NVIDIA_Deep_Learning_Container_License.pdf  tao-experiments
+ '[' '!' -e tf-models ']'
+ git clone http://github.com/tensorflow/models tf-models
Cloning into 'tf-models'...
warning: redirecting to https://github.com/tensorflow/models/
remote: Enumerating objects: 74566, done.
remote: Counting objects: 100% (54/54), done.
remote: Compressing objects: 100% (41/41), done.
2022-06-17 18:23:17,041 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

If it matters, I’m running this on a virtual machine, not locally.
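As a side note, the DockerOptions entry that the warning in the log above refers to can be looked up like this; the UID/GID values in the comment are only illustrative placeholders:

```shell
# Look up the values the warning asks for
id -u    # your UID, e.g. 1000
id -g    # your GID, e.g. 1000

# The matching entry in ~/.tao_mounts.json would then contain something like:
#   "DockerOptions": { "user": "1000:1000" }
```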

Please check if this thread can help you: Chmod: cannot access ‘/opt/ngccli/ngc’: No such file or directory - #2 by Morganh

Hello, thanks for the quick reply!

Since I was limited to one link in my earlier post I couldn’t note it, but I did look over that thread: unfortunately, running the docker command does nothing for me, and the error persists when I run the TAO command again. However, when I try the second workaround, I can’t seem to find lib/python3.6/site-packages/tao/components/docker_handler/docker_handler.py. Could I have some guidance on how to locate it?

Any log for this?

Try to search the docker_handler.py.

Here is what my terminal says. I initially tried running it in Jupyter and running the TAO command afterward (which did not fix anything).

(base) jeffreyzyh@maskrcnn-tao-test:~/train_mask_rcnn$ docker run -it --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 bash
root@ed1614d2fd39:/workspace# 

The TAO command I’m trying to run is tao mask_rcnn run bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR. I’ve replaced $SPECS_DIR and $DATA_DOWNLOAD_DIR with the respective directories, but I get the following message when running the command in the docker container:

bash: tao: command not found

Also, trying to search for docker_handler.py yields find: ‘docker_handler.py’: No such file or directory when run from the base of the VM (after exiting the docker container).
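The `No such file or directory` from find above happens because a bare filename is interpreted as a starting directory, not a search pattern; find expects a start path followed by `-name`. A quick illustration using a throwaway temp directory:

```shell
# A bare filename argument is treated as a start path, so find fails
# if no such path exists. The usual form is: find <start-dir> -name <pattern>
tmp=$(mktemp -d)
touch "$tmp/docker_handler.py"
find "$tmp" -name docker_handler.py   # prints the file's full path
rm -rf "$tmp"
```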

According to the above log, you are already using the 1st workaround to log in to the docker. It is not necessary to use the tao-launcher again. Please run any task without the tao prefix.
For example,
# mask_rcnn train xxx

Thanks for the info. However, when I try to run the commands consecutively on Jupyter, I get the following error:

root@65354baaecc6:/workspace# /bin/bash: mask_rcnn: command not found

While when I run the commands on the terminal, I get the following error messages.

(base) jeffreyzyh@maskrcnn-tao-test:~/train_mask_rcnn$ docker run -it --rm --entrypoint "" nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 bash
root@b41745751a4a:/workspace# mask_rcnn run bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR
Using TensorFlow backend.
Traceback (most recent call last):
  File "/usr/local/bin/mask_rcnn", line 8, in <module>
    sys.exit(main())
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/entrypoint/mask_rcnn.py", line 14, in main
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 263, in launch_job
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 47, in get_modules
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/scripts/export.py", line 8, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/mask_rcnn/export/exporter.py", line 26, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/base_exporter.py", line 22, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/export/tensorfile_calibrator.py", line 14, in <module>
  File "/usr/local/lib/python3.6/dist-packages/pycuda/autoinit.py", line 2, in <module>
    import pycuda.driver as cuda
  File "/usr/local/lib/python3.6/dist-packages/pycuda/driver.py", line 62, in <module>
    from pycuda._driver import *  # noqa
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory

There either seems to be a continued issue with accessing certain files, or something missing that I should download. Could I have some further guidance?
My Python version is 3.6.15, which satisfies the requirements outlined in the Mask-RCNN page.

Addendum: I replaced $SPECS_DIR and $DATA_DOWNLOAD_DIR with the precise paths and reran the command, but I get the same error.

Please use
docker run --runtime=nvidia
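Combined with the docker command used earlier in the thread (image tag copied from the logs above), the full invocation would presumably be:

```shell
# The earlier docker run command plus --runtime=nvidia, which exposes the
# host's GPU driver (and thus libcuda.so.1) inside the container
docker run -it --rm --runtime=nvidia --entrypoint "" \
    nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 bash
```

The missing `--runtime=nvidia` would explain the earlier `ImportError: libcuda.so.1` traceback, since the driver library is mounted in from the host at container start.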

I see. I now get the following output: does it mean that I shouldn’t run the conversion script and should instead use the built-in dataset_convert command?

root@b2070f85dee7:/workspace# mask_rcnn run bash /workspace/tao-experiments/mask_rcnn/specs/download_and_preprocess_coco.sh /workspace/tao-experiments/data
Using TensorFlow backend.
usage: mask_rcnn [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS]
                 [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp]
                 [--log_file LOG_FILE]
                 {dataset_convert,evaluate,export,inference,inference_trt,prune,train}
                 ...
mask_rcnn: error: invalid choice: 'run' (choose from 'dataset_convert', 'evaluate', 'export', 'inference', 'inference_trt', 'prune', 'train')

There is no “mask_rcnn run bash”.

You can use

mask_rcnn dataset_convert

I see. So I would not be able to run the script outlined in cell 6 of this article? I will probably attempt to circumvent the script later using dataset_convert, but if it does not work, is there any way that I could still run the download_and_preprocess_coco.sh script in the workaround?

Cell 6 should work.

Sorry I may have misunderstood, but you mentioned in an earlier post that it’s not necessary to use the TAO launcher again. So how would I replicate Cell 6? Or did you mean that manually running the script’s contents with dataset_convert should work?

Can you try running the following in a terminal?

$ tao mask_rcnn run /bin/bash

then

# bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR

I ran tao mask_rcnn run /bin/bash, and it lets me into the docker workspace, but the container closes very quickly (after around 5 seconds), yielding the following messages.

(base) jeffreyzyh@maskrcnn-tao-test:~/train_mask_rcnn$ tao mask_rcnn run /bin/bash
2022-06-20 01:28:37,925 [INFO] root: Registry: ['nvcr.io']
2022-06-20 01:28:38,006 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3
2022-06-20 01:28:38,023 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/jeffreyzyh/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
root@1754805a316e:/workspace# 2022-06-20 01:28:44,297 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

However, if I manage to quickly copy-paste bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR into the command line afterwards, it says:

root@611d9a462490:/workspace# bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR
bash: /download_and_preprocess_coco.sh: No such file or directory

Then the container kicks me out again after around two seconds.

OK, if you use “tao”, i.e., the tao-launcher, then as mentioned earlier for workaround 2, I suggest you search for docker_handler.py again.

$ sudo find / -name docker_handler.py

Please set an explicit path for $SPECS_DIR.
Also make sure your ~/.tao_mounts.json is set correctly.
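For example, using the explicit container-side paths that worked earlier in the thread, the variables and the call would look like this:

```shell
# Explicit container-side paths (taken from the working invocation above).
# If these variables are left unset, "$SPECS_DIR/..." expands to "/...",
# which explains the earlier "bash: /download_and_preprocess_coco.sh" error.
export SPECS_DIR=/workspace/tao-experiments/mask_rcnn/specs
export DATA_DOWNLOAD_DIR=/workspace/tao-experiments/data

# Confirm the expansion before running the script:
echo "$SPECS_DIR/download_and_preprocess_coco.sh"
# then: bash $SPECS_DIR/download_and_preprocess_coco.sh $DATA_DOWNLOAD_DIR
```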


This seems to have fixed it. However, only a portion of the COCO dataset is being converted. Nonetheless, the main issue of this post has been resolved, so I’ll create another post if I’m unable to solve that problem. Thanks a lot for your time!

EDIT: Never mind, I found out why only part of the COCO dataset was converted. The issue has been resolved; thanks again, Morganh, for your time.