Tlt.components.docker_handler.docker_handler: Stopping container

My docker setup was working before. Today I ran the command below and got an error.

!tao mask_rcnn train -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
                     -d $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                     -k $KEY \
                     --gpus 1

The whole message before the container stopped is:

For multi-GPU, change --gpus based on your machine.
2022-06-15 02:15:27,288 [INFO] root: Registry: ['nvcr.io']
2022-06-15 02:15:27,528 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-06-15 02:15:27,712 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/sysadmin/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
2022-06-15 02:15:33,657 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
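Side note on the warning above: the "user":"UID:GID" entry goes under DockerOptions in /home/sysadmin/.tao_mounts.json. A minimal sketch, with placeholder UID/GID values and a hypothetical mount entry:

{
    "Mounts": [
        {
            "source": "/home/sysadmin/tao-experiments",
            "destination": "/workspace/tao-experiments"
        }
    ],
    "DockerOptions": {
        "user": "1000:1000"
    }
}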

What could be wrong?

As a workaround, please try the following in a terminal.
$ docker run --runtime=nvidia -it --rm --entrypoint "" -v yourlocalfolder:dockerfolder nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash

Why can't I run from Jupyter as before, and how can I fix it?
The above-mentioned command works, but I prefer running the command from Jupyter.

Since yesterday, there is a new update in ngccli which results in this issue. The internal team is working on updating the launcher with a fix.
The above is just a workaround. I will check for another workaround for Jupyter.

Does that mean that rather than running as usual in Jupyter, I run docker in bash and work inside the container?
What should I set for yourlocalfolder:dockerfolder?

For triggering a Jupyter notebook, see the example below.

$ docker run --runtime=nvidia -it --rm --entrypoint "" -v ~/demo_3.0:/workspace/demo_3.0 -p 8888:8888 nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash

then,

root@8d6c08489e41:/workspace# jupyter notebook --ip 0.0.0.0 --allow-root
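
With the example above, ~/demo_3.0 on the host is visible as /workspace/demo_3.0 inside the container, and since port 8888 is published with -p 8888:8888, the notebook can be opened from the host browser at http://localhost:8888 using the token printed by the jupyter command. The ~/demo_3.0 path is only an example; point it at the folder that holds your notebooks and spec files.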

This command
!tao info
gives
/bin/sh: 1: tao: not found

For this workaround, please ignore the tao launcher; it is a Python package installed on the host, not inside the container, so the task entry points are called directly there.
For example, if you run training,

root@03d61af590ba:/workspace# mask_rcnn train --help
Using TensorFlow backend.
usage: mask_rcnn train [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS]
                       [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp]
                       [--log_file LOG_FILE] -e EXPERIMENT_SPEC_FILE -k KEY -d
                       MODEL_DIR
                       {dataset_convert,evaluate,export,inference,inference_trt,prune,train}
                       ...

optional arguments:
  -h, --help            show this help message and exit
  --num_processes NUM_PROCESSES, -np NUM_PROCESSES
                        The number of horovod child processes to be spawned.
                        Default is -1(equal to --gpus).
  --gpus GPUS           The number of GPUs to be used for the job.
  --gpu_index GPU_INDEX [GPU_INDEX ...]
                        The indices of the GPU's to be used.
  --use_amp             Flag to enable Auto Mixed Precision.
  --log_file LOG_FILE   Path to the output log file.
  -e EXPERIMENT_SPEC_FILE, --experiment_spec_file EXPERIMENT_SPEC_FILE
                        Path to spec file. Absolute path or relative to
                        working directory. If not specified, default spec from
                        spec_loader.py is used.
  -k KEY, --key KEY     Key to save or load a .tlt model.
  -d MODEL_DIR, --model_dir MODEL_DIR
                        Dir to save or load a .tlt model.

tasks:
  {dataset_convert,evaluate,export,inference,inference_trt,prune,train}
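
Inside the container, the training from the original notebook cell can then be launched directly. A sketch (the container-side paths are only examples and depend on what you mounted with -v; use the same key you passed as $KEY):

root@03d61af590ba:/workspace# mask_rcnn train -e /workspace/demo_3.0/specs/maskrcnn_train_resnet50.txt \
                                              -d /workspace/demo_3.0/experiment_dir_unpruned \
                                              -k <your_key> \
                                              --gpus 1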

Does this workaround always work? Even after the team has fixed the changes, can it still be used?

Yes.

Thanks, training has started. Let's see what else I still need to fix.
Yesterday's issue is still there. I am going to remove the images causing the issue.

@Segmentator
Please create a new topic and share the full command and the full log.

The issue is solved using the workaround method.

Hi! I have the same issue with !tao n_gram train. I followed the steps above, and get an "n_gram: command not found" error when launching n_gram from inside the container. Could you help, please?

@dm13
For n_gram, per "$ tao info --verbose", it is in another container.
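The same workaround applies, just with the language-model container instead of tao-toolkit-tf. A sketch, assuming the lm container sits under the same nvcr.io/nvidia/tao registry path and reusing the mount from before:

$ docker run --runtime=nvidia -it --rm --entrypoint "" -v ~/demo_3.0:/workspace/demo_3.0 nvcr.io/nvidia/tao/tao-toolkit-lm:v3.22.05-py3 /bin/bash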

Please use the fix: update nvidia-tao.
$ pip3 install nvidia-tao==0.1.24
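
After upgrading, running "tao info --verbose" from the host should again list the tasks and their containers:

$ tao info --verbose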

Oh, got it, thanks! I changed it to tao-toolkit-lm:v3.22.05-py3 and it works now.

Any update/solution to this?

@BarcaBear
This topic is root-caused and already has a solution.
Please create a new topic for your case.
