Tlt.components.docker_handler.docker_handler: Stopping container

edit_or · June 15, 2022, 2:30am

My Docker setup was working before. Today I ran the command below and got an error.

!tao mask_rcnn train -e $SPECS_DIR/maskrcnn_train_resnet50.txt \
                     -d $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                     -k $KEY \
                     --gpus 1

The full output before the container stopped is:

For multi-GPU, change --gpus based on your machine.
2022-06-15 02:15:27,288 [INFO] root: Registry: ['nvcr.io']
2022-06-15 02:15:27,528 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
2022-06-15 02:15:27,712 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/sysadmin/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
2022-06-15 02:15:33,657 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

What could be wrong?

Morganh · June 15, 2022, 2:31am

As a workaround, please try the command below in a terminal.
$ docker run --runtime=nvidia -it --rm --entrypoint "" -v yourlocalfolder:dockerfolder nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash

edit_or · June 15, 2022, 2:40am

Why can't I run it from Jupyter as before, and how can I get that working again?
The command above works, but I would prefer to run commands from Jupyter.

Morganh · June 15, 2022, 2:43am

An ngccli update released yesterday causes this issue. The internal team is working on an updated launcher that includes a fix.
The command above is just a workaround. I will look into another workaround for Jupyter.

edit_or · June 15, 2022, 3:01am

Does that mean that, rather than running as normal in Jupyter, I start the Docker container from bash and work inside it?
What should I set for yourlocalfolder:dockerfolder?

Morganh · June 15, 2022, 3:10am

To launch Jupyter Notebook from the container, see the example below:

$ docker run --runtime=nvidia -it --rm --entrypoint "" -v ~/demo_3.0:/workspace/demo_3.0 -p 8888:8888 nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3 /bin/bash

then,

root@8d6c08489e41:/workspace# jupyter notebook --ip 0.0.0.0 --allow-root
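
To make the role of each part of that command easier to see, here is a minimal sketch that assembles it from variables and prints it for review. LOCAL_DIR, CONTAINER_DIR and build_tao_cmd are illustrative names chosen for this example, not TAO settings; the image tag and paths are the ones from the example above.

```shell
#!/usr/bin/env bash
# Sketch: assemble the workaround `docker run` command from variables.
# LOCAL_DIR, CONTAINER_DIR and build_tao_cmd are hypothetical names.
IMAGE="nvcr.io/nvidia/tao/tao-toolkit-tf:v3.22.05-tf1.15.5-py3"
LOCAL_DIR="${LOCAL_DIR:-$HOME/demo_3.0}"                 # folder on the host
CONTAINER_DIR="${CONTAINER_DIR:-/workspace/demo_3.0}"    # where it appears in the container

build_tao_cmd() {
  # -v maps host folder -> container folder; -p publishes Jupyter's port 8888.
  # The empty entrypoint is printed literally as "" so the line can be pasted.
  printf '%s ' docker run --runtime=nvidia -it --rm --entrypoint '""' \
    -v "${LOCAL_DIR}:${CONTAINER_DIR}" -p 8888:8888 "${IMAGE}" /bin/bash
  printf '\n'
}

# Dry run: only print the command. Paths containing spaces would need quoting.
build_tao_cmd
```

The part before the colon in -v is the folder on your machine and the part after it is where that folder is mounted inside the container, so notebooks and data saved under the container path persist on the host after the container exits.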

edit_or · June 15, 2022, 3:36am

This command
!tao info
gives
/bin/sh: 1: tao: not found

Morganh · June 15, 2022, 3:39am

With this workaround, do not use the tao launcher. Run the task commands directly inside the container.
For example, to run training:

root@03d61af590ba:/workspace# mask_rcnn train --help
Using TensorFlow backend.
usage: mask_rcnn train [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS]
                       [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp]
                       [--log_file LOG_FILE] -e EXPERIMENT_SPEC_FILE -k KEY -d
                       MODEL_DIR
                       {dataset_convert,evaluate,export,inference,inference_trt,prune,train}
                       ...

optional arguments:
  -h, --help            show this help message and exit
  --num_processes NUM_PROCESSES, -np NUM_PROCESSES
                        The number of horovod child processes to be spawned.
                        Default is -1(equal to --gpus).
  --gpus GPUS           The number of GPUs to be used for the job.
  --gpu_index GPU_INDEX [GPU_INDEX ...]
                        The indices of the GPU's to be used.
  --use_amp             Flag to enable Auto Mixed Precision.
  --log_file LOG_FILE   Path to the output log file.
  -e EXPERIMENT_SPEC_FILE, --experiment_spec_file EXPERIMENT_SPEC_FILE
                        Path to spec file. Absolute path or relative to
                        working directory. If not specified, default spec from
                        spec_loader.py is used.
  -k KEY, --key KEY     Key to save or load a .tlt model.
  -d MODEL_DIR, --model_dir MODEL_DIR
                        Dir to save or load a .tlt model.

tasks:
  {dataset_convert,evaluate,export,inference,inference_trt,prune,train}
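
Applied to the training command from the first post, this amounts to dropping the leading tao. The sketch below only prints the resulting command; the default paths and key are placeholders assumed for illustration, and print_train_cmd is a hypothetical helper, so adjust them to locations that exist inside your container.

```shell
#!/usr/bin/env bash
# Sketch: the training command from the first post without the `tao` prefix,
# intended for a shell inside the container.
# The defaults below are placeholders (assumptions), not real TAO defaults.
# They must be container paths, not host paths.
SPECS_DIR="${SPECS_DIR:-/workspace/demo_3.0/specs}"
USER_EXPERIMENT_DIR="${USER_EXPERIMENT_DIR:-/workspace/demo_3.0/mask_rcnn}"
KEY="${KEY:-your_key}"

# print_train_cmd is a hypothetical helper: it prints the command, nothing more.
print_train_cmd() {
  printf 'mask_rcnn train -e %s -d %s -k %s --gpus 1\n' \
    "${SPECS_DIR}/maskrcnn_train_resnet50.txt" \
    "${USER_EXPERIMENT_DIR}/experiment_dir_unpruned" \
    "${KEY}"
}

print_train_cmd
```

The flags are the same -e, -d, -k and --gpus options listed in the help output above. Only the launcher prefix changes.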

edit_or · June 15, 2022, 3:57am

Will this workaround always work? Even after the team has released the fix, can it still be used?

Morganh · June 15, 2022, 4:07am

Yes.

edit_or · June 15, 2022, 4:19am

Thanks, training has started. Let's see what else I still need to fix.
Yesterday's issue is still there. I am going to remove the images that cause it.

Morganh · June 15, 2022, 10:42am

@Segmentator
Please create a new topic and share full command and full log.

edit_or · June 16, 2022, 5:42am

The issue is solved using the workaround method.

dm13 · June 23, 2022, 12:50pm

Hi! I have the same issue with !tao n_gram train. I followed the steps above, but I get an "n_gram: command not found" error when launching n_gram from inside the container. Could you help, please?

Morganh · June 26, 2022, 3:16pm

@dm13
For n_gram, "$ tao info --verbose" shows that it is in another container.

Please apply the fix by updating the nvidia-tao package:
$ pip3 install nvidia-tao==0.1.24
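
To check which launcher version is currently installed before upgrading, something like the following sketch may help. It assumes GNU sort (for -V version ordering) and assumes that any version at or above 0.1.24 carries the fix, which this thread does not confirm. version_ge and FIXED_VERSION are names made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: compare the installed nvidia-tao launcher version against 0.1.24.
# Assumption: versions >= 0.1.24 include the fix (not confirmed in this thread).
FIXED_VERSION="0.1.24"

# version_ge A B: succeeds when A >= B in version order (needs GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# `pip3 show` prints a "Version: x.y.z" line for installed packages.
installed="$(pip3 show nvidia-tao 2>/dev/null | awk '/^Version:/ {print $2}')"

if [ -z "$installed" ]; then
  echo "nvidia-tao is not installed"
elif version_ge "$installed" "$FIXED_VERSION"; then
  echo "nvidia-tao $installed: no upgrade needed"
else
  echo "nvidia-tao $installed is older than $FIXED_VERSION; run: pip3 install nvidia-tao==$FIXED_VERSION"
fi
```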

dm13 · June 28, 2022, 4:15pm

Oh got it, thanks! Changed it to tao-toolkit-lm:v3.22.05-py3 and it works now

BarcaBear · July 11, 2022, 7:43pm

Any update/solution to this?

Morganh · July 12, 2022, 2:14am

@BarcaBear
The root cause of this topic has been identified and a solution is already available.
Please create a new topic for your case.

system · July 26, 2022, 2:14am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
