Getting [INFO] tlt.components.docker_handler.docker_handler: Stopping container. Why does this occur and how to fix it?


I have created a new environment using conda. I have installed the prerequisites and TLT in this environment following the NVIDIA TAO documentation guide. Since I am working in a conda env, I am not running any code in a Python virtual environment.

After installing the tlt package, I am trying to run the classification Jupyter notebook given in the samples.

When I run the !tlt classification train -e $SPECS_DIR/classification_spec.cfg -r $USER_EXPERIMENT_DIR/output -k $KEY command, the docker container suddenly stops working and outputs:

(I made some changes and printed the formatted_command and volumes.)

2021-06-17 20:24:36,503 [INFO] root: Registry: ['']

formatted_command:  bash -c 'docker exec -it 17d69f83a4d3a69cc27e7c21506d3dcb364c7e98b4860d3ff17eb1c240a61aa5 classification train -e /workspace/tlt-experiments/classification/specs/classification_spec.cfg -r /workspace/tlt-experiments/classification/output -k nvidia_tlt'

volumes: {'/home/omno/Desktop/umair/naturalImages/tlt': {'bind': '/workspace/tlt-experiments', 'mode': 'rw'}, '/home/omno/Desktop/umair/naturalImages/tlt/specs': {'bind': '/workspace/tlt-experiments/classification/specs', 'mode': 'rw'}}
formatted_command: bash -c 'docker exec -it 17d69f83a4d3a69cc27e7c21506d3dcb364c7e98b4860d3ff17eb1c240a61aa5 classification train -e /workspace/tlt-experiments/classification/specs/classification_spec.cfg -r /workspace/tlt-experiments/classification/output -k nvidia_tlt'
Executing the command.
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-dq5h5g_n because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/ The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/ The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

['model_config', 'train_config']
2021-06-17 20:24:43,634 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

We recommend you raise this query in the TLT forum for better assistance.


Can you attach the training spec file?
Also, could you please attach the full log as a txt file?

conda activate newenv
jupyter notebook --allow-root

classification.ipynb (42.0 KB)
classification_spec.cfg (1.2 KB)

I have attached the jupyter notebook. How do I get the log file?

Not needed now; I can find the full log in your Jupyter notebook.
Can you add a cell to the notebook and run the command below?
! tlt classification run cat $SPECS_DIR/classification_spec.cfg

Sorry for the late reply. I lost the system on which I was working previously. I got a new system and carried out the same steps as mentioned in the TLT quick start guide.

I have made changes to the cfg file by adding the paths to my custom train and test datasets. But, unfortunately, TLT cannot access them. I am getting the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/home/omno/Desktop/umair/tlt-samples/classification/data/train'
2021-06-24 16:59:06,390 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Please have a look at my notebook and cfg file:
classification.ipynb (45.1 KB)
classification_spec.cfg (1.3 KB)

Docker/TLT suddenly stops running when I run the following command:
! tlt classification run cat $SPECS_DIR/classification_spec.cfg

2021-06-24 17:10:24,797 [INFO] root: Registry: ['']
model_config {
  arch: "resnet",
  n_layers: 18
  # Setting these parameters to true to match the template downloaded from NGC.
  use_batch_norm: true
  all_projections: true
  freeze_blocks: 0
  freeze_blocks: 1
  input_image_size: "3,224,224"
}
train_config {
  train_dataset_path: "/home/omno/Desktop/umair/tlt-samples/classification/data/train"
  val_dataset_path: "/home/omno/Desktop/umair/tlt-samples/classification/data/test"
  pretrained_model_path: "/home/omno/Desktop/umair/tlt-samples/classification/pretrained_resnet18/tlt_pretrained_classification_vresnet18/resnet_18.hdf5"
  optimizer {
    sgd {
      lr: 0.01
      decay: 0.0
      momentum: 0.9
      nesterov: False
    }
  }
  batch_size_per_gpu: 64
  n_epochs: 80
  n_workers: 16
  preprocess_mode: "caffe"
  enable_random_crop: True
  enable_center_crop: True
  label_smoothing: 0.0
  mixup_alpha: 0.1
  # regularizer
  reg_config {
    type: "L2"
    scope: "Conv2D,Dense"
    weight_decay: 0.00005
  }
  # learning_rate
  lr_config {
    step {
      learning_rate: 0.006
      step_size: 10
      gamma: 0.1
    }
  }
}
eval_config {
  eval_dataset_path: "/home/omno/Desktop/umair/tlt-samples/classification/data/test"
  model_path: "/workspace/tlt-experiments/classification/output/weights/resnet_080.tlt"
  top_k: 3
  batch_size: 256
  n_workers: 8
  enable_center_crop: True
}
2021-06-24 17:10:25,551 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

TLT should give an error explaining why this stopped working.

When you run the command below, it works well according to the log.
! tlt classification run cat $SPECS_DIR/classification_spec.cfg

But you get stuck when you run training because
FileNotFoundError: [Errno 2] No such file or directory: '/home/omno/Desktop/umair/tlt-samples/classification/data/train'

Please run the following command to check if your training images folder is available.
! tlt classification run ls /home/omno/Desktop/umair/tlt-samples/classification/data/train |wc -l

I am getting the following output when running
! tlt classification run ls /home/omno/Desktop/umair/tlt-samples/classification/data/train |wc -l

2021-06-24 17:43:43,917 [INFO] root: Registry: ['']
2021-06-24 17:43:44,652 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

How about
! tlt classification run ls /home/omno/Desktop/umair/tlt-samples/classification/data/train

Are there any images?

No images; this is the output when running ! tlt classification run ls /home/omno/Desktop/umair/tlt-samples/classification/data/train

2021-06-24 17:46:57,649 [INFO] root: Registry: ['']
ls: cannot access '/home/omno/Desktop/umair/tlt-samples/classification/data/train': No such file or directory
2021-06-24 17:46:58,406 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

So, please check whether the local directory is mounted into the docker.

These drives/mount points need to be mapped into the docker. The launcher instance can be configured in the ~/.tlt_mounts.json file.
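For reference, here is a minimal sketch of what a ~/.tlt_mounts.json could look like for the paths used in this thread (the "Mounts" key with source/destination pairs follows the TLT launcher documentation; verify the exact schema against your installed version):

```json
{
    "Mounts": [
        {
            "source": "/home/omno/Desktop/umair/tlt-samples/classification",
            "destination": "/workspace/tlt-experiments"
        }
    ]
}
```

With this mapping in place, every path passed to tlt commands or written in the spec file must use the destination side, e.g. /workspace/tlt-experiments/data/train.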

I made some changes to the paths. Have a look at these please
classification.ipynb (48.5 KB)

According to your tlt_mounts.json file,

"source": "/home/omno/Desktop/umair/tlt-samples/classification",
"destination": "/workspace/tlt-experiments"

Could you run below command to verify the images?
! tlt classification run ls /workspace/tlt-experiments/data/train

I got this output:

2021-06-24 17:53:20,687 [INFO] root: Registry: ['']
cardboard  glass  metal  paper	plastic  trash
2021-06-24 17:53:21,455 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you tell me what I am doing wrong?

So, you can get your images folder now. That's what is expected:
cardboard glass metal paper plastic trash

Is my model being trained? I don't see any weights being saved.

Please modify all the paths inside the training spec. As shown above, change them according to the tlt_mounts.json.

Could you explain why I am getting this error?
FileNotFoundError: [Errno 2] No such file or directory: '/home/omno/Desktop/umair/tlt-samples/classification/data/train'

As mentioned above:

Since the TLT launcher uses docker containers under the hood, these drives/mount points need to be mapped into the docker. The launcher instance can be configured in the ~/.tlt_mounts.json file.

In the command line, the path should be the “destination” path inside the docker.
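To make the source-to-destination mapping concrete, here is a small, hypothetical Python helper (not part of TLT; the function name and MOUNTS list are illustrative, using the paths from this thread) that translates a host path into the path the container sees:

```python
# Translate a host path into the path the docker container sees,
# based on the source -> destination pairs from ~/.tlt_mounts.json.
# This helper is a sketch for illustration only, not part of the TLT tooling.

MOUNTS = [
    {
        "source": "/home/omno/Desktop/umair/tlt-samples/classification",
        "destination": "/workspace/tlt-experiments",
    }
]

def to_container_path(host_path: str, mounts=MOUNTS) -> str:
    """Return host_path rewritten to its in-container equivalent."""
    for m in mounts:
        src = m["source"].rstrip("/")
        # Match the mount root itself or any path beneath it.
        if host_path == src or host_path.startswith(src + "/"):
            return m["destination"].rstrip("/") + host_path[len(src):]
    raise ValueError(f"{host_path} is not under any mounted source")

print(to_container_path(
    "/home/omno/Desktop/umair/tlt-samples/classification/data/train"))
# -> /workspace/tlt-experiments/data/train
```

This is exactly why the training spec must use /workspace/tlt-experiments/... paths: the host-side /home/omno/... paths simply do not exist inside the container.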

There is also a simple way for reference. You can set paths to the same.
For example,

"source": "/home/omno/Desktop/umair/tlt-samples/classification",
"destination": "/home/omno/Desktop/umair/tlt-samples/classification"


Thank you, this worked for me!