Getting [INFO] tlt.components.docker_handler.docker_handler: Stopping container. Why does this occur and how to fix it?

Description

I have made a new environment using conda and installed the prerequisites and TLT in it, following the TLT Quick Start Guide — Transfer Learning Toolkit 3.0 documentation. Since I am working in a conda env, I am not running any code in a Python virtual environment.

After installing the tlt package, I am trying to run the classification jupyter notebook given in the samples.

When I run the !tlt classification train -e $SPECS_DIR/classification_spec.cfg -r $USER_EXPERIMENT_DIR/output -k $KEY command, the docker container suddenly stops and outputs:

(I made some changes to print the formatted_command and volumes.)

2021-06-17 20:24:36,503 [INFO] root: Registry: ['nvcr.io']

formatted_command:  bash -c 'docker exec -it 17d69f83a4d3a69cc27e7c21506d3dcb364c7e98b4860d3ff17eb1c240a61aa5 classification train -e /workspace/tlt-experiments/classification/specs/classification_spec.cfg -r /workspace/tlt-experiments/classification/output -k nvidia_tlt'

volumes: {'/home/omno/Desktop/umair/naturalImages/tlt': {'bind': '/workspace/tlt-experiments', 'mode': 'rw'}, '/home/omno/Desktop/umair/naturalImages/tlt/specs': {'bind': '/workspace/tlt-experiments/classification/specs', 'mode': 'rw'}}
formatted_command: bash -c 'docker exec -it 17d69f83a4d3a69cc27e7c21506d3dcb364c7e98b4860d3ff17eb1c240a61aa5 classification train -e /workspace/tlt-experiments/classification/specs/classification_spec.cfg -r /workspace/tlt-experiments/classification/output -k nvidia_tlt'
Executing the command.
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-dq5h5g_n because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

['model_config', 'train_config']
2021-06-17 20:24:43,634 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Hi,
We recommend raising this query in the TLT forum for better assistance.

Thanks!

@1.umairjavaid
Can you attach the training spec file?
Also, could you please attach the full log as a .txt file?

conda activate newenv
jupyter notebook --allow-root

classification.ipynb (42.0 KB)
classification_spec.cfg (1.2 KB)

I have attached the jupyter notebook. How do I get the log file?

Not needed now; I can find the full log in your jupyter notebook.
Can you add a cell in the notebook and run the command below?
! tlt classification run cat $SPECS_DIR/classification_spec.cfg

Sorry for the late reply. I lost access to the system I was working on previously. I got a new system and carried out the same steps as described in the TLT Quick Start Guide.

I have made changes to the cfg file by adding the paths to my custom train and test datasets. But, unfortunately, TLT cannot access them. I am getting the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/home/omno/Desktop/umair/tlt-samples/classification/data/train'
2021-06-24 16:59:06,390 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Please, have a look at my notebook and cfg file
classification.ipynb (45.1 KB)
classification_spec.cfg (1.3 KB)

The docker container (or TLT) stops running suddenly when I run the following command:
! tlt classification run cat $SPECS_DIR/classification_spec.cfg

2021-06-24 17:10:24,797 [INFO] root: Registry: ['nvcr.io']
model_config {
  arch: "resnet",
  n_layers: 18
  # Setting these parameters to true to match the template downloaded from NGC.
  use_batch_norm: true
  all_projections: true
  freeze_blocks: 0
  freeze_blocks: 1
  input_image_size: "3,224,224"
}
train_config {
  train_dataset_path: "/home/omno/Desktop/umair/tlt-samples/classification/data/train"
  val_dataset_path: "/home/omno/Desktop/umair/tlt-samples/classification/data/test"
  pretrained_model_path: "/home/omno/Desktop/umair/tlt-samples/classification/pretrained_resnet18/tlt_pretrained_classification_vresnet18/resnet_18.hdf5"
  optimizer {
    sgd {
    lr: 0.01
    decay: 0.0
    momentum: 0.9
    nesterov: False
  }
}
  batch_size_per_gpu: 64
  n_epochs: 80
  n_workers: 16
  preprocess_mode: "caffe"
  enable_random_crop: True
  enable_center_crop: True
  label_smoothing: 0.0
  mixup_alpha: 0.1
  # regularizer
  reg_config {
    type: "L2"
    scope: "Conv2D,Dense"
    weight_decay: 0.00005
  }

  # learning_rate
  lr_config {
    step {
      learning_rate: 0.006
      step_size: 10
      gamma: 0.1
    }
  }
}
eval_config {
  eval_dataset_path: "/home/omno/Desktop/umair/tlt-samples/classification/data/test"
  model_path: "/workspace/tlt-experiments/classification/output/weights/resnet_080.tlt"
  top_k: 3
  batch_size: 256
  n_workers: 8
  enable_center_crop: True
}
2021-06-24 17:10:25,551 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

TLT should give an error explaining why this stopped working.

When you run the command below, it works well according to the log.
! tlt classification run cat $SPECS_DIR/classification_spec.cfg

But you get stuck when you run training because
FileNotFoundError: [Errno 2] No such file or directory: '/home/omno/Desktop/umair/tlt-samples/classification/data/train'

Please run the following command to check whether your training images folder is available.
! tlt classification run ls /home/omno/Desktop/umair/tlt-samples/classification/data/train |wc -l

Getting the following output when running
! tlt classification run ls /home/omno/Desktop/umair/tlt-samples/classification/data/train |wc -l

2021-06-24 17:43:43,917 [INFO] root: Registry: ['nvcr.io']
2021-06-24 17:43:44,652 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
1

How about
! tlt classification run ls /home/omno/Desktop/umair/tlt-samples/classification/data/train

Are there any images?

No images are listed when running ! tlt classification run ls /home/omno/Desktop/umair/tlt-samples/classification/data/train

2021-06-24 17:46:57,649 [INFO] root: Registry: ['nvcr.io']
ls: cannot access '/home/omno/Desktop/umair/tlt-samples/classification/data/train': No such file or directory
2021-06-24 17:46:58,406 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

So, please check whether the local directory is mounted into the docker container.
See TLT Launcher — Transfer Learning Toolkit 3.0 documentation

these drives/mount points need to be mapped to the docker. The launcher instance can be configured in the ~/.tlt_mounts.json file.
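As a sketch, a minimal ~/.tlt_mounts.json matching the paths used in this thread might look like the following (the host path is your own; "Mounts" is assumed here to be the top-level key the TLT launcher reads):

```json
{
    "Mounts": [
        {
            "source": "/home/omno/Desktop/umair/tlt-samples/classification",
            "destination": "/workspace/tlt-experiments"
        }
    ]
}
```

With this mapping, anything under the host folder becomes visible inside the container under /workspace/tlt-experiments.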

I made some changes to the paths. Have a look at these please
classification.ipynb (48.5 KB)

According to your tlt_mounts.json file,

  "source": "/home/omno/Desktop/umair/tlt-samples/classification",
  "destination": "/workspace/tlt-experiments"

Could you run below command to verify the images?
! tlt classification run ls /workspace/tlt-experiments/data/train

got this output

2021-06-24 17:53:20,687 [INFO] root: Registry: ['nvcr.io']
cardboard  glass  metal  paper	plastic  trash
2021-06-24 17:53:21,455 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Can you tell me what I am doing wrong?

So, you can access your images folder now. That is expected.
cardboard glass metal paper plastic trash

Is my model being trained? I don't see any weights being saved.

Please modify all the paths inside the training spec. For example, as above, change them according to the tlt_mounts.json.
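For example, assuming the mount shown earlier (host folder /home/omno/Desktop/umair/tlt-samples/classification mounted at /workspace/tlt-experiments), the dataset and model paths in the spec would become container-side paths:

```
train_config {
  train_dataset_path: "/workspace/tlt-experiments/data/train"
  val_dataset_path: "/workspace/tlt-experiments/data/test"
  pretrained_model_path: "/workspace/tlt-experiments/pretrained_resnet18/tlt_pretrained_classification_vresnet18/resnet_18.hdf5"
  # ... remaining train_config fields unchanged
}
```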

Could you explain why I am getting this error?
FileNotFoundError: [Errno 2] No such file or directory: '/home/omno/Desktop/umair/tlt-samples/classification/data/train'

As mentioned above, see TLT Launcher — Transfer Learning Toolkit 3.0 documentation,

Since the TLT launcher uses docker containers under the hood, these drives/mount points need to be mapped to the docker. The launcher instance can be configured in the ~/.tlt_mounts.json file.

In the command line, the path should be the “destination” path inside the docker.
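To make the mapping concrete, here is a small illustrative Python helper (not part of TLT; the function name is invented) that translates a host path into its in-container path using the source/destination pairs from tlt_mounts.json:

```python
def to_container_path(host_path, mounts):
    """Translate a host path to its in-container equivalent, if mounted."""
    for m in mounts:
        src, dst = m["source"], m["destination"]
        if host_path == src or host_path.startswith(src.rstrip("/") + "/"):
            # Replace the mounted prefix with the container-side destination.
            return dst.rstrip("/") + host_path[len(src.rstrip("/")):]
    # Path is not under any mount: the container cannot see it at all,
    # which is why ls reported "No such file or directory" above.
    return None

mounts = [{"source": "/home/omno/Desktop/umair/tlt-samples/classification",
           "destination": "/workspace/tlt-experiments"}]

print(to_container_path(
    "/home/omno/Desktop/umair/tlt-samples/classification/data/train", mounts))
# → /workspace/tlt-experiments/data/train
```

Passing the host path straight to a command inside the container fails for exactly this reason: the container only knows the "destination" side of each mount.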

There is also a simple alternative, for reference: you can set both paths to the same value.
For example,

"source": "/home/omno/Desktop/umair/tlt-samples/classification",
"destination": "/home/omno/Desktop/umair/tlt-samples/classification"


Thank you, this worked for me!