Can I run "tlt lprnet train" command inside a docker container

tfuru2 · July 7, 2021, 8:35am

Hi.
I have been learning how to use TLT 3.0 with the tlt_cv_samples_v1.1.0/lprnet Jupyter notebook. I’m running the notebook in a container of the nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3 image. Is it possible to run TLT training inside a docker container?
I get the following error at the notebook cell that runs tlt lprnet train. It seems that the error comes from the spec file failing to mount because the containers are nested. Thanks.

print("For multi-GPU, change --gpus based on your machine.")
!tlt lprnet train --gpus=1 --gpu_index=$GPU_INDEX \
                  -e $SPECS_DIR/tutorial_spec.txt \
                  -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
                  -k $KEY \
                  -m $USER_EXPERIMENT_DIR/pretrained_lprnet_baseline18/tlt_lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt

For multi-GPU, change --gpus based on your machine.
2021-07-07 07:13:38,923 [INFO] root: Registry: ['nvcr.io']
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2021-07-07 07:13:45,963 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2021-07-07 07:13:45,964 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:57: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2021-07-07 07:13:46,112 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:57: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:60: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2021-07-07 07:13:46,113 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:60: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:61: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.

2021-07-07 07:13:46,479 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:61: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.

2021-07-07 07:13:46,479 [INFO] iva.lprnet.utils.spec_loader: Merging specification from /workspace/tlt-experiments/lprnet/specs/tutorial_spec.txt
Traceback (most recent call last):
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 277, in <module>
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 273, in main
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py", line 64, in run_experiment
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/utils/spec_loader.py", line 126, in load_experiment_spec
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/utils/spec_loader.py", line 106, in load_proto
  File "/opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/utils/spec_loader.py", line 91, in _load_from_file
FileNotFoundError: [Errno 2] No such file or directory: '/workspace/tlt-experiments/lprnet/specs/tutorial_spec.txt'
2021-07-07 07:13:47,540 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Morganh · July 7, 2021, 12:43pm

Yes. You can log in to the docker container and run commands.

Please check your ~/.tlt_mounts.json file. I suspect there is something wrong with the path to the spec file inside the docker container.

tfuru2 · July 8, 2021, 5:13am

Thank you for your reply.
I understand that TLT 3.0 works inside a docker container. So I assume that the TLT container can run as “docker in docker”.

I checked my .tlt_mounts.json but could not find any incorrect setting.

Here is my .tlt_mounts.json

# cat ~/.tlt_mounts.json
{
    "Mounts": [
        {
            "source": "/data/tlt-experiments",
            "destination": "/workspace/tlt-experiments"
        },
        {
            "source": "/workspace/tlt_cv_samples_v1.1.0/lprnet/specs",
            "destination": "/workspace/tlt-experiments/lprnet/specs"
        }
    ],
    "DockerOptions": {
        "user": "0:0"
    }
}

I understand that the above mount definition results in the following path translation.

TLT container path: /workspace/tlt-experiments/lprnet/specs/tutorial_spec.txt
Host path (my container path): /workspace/tlt_cv_samples_v1.1.0/lprnet/specs/tutorial_spec.txt
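
The path translation described above can be sketched in Python. This is only an illustration under stated assumptions: the helper name container_to_source is hypothetical, and choosing the longest matching destination for nested mounts is an assumption about how the more specific mount takes precedence, not a description of the TLT launcher's internals.

```python
import posixpath

# Mount table copied from the ~/.tlt_mounts.json shown above.
MOUNTS = [
    {"source": "/data/tlt-experiments",
     "destination": "/workspace/tlt-experiments"},
    {"source": "/workspace/tlt_cv_samples_v1.1.0/lprnet/specs",
     "destination": "/workspace/tlt-experiments/lprnet/specs"},
]


def container_to_source(path, mounts):
    """Map a path inside the TLT container to the matching mount source path.

    Hypothetical helper: picks the mount whose destination is the longest
    prefix of `path` (assumption), and returns None if no mount covers it.
    """
    best = None
    for mount in mounts:
        dest = mount["destination"].rstrip("/")
        if path == dest or path.startswith(dest + "/"):
            if best is None or len(dest) > len(best["destination"].rstrip("/")):
                best = mount
    if best is None:
        return None
    relative = posixpath.relpath(path, best["destination"])
    return posixpath.normpath(posixpath.join(best["source"], relative))


spec_in_container = "/workspace/tlt-experiments/lprnet/specs/tutorial_spec.txt"
print(container_to_source(spec_in_container, MOUNTS))
```

Note that the returned source path is interpreted by whichever machine the Docker daemon runs on, so a file that exists at that path inside a notebook container is not necessarily visible to the daemon.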

I confirmed that the training spec file exists at the path I expected.

# cat /workspace/tlt_cv_samples_v1.1.0/lprnet/specs/tutorial_spec.txt
random_seed: 42
lpr_config {
  hidden_units: 512
  max_label_length: 8
  arch: "baseline"
  nlayers: 18 #setting nlayers to be 10 to use baseline10 model
}
training_config {
  batch_size_per_gpu: 32
  num_epochs: 24
  learning_rate {
  soft_start_annealing_schedule {
    min_learning_rate: 1e-6
    max_learning_rate: 1e-5
    soft_start: 0.001
    annealing: 0.5
  }
  }
  regularizer {
    type: L2
    weight: 5e-4
  }
}
eval_config {
  validation_period_during_training: 5
  batch_size: 1
}
augmentation_config {
    output_width: 96
    output_height: 48
    output_channel: 3
    max_rotate_degree: 5
    rotate_prob: 0.5
    gaussian_kernel_size: 5
    gaussian_kernel_size: 7
    gaussian_kernel_size: 15
    blur_prob: 0.5
    reverse_color_prob: 0.5
    keep_original_prob: 0.3
}
dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/openalpr/train/label"
    image_directory_path: "/workspace/tlt-experiments/data/openalpr/train/image"
  }
  characters_list_file: "/workspace/examples/lprnet/specs/us_lp_characters.txt"
  validation_data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/openalpr/val/label"
    image_directory_path: "/workspace/tlt-experiments/data/openalpr/val/image"
  }
}

Thanks.

Morganh · July 8, 2021, 5:31am

No, by default the TLT container does not run “docker in docker”. TLT 3.0 runs only as a docker container.
By default, the TLT 3.0 docker image does not contain /workspace/examples/lprnet/specs/us_lp_characters.txt. This is a minor issue in the lprnet spec file. You can run the command below to check.
$ tlt lprnet run ls /workspace/examples/lprnet/specs/us_lp_characters.txt
So, please download us_lp_characters.txt from the NVIDIA-AI-IOT/deepstream_tao_apps repository on GitHub (master branch)
and modify the training spec file to point to it. Thanks.

tfuru2 · July 8, 2021, 8:48am

Thank you for your advice.

I assume that my environment has the structure shown in the diagram below. So I assume that the container launched from TLT is “docker in docker”.

I found us_lp_characters.txt in the /workspace/tlt_cv_samples_v1.1.0/lprnet/specs directory.
So I modified dataset_config in tutorial_spec.txt as follows, but I still get the same error that I reported above.

dataset_config {
  data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/openalpr/train/label"
    image_directory_path: "/workspace/tlt-experiments/data/openalpr/train/image"
  }
  characters_list_file: " /workspace/tlt-experiments/lprnet/specs/us_lp_characters.txt"
  validation_data_sources: {
    label_directory_path: "/workspace/tlt-experiments/data/openalpr/val/label"
    image_directory_path: "/workspace/tlt-experiments/data/openalpr/val/image"
  }
}

Thanks.

Morganh · July 8, 2021, 9:13am

Based on your ~/.tlt_mounts.json file, could you run the command below and share the result?
$ tlt lprnet run ls /workspace/tlt-experiments/lprnet/specs

tfuru2 · July 8, 2021, 9:25am

No file found.

# tlt lprnet run ls /workspace/tlt-experiments/lprnet/specs
2021-07-08 09:22:51,490 [INFO] root: Registry: ['nvcr.io']
2021-07-08 09:22:52,970 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Thanks.

Morganh · July 8, 2021, 9:27am

Are the files available in your local directory?
$ ls /workspace/tlt_cv_samples_v1.1.0/lprnet/specs

tfuru2 · July 9, 2021, 12:14am

Yes, I have them.

# ls /workspace/tlt_cv_samples_v1.1.0/lprnet/specs
tutorial_spec.txt  tutorial_spec_scratch.txt  us_lp_characters.txt

Please note that the above ls command was not executed on the real local machine but inside my container, which I drew as the blue rounded rectangle in the diagram I posted yesterday.
Searching on Google, I found that many people say volume mounts for “docker in docker” do not work as they expected. Some say that the inner docker container does not mount the volume from the outer docker container but the volume from the real host (shown as the orange rectangle in my diagram).
Because I’m not a system administrator, I cannot inspect the file systems on the real host.

Thanks.

Morganh · July 9, 2021, 2:31am

So, can you modify your ~/.tlt_mounts.json?
Change from

    {
        "source": "/workspace/tlt_cv_samples_v1.1.0/lprnet/specs",
        "destination": "/workspace/tlt-experiments/lprnet/specs"
    }

to

    {
        "source": "your-local-path-to-lprnet-specs",
        "destination": "/workspace/tlt-experiments/lprnet/specs"
    }
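
The change suggested above could also be applied programmatically. The following is a minimal Python sketch, not part of TLT: set_mount_source is a hypothetical helper, and the replacement source path is a placeholder that must be set to the real directory on the machine where the Docker daemon runs.

```python
import json


def set_mount_source(config, destination, new_source):
    """Return a copy of the mounts config with the source for `destination` replaced.

    Hypothetical helper for illustration; the input dict is left unchanged.
    """
    updated = json.loads(json.dumps(config))  # cheap deep copy of JSON data
    for mount in updated.get("Mounts", []):
        if mount["destination"] == destination:
            mount["source"] = new_source
    return updated


# Config as posted earlier in this thread.
config = {
    "Mounts": [
        {"source": "/data/tlt-experiments",
         "destination": "/workspace/tlt-experiments"},
        {"source": "/workspace/tlt_cv_samples_v1.1.0/lprnet/specs",
         "destination": "/workspace/tlt-experiments/lprnet/specs"},
    ],
    "DockerOptions": {"user": "0:0"},
}

# "your-local-path-to-lprnet-specs" is the placeholder from the post above.
new_config = set_mount_source(
    config,
    "/workspace/tlt-experiments/lprnet/specs",
    "your-local-path-to-lprnet-specs",
)
print(json.dumps(new_config, indent=4))
```

Writing the result back to ~/.tlt_mounts.json is left out on purpose, so that the sketch cannot overwrite a working configuration.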

tfuru2 · July 9, 2021, 6:00am

I figured out how to use TLT 3.0 inside a docker container.

I needed to run TLT with just the lprnet command, without “tlt” preceding “lprnet”.
Then I needed to pass the local directories to the lprnet command.

!lprnet train --gpus=1 --gpu_index=$GPU_INDEX \
                  -e $LOCAL_SPECS_DIR/tutorial_spec.txt \
                  -r $LOCAL_EXPERIMENT_DIR/experiment_dir_unpruned \
                  -k $KEY \
                  -m $LOCAL_EXPERIMENT_DIR/pretrained_lprnet_baseline18/tlt_lprnet_vtrainable_v1.0/us_lprnet_baseline18_trainable.tlt

I also modified the data paths in the spec file to use local paths.

dataset_config {
  data_sources: {
    label_directory_path: "/data/tlt-experiments/data/openalpr/train/label"
    image_directory_path: "/data/tlt-experiments/data/openalpr/train/image"
  }
  characters_list_file: "/workspace/tlt_cv_samples_v1.1.0/lprnet_debug/specs/us_lp_characters.txt"
  validation_data_sources: {
    label_directory_path: "/data/tlt-experiments/data/openalpr/val/label"
    image_directory_path: "/data/tlt-experiments/data/openalpr/val/image"
  }
}
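
Rewriting the data paths by hand is easy to get wrong, so here is a minimal Python sketch of the same substitution. It assumes the only change needed is swapping the container prefix /workspace/tlt-experiments for the local prefix /data/tlt-experiments on quoted paths; localize_spec_paths is a hypothetical helper, and characters_list_file still has to be set separately.

```python
import re


def localize_spec_paths(spec_text, prefix_map):
    """Replace container path prefixes with local ones in quoted spec paths.

    Hypothetical helper: only rewrites a prefix that directly follows an
    opening double quote and is followed by "/" or the closing quote.
    """
    prefixes = sorted(prefix_map, key=len, reverse=True)  # longest prefix first
    pattern = re.compile(
        '"(' + "|".join(re.escape(p) for p in prefixes) + r')(?=[/"])'
    )
    return pattern.sub(lambda m: '"' + prefix_map[m.group(1)], spec_text)


original = (
    'label_directory_path: "/workspace/tlt-experiments/data/openalpr/train/label"\n'
    'image_directory_path: "/workspace/tlt-experiments/data/openalpr/train/image"\n'
)
localized = localize_spec_paths(
    original, {"/workspace/tlt-experiments": "/data/tlt-experiments"}
)
print(localized)
```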

It seemed that the h5py version in the TLT 3.0 container did not work in my environment, so I reinstalled h5py.

!pip3 show h5py
!pip3 uninstall h5py -y
!pip3 install h5py==2.10.0

Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2021-07-09 05:36:27,971 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2021-07-09 05:36:27,971 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:57: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

2021-07-09 05:36:28,111 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:57: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:60: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2021-07-09 05:36:28,111 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:60: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

WARNING:tensorflow:From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:61: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.

2021-07-09 05:36:28,438 [WARNING] tensorflow: From /opt/tlt/.cache/dazel/_dazel_tlt/2b81a5aac84a1d3b7a324f2a7a6f400b/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/lprnet/scripts/train.py:61: The name tf.keras.backend.set_session is deprecated. Please use tf.compat.v1.keras.backend.set_session instead.

2021-07-09 05:36:28,438 [INFO] iva.lprnet.utils.spec_loader: Merging specification from /workspace/tlt_cv_samples_v1.1.0/lprnet_debug/specs/tutorial_spec.txt
2021-07-09 05:36:28,439 [INFO] __main__: Loading pretrained weights. This may take a while...
Initialize optimizer
Model: "lpnet_baseline_18"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
image_input (InputLayer)        [(None, 3, 48, 96)]  0                                            
__________________________________________________________________________________________________
tf_op_layer_Sum (TensorFlowOpLa [(None, 1, 48, 96)]  0           image_input[0][0]                
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 64, 48, 96)   640         tf_op_layer_Sum[0][0]            
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, 64, 48, 96)   256         conv1[0][0]                      
__________________________________________________________________________________________________
re_lu (ReLU)                    (None, 64, 48, 96)   0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
max_pooling2d (MaxPooling2D)    (None, 64, 48, 96)   0           re_lu[0][0]                      
__________________________________________________________________________________________________
res2a_branch2a (Conv2D)         (None, 64, 48, 96)   36928       max_pooling2d[0][0]              
__________________________________________________________________________________________________
bn2a_branch2a (BatchNormalizati (None, 64, 48, 96)   256         res2a_branch2a[0][0]             
__________________________________________________________________________________________________
re_lu_1 (ReLU)                  (None, 64, 48, 96)   0           bn2a_branch2a[0][0]              
__________________________________________________________________________________________________
res2a_branch1 (Conv2D)          (None, 64, 48, 96)   4160        max_pooling2d[0][0]              
__________________________________________________________________________________________________
res2a_branch2b (Conv2D)         (None, 64, 48, 96)   36928       re_lu_1[0][0]                    
__________________________________________________________________________________________________
bn2a_branch1 (BatchNormalizatio (None, 64, 48, 96)   256         res2a_branch1[0][0]              
__________________________________________________________________________________________________
bn2a_branch2b (BatchNormalizati (None, 64, 48, 96)   256         res2a_branch2b[0][0]             
__________________________________________________________________________________________________
tf_op_layer_add (TensorFlowOpLa [(None, 64, 48, 96)] 0           bn2a_branch1[0][0]               
                                                                 bn2a_branch2b[0][0]              
__________________________________________________________________________________________________
re_lu_2 (ReLU)                  (None, 64, 48, 96)   0           tf_op_layer_add[0][0]            
__________________________________________________________________________________________________
res2b_branch2a (Conv2D)         (None, 64, 48, 96)   36928       re_lu_2[0][0]                    
__________________________________________________________________________________________________
bn2b_branch2a (BatchNormalizati (None, 64, 48, 96)   256         res2b_branch2a[0][0]             
__________________________________________________________________________________________________
re_lu_3 (ReLU)                  (None, 64, 48, 96)   0           bn2b_branch2a[0][0]              
__________________________________________________________________________________________________
res2b_branch2b (Conv2D)         (None, 64, 48, 96)   36928       re_lu_3[0][0]                    
__________________________________________________________________________________________________
bn2b_branch2b (BatchNormalizati (None, 64, 48, 96)   256         res2b_branch2b[0][0]             
__________________________________________________________________________________________________
tf_op_layer_add_1 (TensorFlowOp [(None, 64, 48, 96)] 0           re_lu_2[0][0]                    
                                                                 bn2b_branch2b[0][0]              
__________________________________________________________________________________________________
re_lu_4 (ReLU)                  (None, 64, 48, 96)   0           tf_op_layer_add_1[0][0]          
__________________________________________________________________________________________________
res3a_branch2a (Conv2D)         (None, 128, 24, 48)  73856       re_lu_4[0][0]                    
__________________________________________________________________________________________________
bn3a_branch2a (BatchNormalizati (None, 128, 24, 48)  512         res3a_branch2a[0][0]             
__________________________________________________________________________________________________
re_lu_5 (ReLU)                  (None, 128, 24, 48)  0           bn3a_branch2a[0][0]              
__________________________________________________________________________________________________
res3a_branch1 (Conv2D)          (None, 128, 24, 48)  8320        re_lu_4[0][0]                    
__________________________________________________________________________________________________
res3a_branch2b (Conv2D)         (None, 128, 24, 48)  147584      re_lu_5[0][0]                    
__________________________________________________________________________________________________
bn3a_branch1 (BatchNormalizatio (None, 128, 24, 48)  512         res3a_branch1[0][0]              
__________________________________________________________________________________________________
bn3a_branch2b (BatchNormalizati (None, 128, 24, 48)  512         res3a_branch2b[0][0]             
__________________________________________________________________________________________________
tf_op_layer_add_2 (TensorFlowOp [(None, 128, 24, 48) 0           bn3a_branch1[0][0]               
                                                                 bn3a_branch2b[0][0]              
__________________________________________________________________________________________________
re_lu_6 (ReLU)                  (None, 128, 24, 48)  0           tf_op_layer_add_2[0][0]          
__________________________________________________________________________________________________
res3b_branch2a (Conv2D)         (None, 128, 24, 48)  147584      re_lu_6[0][0]                    
__________________________________________________________________________________________________
bn3b_branch2a (BatchNormalizati (None, 128, 24, 48)  512         res3b_branch2a[0][0]             
__________________________________________________________________________________________________
re_lu_7 (ReLU)                  (None, 128, 24, 48)  0           bn3b_branch2a[0][0]              
__________________________________________________________________________________________________
res3b_branch2b (Conv2D)         (None, 128, 24, 48)  147584      re_lu_7[0][0]                    
__________________________________________________________________________________________________
bn3b_branch2b (BatchNormalizati (None, 128, 24, 48)  512         res3b_branch2b[0][0]             
__________________________________________________________________________________________________
tf_op_layer_add_3 (TensorFlowOp [(None, 128, 24, 48) 0           re_lu_6[0][0]                    
                                                                 bn3b_branch2b[0][0]              
__________________________________________________________________________________________________
re_lu_8 (ReLU)                  (None, 128, 24, 48)  0           tf_op_layer_add_3[0][0]          
__________________________________________________________________________________________________
res4a_branch2a (Conv2D)         (None, 256, 12, 24)  295168      re_lu_8[0][0]                    
__________________________________________________________________________________________________
bn4a_branch2a (BatchNormalizati (None, 256, 12, 24)  1024        res4a_branch2a[0][0]             
__________________________________________________________________________________________________
re_lu_9 (ReLU)                  (None, 256, 12, 24)  0           bn4a_branch2a[0][0]              
__________________________________________________________________________________________________
res4a_branch1 (Conv2D)          (None, 256, 12, 24)  33024       re_lu_8[0][0]                    
__________________________________________________________________________________________________
res4a_branch2b (Conv2D)         (None, 256, 12, 24)  590080      re_lu_9[0][0]                    
__________________________________________________________________________________________________
bn4a_branch1 (BatchNormalizatio (None, 256, 12, 24)  1024        res4a_branch1[0][0]              
__________________________________________________________________________________________________
bn4a_branch2b (BatchNormalizati (None, 256, 12, 24)  1024        res4a_branch2b[0][0]             
__________________________________________________________________________________________________
tf_op_layer_add_4 (TensorFlowOp [(None, 256, 12, 24) 0           bn4a_branch1[0][0]               
                                                                 bn4a_branch2b[0][0]              
__________________________________________________________________________________________________
re_lu_10 (ReLU)                 (None, 256, 12, 24)  0           tf_op_layer_add_4[0][0]          
__________________________________________________________________________________________________
res4b_branch2a (Conv2D)         (None, 256, 12, 24)  590080      re_lu_10[0][0]                   
__________________________________________________________________________________________________
bn4b_branch2a (BatchNormalizati (None, 256, 12, 24)  1024        res4b_branch2a[0][0]             
__________________________________________________________________________________________________
re_lu_11 (ReLU)                 (None, 256, 12, 24)  0           bn4b_branch2a[0][0]              
__________________________________________________________________________________________________
res4b_branch2b (Conv2D)         (None, 256, 12, 24)  590080      re_lu_11[0][0]                   
__________________________________________________________________________________________________
bn4b_branch2b (BatchNormalizati (None, 256, 12, 24)  1024        res4b_branch2b[0][0]             
__________________________________________________________________________________________________
tf_op_layer_add_5 (TensorFlowOp [(None, 256, 12, 24) 0           re_lu_10[0][0]                   
                                                                 bn4b_branch2b[0][0]              
__________________________________________________________________________________________________
re_lu_12 (ReLU)                 (None, 256, 12, 24)  0           tf_op_layer_add_5[0][0]          
__________________________________________________________________________________________________
res5a_branch2a (Conv2D)         (None, 300, 12, 24)  691500      re_lu_12[0][0]                   
__________________________________________________________________________________________________
bn5a_branch2a (BatchNormalizati (None, 300, 12, 24)  1200        res5a_branch2a[0][0]             
__________________________________________________________________________________________________
re_lu_13 (ReLU)                 (None, 300, 12, 24)  0           bn5a_branch2a[0][0]              
__________________________________________________________________________________________________
res5a_branch1 (Conv2D)          (None, 300, 12, 24)  77100       re_lu_12[0][0]                   
__________________________________________________________________________________________________
res5a_branch2b (Conv2D)         (None, 300, 12, 24)  810300      re_lu_13[0][0]                   
__________________________________________________________________________________________________
bn5a_branch1 (BatchNormalizatio (None, 300, 12, 24)  1200        res5a_branch1[0][0]              
__________________________________________________________________________________________________
bn5a_branch2b (BatchNormalizati (None, 300, 12, 24)  1200        res5a_branch2b[0][0]             
__________________________________________________________________________________________________
tf_op_layer_add_6 (TensorFlowOp [(None, 300, 12, 24) 0           bn5a_branch1[0][0]               
                                                                 bn5a_branch2b[0][0]              
__________________________________________________________________________________________________
re_lu_14 (ReLU)                 (None, 300, 12, 24)  0           tf_op_layer_add_6[0][0]          
__________________________________________________________________________________________________
res5b_branch2a (Conv2D)         (None, 300, 12, 24)  810300      re_lu_14[0][0]                   
__________________________________________________________________________________________________
bn5b_branch2a (BatchNormalizati (None, 300, 12, 24)  1200        res5b_branch2a[0][0]             
__________________________________________________________________________________________________
re_lu_15 (ReLU)                 (None, 300, 12, 24)  0           bn5b_branch2a[0][0]              
__________________________________________________________________________________________________
res5b_branch2b (Conv2D)         (None, 300, 12, 24)  810300      re_lu_15[0][0]                   
__________________________________________________________________________________________________
bn5b_branch2b (BatchNormalizati (None, 300, 12, 24)  1200        res5b_branch2b[0][0]             
__________________________________________________________________________________________________
tf_op_layer_add_7 (TensorFlowOp [(None, 300, 12, 24) 0           re_lu_14[0][0]                   
                                                                 bn5b_branch2b[0][0]              
__________________________________________________________________________________________________
re_lu_16 (ReLU)                 (None, 300, 12, 24)  0           tf_op_layer_add_7[0][0]          
__________________________________________________________________________________________________
permute_feature (Permute)       (None, 24, 12, 300)  0           re_lu_16[0][0]                   
__________________________________________________________________________________________________
flatten_feature (Reshape)       (None, 24, 3600)     0           permute_feature[0][0]            
__________________________________________________________________________________________________
lstm (LSTM)                     (None, 24, 512)      8423424     flatten_feature[0][0]            
__________________________________________________________________________________________________
td_dense (TimeDistributed)      (None, 24, 36)       18468       lstm[0][0]                       
__________________________________________________________________________________________________
softmax (Softmax)               (None, 24, 36)       0           td_dense[0][0]                   
==================================================================================================
Total params: 14,432,480
Trainable params: 14,424,872
Non-trainable params: 7,608
__________________________________________________________________________________________________
2021-07-09 05:36:50,237 [INFO] __main__: Number of images in the training dataset:	   111
2021-07-09 05:36:50,237 [INFO] __main__: Number of images in the validation dataset:	   110
Epoch 1/24
1/4 [======>.......................] - ETA: 21s - loss: 1.2432
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.829507). Check your callbacks.
2021-07-09 05:36:58,377 [WARNING] tensorflow: Method (on_train_batch_end) is slow compared to the batch update (0.829507). Check your callbacks.
3/4 [=====================>........] - ETA: 2s - loss: 0.8570
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Bootstrap : Using [0]lo:127.0.0.1<0> [1]eth0:10.244.58.228<0>
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO NET/IB : No device found.
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:10.244.58.228<0>
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 00/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 01/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 02/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 03/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 04/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 05/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 06/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 07/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 08/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 09/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 10/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 11/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 12/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 13/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 14/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 15/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 16/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 17/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 18/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 19/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 20/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 21/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 22/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 23/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 24/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 25/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 26/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 27/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 28/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 29/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 30/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Channel 31/32 :    0
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [1] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [2] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [3] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [4] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [5] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [6] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [7] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [8] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [9] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [10] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [11] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [12] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [13] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [14] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [15] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [16] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [17] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [18] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [19] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [20] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [21] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [22] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [23] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [24] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [25] -1/-1/-1->0->-1|-1->0->-1/-1/-1 [26] -1/-1/-1
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO Setting affinity for GPU 0 to ffffff00,0000ffff,ff000000
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
tlt-furuse-7bc99fc9f4-mzqtk:5760:5892 [0] NCCL INFO comm 0x7fdd36f61a50 rank 0 nranks 1 cudaDev 0 busId e1000 - Init COMPLETE

Epoch 00001: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-01.tlt
4/4 [==============================] - 18s 5s/step - loss: 0.7808
Epoch 2/24
3/4 [=====================>........] - ETA: 0s - loss: 0.9575
Epoch 00002: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-02.tlt
4/4 [==============================] - 2s 459ms/step - loss: 0.7885
Epoch 3/24
3/4 [=====================>........] - ETA: 0s - loss: 0.9168
Epoch 00003: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-03.tlt
4/4 [==============================] - 1s 333ms/step - loss: 1.0077
Epoch 4/24
3/4 [=====================>........] - ETA: 0s - loss: 0.9096
Epoch 00004: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-04.tlt
4/4 [==============================] - 1s 341ms/step - loss: 0.8356
Epoch 5/24
3/4 [=====================>........] - ETA: 0s - loss: 0.4920
Epoch 00005: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-05.tlt


*******************************************
Accuracy: 98 / 110  0.8909090909090909
*******************************************


4/4 [==============================] - 6s 2s/step - loss: 0.6113
Epoch 6/24
3/4 [=====================>........] - ETA: 0s - loss: 0.7390
Epoch 00006: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-06.tlt
4/4 [==============================] - 1s 357ms/step - loss: 0.7515
Epoch 7/24
3/4 [=====================>........] - ETA: 0s - loss: 0.3931
Epoch 00007: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-07.tlt
4/4 [==============================] - 1s 345ms/step - loss: 0.3741
Epoch 8/24
3/4 [=====================>........] - ETA: 0s - loss: 0.4324
Epoch 00008: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-08.tlt
4/4 [==============================] - 1s 344ms/step - loss: 0.4426
Epoch 9/24
3/4 [=====================>........] - ETA: 0s - loss: 0.3006
Epoch 00009: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-09.tlt
4/4 [==============================] - 1s 355ms/step - loss: 0.2792
Epoch 10/24
3/4 [=====================>........] - ETA: 0s - loss: 0.5126
Epoch 00010: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-10.tlt


*******************************************
Accuracy: 100 / 110  0.9090909090909091
*******************************************


4/4 [==============================] - 4s 1s/step - loss: 0.4461
Epoch 11/24
3/4 [=====================>........] - ETA: 0s - loss: 0.4042
Epoch 00011: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-11.tlt
4/4 [==============================] - 1s 327ms/step - loss: 0.3477
Epoch 12/24
3/4 [=====================>........] - ETA: 0s - loss: 0.3892
Epoch 00012: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-12.tlt
4/4 [==============================] - 1s 329ms/step - loss: 0.3544
Epoch 13/24
3/4 [=====================>........] - ETA: 0s - loss: 0.4416
Epoch 00013: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-13.tlt
4/4 [==============================] - 1s 334ms/step - loss: 0.3674
Epoch 14/24
3/4 [=====================>........] - ETA: 0s - loss: 0.2404
Epoch 00014: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-14.tlt
4/4 [==============================] - 1s 334ms/step - loss: 0.2857
Epoch 15/24
3/4 [=====================>........] - ETA: 0s - loss: 0.6028
Epoch 00015: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-15.tlt


*******************************************
Accuracy: 100 / 110  0.9090909090909091
*******************************************


4/4 [==============================] - 4s 1s/step - loss: 0.4996
Epoch 16/24
3/4 [=====================>........] - ETA: 0s - loss: 0.3446
Epoch 00016: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-16.tlt
4/4 [==============================] - 1s 348ms/step - loss: 0.3084
Epoch 17/24
3/4 [=====================>........] - ETA: 0s - loss: 0.1858
Epoch 00017: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-17.tlt
4/4 [==============================] - 1s 339ms/step - loss: 0.2053
Epoch 18/24
3/4 [=====================>........] - ETA: 0s - loss: 0.8404
Epoch 00018: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-18.tlt
4/4 [==============================] - 1s 335ms/step - loss: 0.7287
Epoch 19/24
3/4 [=====================>........] - ETA: 0s - loss: 0.2634
Epoch 00019: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-19.tlt
4/4 [==============================] - 1s 332ms/step - loss: 0.2422
Epoch 20/24
3/4 [=====================>........] - ETA: 0s - loss: 0.5270
Epoch 00020: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-20.tlt


*******************************************
Accuracy: 100 / 110  0.9090909090909091
*******************************************


4/4 [==============================] - 4s 1s/step - loss: 0.4482
Epoch 21/24
3/4 [=====================>........] - ETA: 0s - loss: 0.4593
Epoch 00021: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-21.tlt
4/4 [==============================] - 1s 340ms/step - loss: 0.3884
Epoch 22/24
3/4 [=====================>........] - ETA: 0s - loss: 0.2618
Epoch 00022: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-22.tlt
4/4 [==============================] - 1s 334ms/step - loss: 0.2449
Epoch 23/24
3/4 [=====================>........] - ETA: 0s - loss: 0.2673
Epoch 00023: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-23.tlt
4/4 [==============================] - 1s 334ms/step - loss: 0.2799
Epoch 24/24
3/4 [=====================>........] - ETA: 0s - loss: 0.2546
Epoch 00024: saving model to /data/tlt-experiments/lprnet/experiment_dir_unpruned/weights/lprnet_epoch-24.tlt
4/4 [==============================] - 1s 333ms/step - loss: 0.2980


*******************************************
Accuracy: 101 / 110  0.9181818181818182
*******************************************

Thanks.

system · September 7, 2021, 6:00am

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
