Segmentation with UNet: shape error

Hello.
I'm trying semantic segmentation with TLT v3 on a custom dataset.
I use a ResNet-18 backbone, but when I launch training I get a shape error.
To quote the documentation (Data Input for Semantic Segmentation — Transfer Learning Toolkit 3.0 documentation) :

The size of the images need not necessarily be equal to the model input dimensions. The images are resized internally to model input dimensions

However, training fails with a shape mismatch: ValueError: generator yielded an element of shape (185, 189, 3) where an element of shape (380, 356, 3) was expected (see the full log below).
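Neither shape in the error matches the 320x320 model input from the spec, which may mean the data loader took its expected shape from one sample and then met an image of a different size. That is only a guess about the loader's behaviour. A minimal sketch for checking whether the dataset really contains mixed sizes, and whether every image has a mask of the same size, could look like this. It assumes Pillow is installed and that images and masks share file stems; the helper names (`read_size`, `scan`, `count_sizes`, `find_mismatches`) are hypothetical and not part of TLT.

```python
from collections import Counter
from pathlib import Path


def read_size(path):
    """Return (height, width) of one image file. Requires Pillow (assumed installed)."""
    from PIL import Image  # imported lazily so the pure helpers below work without Pillow

    with Image.open(path) as img:
        width, height = img.size  # Pillow reports size as (width, height)
    return (height, width)


def scan(folder):
    """Map file stem -> (height, width) for every file directly inside folder."""
    return {p.stem: read_size(p) for p in sorted(Path(folder).iterdir()) if p.is_file()}


def count_sizes(sizes):
    """Count how many files share each (height, width)."""
    return Counter(sizes.values())


def find_mismatches(image_sizes, mask_sizes):
    """Return stems whose mask is missing or differs in size from the image."""
    bad = {}
    for stem, size in image_sizes.items():
        mask_size = mask_sizes.get(stem)  # None when no mask has this stem
        if mask_size != size:
            bad[stem] = (size, mask_size)
    return bad


# Example usage with the paths from the spec (uncomment to run on the dataset):
# images = scan("/data1/TrainVal_images/TrainVal_images/train_images/")
# masks = scan("/data1/TrainVal_parsing_annotations/TrainVal_parsing_annotations/train_segmentations/")
# print(count_sizes(images).most_common(5))
# print(len(find_mismatches(images, masks)), "image/mask pairs disagree")
```

If `count_sizes` reports more than one size, the dataset is not uniform, which would be consistent with the error above.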

My training command:

tlt unet train --gpus=1 \
  -e /workspace/tlt-experiments/specs/resnet18.txt \
  -r /output/runs/resnet18_run1 \
  -m /output/pretrained_resnet18/tlt_semantic_segmentation_vresnet18/resnet_18.hdf5 \
  -n resnet18_lip \
  -k $KEY
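One possible workaround to try before re-running the command above is to resize every image and mask offline to the model input size from the spec (model_input_width and model_input_height, both 320). The sketch below assumes Pillow is installed; `output_path`, `resize_file` and `resize_tree` are hypothetical helper names, and whether TLT then accepts the data is untested. Note that resizing to a fixed 320x320 changes the aspect ratio of non-square images.

```python
from pathlib import Path

MODEL_INPUT_WIDTH = 320   # model_input_width in the spec
MODEL_INPUT_HEIGHT = 320  # model_input_height in the spec


def output_path(src, src_root, dst_root):
    """Mirror src (located under src_root) into dst_root, keeping the relative path."""
    return Path(dst_root) / Path(src).relative_to(src_root)


def resize_file(src, dst, is_mask):
    """Resize one file to the model input size. Requires Pillow (assumed installed)."""
    from PIL import Image  # imported lazily so output_path works without Pillow

    # Nearest-neighbour keeps mask pixels as valid label ids (0..19 here);
    # bilinear interpolation is acceptable for the colour images.
    resample = Image.NEAREST if is_mask else Image.BILINEAR
    with Image.open(src) as img:
        resized = img.resize((MODEL_INPUT_WIDTH, MODEL_INPUT_HEIGHT), resample=resample)
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    resized.save(dst)


def resize_tree(src_root, dst_root, is_mask):
    """Resize every file under src_root into the same relative location under dst_root."""
    for src in sorted(Path(src_root).rglob("*")):
        if src.is_file():
            resize_file(src, output_path(src, src_root, dst_root), is_mask)
```

The image folders would be processed with `is_mask=False` and the segmentation folders with `is_mask=True`, and the spec paths would then point at the resized copies.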

Full traceback and logs:

Matplotlib created a temporary config/cache directory at /tmp/matplotlib-cqcmse4k because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/checkpoint_saver_hook.py:21: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py:23: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py:23: The name tf.logging.WARN is deprecated. Please use tf.compat.v1.logging.WARN instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py:389: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

Loading experiment spec at /workspace/tlt-experiments/specs/resnet18.txt.
2021-04-23 13:55:05,679 [INFO] __main__: Loading experiment spec at /workspace/tlt-experiments/specs/resnet18.txt.
2021-04-23 13:55:05,681 [INFO] iva.unet.spec_handler.spec_loader: Merging specification from /workspace/tlt-experiments/specs/resnet18.txt
2021-04-23 13:55:05,690 [INFO] root: Initializing the pre-trained weights from /output/pretrained_resnet18/tlt_semantic_segmentation_vresnet18/resnet_18.hdf5
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

2021-04-23 13:55:05,696 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

2021-04-23 13:55:05,705 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

2021-04-23 13:55:05,726 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

2021-04-23 13:55:05,731 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

2021-04-23 13:55:06,470 [WARNING] tensorflow: From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2021-04-23 13:55:06,761 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

2021-04-23 13:55:06,761 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2021-04-23 13:55:06,920 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

2021-04-23 13:55:07,386 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

INFO:tensorflow:Using config: {'_model_dir': '/output/runs/resnet18_run1', '_tf_random_seed': None, '_save_summary_steps': 5, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 38
gpu_options {
  allow_growth: true
  visible_device_list: "0"
  force_gpu_compatible: true
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9c90e25208>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2021-04-23 13:55:07,409 [INFO] tensorflow: Using config: {'_model_dir': '/output/runs/resnet18_run1', '_tf_random_seed': None, '_save_summary_steps': 5, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 38
gpu_options {
  allow_growth: true
  visible_device_list: "0"
  force_gpu_compatible: true
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9c90e25208>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2021-04-23 13:55:07,502 [INFO] iva.unet.model.utilities: The total number of training samples 30462 and the batch size per                 GPU 64
2021-04-23 13:55:07,502 [INFO] iva.unet.model.utilities: Cannot iterate over exactly 30462 samples with a batch size of 64; each epoch will therefore take one extra step.
2021-04-23 13:55:07,502 [INFO] iva.unet.model.utilities: Steps per epoch taken: 476
Running for 1 Epochs
2021-04-23 13:55:07,502 [INFO] __main__: Running for 1 Epochs
INFO:tensorflow:Create CheckpointSaverHook.
2021-04-23 13:55:07,502 [INFO] tensorflow: Create CheckpointSaverHook.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

2021-04-23 13:55:08,384 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:Entity <bound method Dataset._preproc_samples of <iva.unet.utils.data_loader.Dataset object at 0x7f9c90e25160>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset._preproc_samples of <iva.unet.utils.data_loader.Dataset object at 0x7f9c90e25160>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-04-23 13:55:08,420 [WARNING] tensorflow: Entity <bound method Dataset._preproc_samples of <iva.unet.utils.data_loader.Dataset object at 0x7f9c90e25160>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset._preproc_samples of <iva.unet.utils.data_loader.Dataset object at 0x7f9c90e25160>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/utils/data_loader.py:266: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

2021-04-23 13:55:08,422 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/utils/data_loader.py:266: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

INFO:tensorflow:Calling model_fn.
2021-04-23 13:55:08,448 [INFO] tensorflow: Calling model_fn.
{'exec_mode': 'train', 'model_dir': '/output/runs/resnet18_run1', 'log_dir': None, 'batch_size': 64, 'learning_rate': 9.999999747378752e-05, 'crossvalidation_idx': None, 'max_steps': None, 'weight_decay': 3.000000026176508e-09, 'log_summary_steps': 10, 'warmup_steps': 0, 'augment': False, 'use_amp': False, 'use_trt': False, 'use_xla': False, 'loss': 'cross_entropy', 'epochs': 1, 'pretrained_weights_file': None, 'unet_model': <iva.unet.model.unet_model.UnetModel object at 0x7f9acd08d668>, 'key': 'bTRybTg2YXJ0ZmludnU5Yzc1Y2dqcXVldDE6YTA4NzdlNzAtYWFjNS00MDk4LWJlNDctZjMwODZmNGIxY2Ew', 'experiment_spec': random_seed: 42
dataset_config {
  dataset: "custom"
  input_image_type: "color"
  train_images_path: "/data1/TrainVal_images/TrainVal_images/train_images/"
  train_masks_path: "/data1/TrainVal_parsing_annotations/TrainVal_parsing_annotations/train_segmentations/"
  val_images_path: "/data1/TrainVal_images/TrainVal_images/val_images/"
  val_masks_path: "/data1/TrainVal_parsing_annotations/TrainVal_parsing_annotations/val_segmentations/"
  data_class_config {
    target_classes {
      name: "Background"
      mapping_class: "Background"
    }
    target_classes {
      name: "Hat"
      label_id: 1
      mapping_class: "Hat"
    }
    target_classes {
      name: "Hair"
      label_id: 2
      mapping_class: "Hair"
    }
    target_classes {
      name: "Glove"
      label_id: 3
      mapping_class: "Glove"
    }
    target_classes {
      name: "Sunglasses"
      label_id: 4
      mapping_class: "Sunglasses"
    }
    target_classes {
      name: "UpperClothes"
      label_id: 5
      mapping_class: "UpperClothes"
    }
    target_classes {
      name: "Dress"
      label_id: 6
      mapping_class: "Dress"
    }
    target_classes {
      name: "Coat"
      label_id: 7
      mapping_class: "Coat"
    }
    target_classes {
      name: "Socks"
      label_id: 8
      mapping_class: "Socks"
    }
    target_classes {
      name: "Pants"
      label_id: 9
      mapping_class: "Pants"
    }
    target_classes {
      name: "Jumpsuits"
      label_id: 10
      mapping_class: "Jumpsuits"
    }
    target_classes {
      name: "Scarf"
      label_id: 11
      mapping_class: "Scarf"
    }
    target_classes {
      name: "Skirt"
      label_id: 12
      mapping_class: "Skirt"
    }
    target_classes {
      name: "Face"
      label_id: 13
      mapping_class: "Face"
    }
    target_classes {
      name: "Left-arm"
      label_id: 14
      mapping_class: "Left-arm"
    }
    target_classes {
      name: "Right-arm"
      label_id: 15
      mapping_class: "Right-arm"
    }
    target_classes {
      name: "Left-leg"
      label_id: 16
      mapping_class: "Left-leg"
    }
    target_classes {
      name: "Right-leg"
      label_id: 17
      mapping_class: "Right-leg"
    }
    target_classes {
      name: "Left-shoe"
      label_id: 18
      mapping_class: "Left-shoe"
    }
    target_classes {
      name: "Right-shoe"
      label_id: 19
      mapping_class: "Right-shoe"
    }
  }
}
model_config {
  num_layers: 18
  use_batch_norm: true
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
  all_projections: true
  model_input_height: 320
  model_input_width: 320
  model_input_channels: 3
}
training_config {
  batch_size: 64
  regularizer {
    type: L2
    weight: 3.000000026176508e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993922529e-09
      beta1: 0.8999999761581421
      beta2: 0.9990000128746033
    }
  }
  checkpoint_interval: 1
  log_summary_steps: 10
  learning_rate: 9.999999747378752e-05
  loss: "cross_entropy"
  epochs: 1
}
, 'seed': 42, 'benchmark': False, 'temp_dir': '/tmp/tmpc0a6e1w0', 'num_classes': 20, 'start_step': 0, 'checkpoint_interval': 1, 'phase': None}
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 3, 320, 320)  0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 64, 160, 160) 9472        input_1[0][0]                    
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, 64, 160, 160) 256         conv1[0][0]                      
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 64, 160, 160) 0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (None, 64, 80, 80)   36928       activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (None, 64, 80, 80)   256         block_1a_conv_1[0][0]            
__________________________________________________________________________________________________
block_1a_relu_1 (Activation)    (None, 64, 80, 80)   0           block_1a_bn_1[0][0]              
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (None, 64, 80, 80)   36928       block_1a_relu_1[0][0]            
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (None, 64, 80, 80)   4160        activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (None, 64, 80, 80)   256         block_1a_conv_2[0][0]            
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (None, 64, 80, 80)   256         block_1a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_1 (Add)                     (None, 64, 80, 80)   0           block_1a_bn_2[0][0]              
                                                                 block_1a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_1a_relu (Activation)      (None, 64, 80, 80)   0           add_1[0][0]                      
__________________________________________________________________________________________________
block_1b_conv_1 (Conv2D)        (None, 64, 80, 80)   36928       block_1a_relu[0][0]              
__________________________________________________________________________________________________
block_1b_bn_1 (BatchNormalizati (None, 64, 80, 80)   256         block_1b_conv_1[0][0]            
__________________________________________________________________________________________________
block_1b_relu_1 (Activation)    (None, 64, 80, 80)   0           block_1b_bn_1[0][0]              
__________________________________________________________________________________________________
block_1b_conv_2 (Conv2D)        (None, 64, 80, 80)   36928       block_1b_relu_1[0][0]            
__________________________________________________________________________________________________
block_1b_conv_shortcut (Conv2D) (None, 64, 80, 80)   4160        block_1a_relu[0][0]              
__________________________________________________________________________________________________
block_1b_bn_2 (BatchNormalizati (None, 64, 80, 80)   256         block_1b_conv_2[0][0]            
__________________________________________________________________________________________________
block_1b_bn_shortcut (BatchNorm (None, 64, 80, 80)   256         block_1b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_2 (Add)                     (None, 64, 80, 80)   0           block_1b_bn_2[0][0]              
                                                                 block_1b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_1b_relu (Activation)      (None, 64, 80, 80)   0           add_2[0][0]                      
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (None, 128, 40, 40)  73856       block_1b_relu[0][0]              
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (None, 128, 40, 40)  512         block_2a_conv_1[0][0]            
__________________________________________________________________________________________________
block_2a_relu_1 (Activation)    (None, 128, 40, 40)  0           block_2a_bn_1[0][0]              
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (None, 128, 40, 40)  147584      block_2a_relu_1[0][0]            
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (None, 128, 40, 40)  8320        block_1b_relu[0][0]              
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (None, 128, 40, 40)  512         block_2a_conv_2[0][0]            
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (None, 128, 40, 40)  512         block_2a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_3 (Add)                     (None, 128, 40, 40)  0           block_2a_bn_2[0][0]              
                                                                 block_2a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2a_relu (Activation)      (None, 128, 40, 40)  0           add_3[0][0]                      
__________________________________________________________________________________________________
block_2b_conv_1 (Conv2D)        (None, 128, 40, 40)  147584      block_2a_relu[0][0]              
__________________________________________________________________________________________________
block_2b_bn_1 (BatchNormalizati (None, 128, 40, 40)  512         block_2b_conv_1[0][0]            
__________________________________________________________________________________________________
block_2b_relu_1 (Activation)    (None, 128, 40, 40)  0           block_2b_bn_1[0][0]              
__________________________________________________________________________________________________
block_2b_conv_2 (Conv2D)        (None, 128, 40, 40)  147584      block_2b_relu_1[0][0]            
__________________________________________________________________________________________________
block_2b_conv_shortcut (Conv2D) (None, 128, 40, 40)  16512       block_2a_relu[0][0]              
__________________________________________________________________________________________________
block_2b_bn_2 (BatchNormalizati (None, 128, 40, 40)  512         block_2b_conv_2[0][0]            
__________________________________________________________________________________________________
block_2b_bn_shortcut (BatchNorm (None, 128, 40, 40)  512         block_2b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_4 (Add)                     (None, 128, 40, 40)  0           block_2b_bn_2[0][0]              
                                                                 block_2b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2b_relu (Activation)      (None, 128, 40, 40)  0           add_4[0][0]                      
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (None, 256, 20, 20)  295168      block_2b_relu[0][0]              
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (None, 256, 20, 20)  1024        block_3a_conv_1[0][0]            
__________________________________________________________________________________________________
block_3a_relu_1 (Activation)    (None, 256, 20, 20)  0           block_3a_bn_1[0][0]              
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (None, 256, 20, 20)  590080      block_3a_relu_1[0][0]            
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (None, 256, 20, 20)  33024       block_2b_relu[0][0]              
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (None, 256, 20, 20)  1024        block_3a_conv_2[0][0]            
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (None, 256, 20, 20)  1024        block_3a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_5 (Add)                     (None, 256, 20, 20)  0           block_3a_bn_2[0][0]              
                                                                 block_3a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3a_relu (Activation)      (None, 256, 20, 20)  0           add_5[0][0]                      
__________________________________________________________________________________________________
block_3b_conv_1 (Conv2D)        (None, 256, 20, 20)  590080      block_3a_relu[0][0]              
__________________________________________________________________________________________________
block_3b_bn_1 (BatchNormalizati (None, 256, 20, 20)  1024        block_3b_conv_1[0][0]            
__________________________________________________________________________________________________
block_3b_relu_1 (Activation)    (None, 256, 20, 20)  0           block_3b_bn_1[0][0]              
__________________________________________________________________________________________________
block_3b_conv_2 (Conv2D)        (None, 256, 20, 20)  590080      block_3b_relu_1[0][0]            
__________________________________________________________________________________________________
block_3b_conv_shortcut (Conv2D) (None, 256, 20, 20)  65792       block_3a_relu[0][0]              
__________________________________________________________________________________________________
block_3b_bn_2 (BatchNormalizati (None, 256, 20, 20)  1024        block_3b_conv_2[0][0]            
__________________________________________________________________________________________________
block_3b_bn_shortcut (BatchNorm (None, 256, 20, 20)  1024        block_3b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_6 (Add)                     (None, 256, 20, 20)  0           block_3b_bn_2[0][0]              
                                                                 block_3b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3b_relu (Activation)      (None, 256, 20, 20)  0           add_6[0][0]                      
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (None, 512, 20, 20)  1180160     block_3b_relu[0][0]              
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (None, 512, 20, 20)  2048        block_4a_conv_1[0][0]            
__________________________________________________________________________________________________
block_4a_relu_1 (Activation)    (None, 512, 20, 20)  0           block_4a_bn_1[0][0]              
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (None, 512, 20, 20)  2359808     block_4a_relu_1[0][0]            
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (None, 512, 20, 20)  131584      block_3b_relu[0][0]              
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (None, 512, 20, 20)  2048        block_4a_conv_2[0][0]            
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (None, 512, 20, 20)  2048        block_4a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_7 (Add)                     (None, 512, 20, 20)  0           block_4a_bn_2[0][0]              
                                                                 block_4a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_4a_relu (Activation)      (None, 512, 20, 20)  0           add_7[0][0]                      
__________________________________________________________________________________________________
block_4b_conv_1 (Conv2D)        (None, 512, 20, 20)  2359808     block_4a_relu[0][0]              
__________________________________________________________________________________________________
block_4b_bn_1 (BatchNormalizati (None, 512, 20, 20)  2048        block_4b_conv_1[0][0]            
__________________________________________________________________________________________________
block_4b_relu_1 (Activation)    (None, 512, 20, 20)  0           block_4b_bn_1[0][0]              
__________________________________________________________________________________________________
block_4b_conv_2 (Conv2D)        (None, 512, 20, 20)  2359808     block_4b_relu_1[0][0]            
__________________________________________________________________________________________________
block_4b_conv_shortcut (Conv2D) (None, 512, 20, 20)  262656      block_4a_relu[0][0]              
__________________________________________________________________________________________________
block_4b_bn_2 (BatchNormalizati (None, 512, 20, 20)  2048        block_4b_conv_2[0][0]            
__________________________________________________________________________________________________
block_4b_bn_shortcut (BatchNorm (None, 512, 20, 20)  2048        block_4b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_8 (Add)                     (None, 512, 20, 20)  0           block_4b_bn_2[0][0]              
                                                                 block_4b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_4b_relu (Activation)      (None, 512, 20, 20)  0           add_8[0][0]                      
__________________________________________________________________________________________________
conv2d_transpose_1 (Conv2DTrans (None, 256, 40, 40)  2097408     block_4b_relu[0][0]              
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 384, 40, 40)  0           conv2d_transpose_1[0][0]         
                                                                 block_2a_relu[0][0]              
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 256, 40, 40)  884992      concatenate_1[0][0]              
__________________________________________________________________________________________________
conv2d_transpose_2 (Conv2DTrans (None, 128, 80, 80)  524416      conv2d_1[0][0]                   
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 192, 80, 80)  0           conv2d_transpose_2[0][0]         
                                                                 block_1a_relu[0][0]              
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 128, 80, 80)  221312      concatenate_2[0][0]              
__________________________________________________________________________________________________
conv2d_transpose_3 (Conv2DTrans (None, 64, 160, 160) 131136      conv2d_2[0][0]                   
__________________________________________________________________________________________________
concatenate_3 (Concatenate)     (None, 128, 160, 160 0           conv2d_transpose_3[0][0]         
                                                                 bn_conv1[0][0]                   
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, 64, 160, 160) 73792       concatenate_3[0][0]              
__________________________________________________________________________________________________
conv2d_transpose_4 (Conv2DTrans (None, 64, 320, 320) 65600       conv2d_3[0][0]                   
__________________________________________________________________________________________________
concatenate_4 (Concatenate)     (None, 67, 320, 320) 0           conv2d_transpose_4[0][0]         
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
conv2d_4 (Conv2D)               (None, 64, 320, 320) 38656       concatenate_4[0][0]              
__________________________________________________________________________________________________
conv2d_5 (Conv2D)               (None, 20, 320, 320) 11540       conv2d_4[0][0]                   
==================================================================================================
Total params: 15,597,140
Trainable params: 15,585,492
Non-trainable params: 11,648
__________________________________________________________________________________________________
INFO:tensorflow:Done calling model_fn.
2021-04-23 13:55:12,956 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2021-04-23 13:55:15,231 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2021-04-23 13:55:16,562 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2021-04-23 13:55:16,690 [INFO] tensorflow: Done running local_init_op.
[GPU] Restoring pretrained weights from: /tmp/tmppr_q3_qt/model.ckpt-1
2021-04-23 13:55:17,564 [INFO] iva.unet.hooks.pretrained_restore_hook: Pretrained weights loaded with success...

INFO:tensorflow:Saving checkpoints for step-0.
2021-04-23 13:55:21,874 [INFO] tensorflow: Saving checkpoints for step-0.
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:92: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

2021-04-23 13:55:27,027 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/training_hook.py:92: The name tf.train.get_or_create_global_step is deprecated. Please use tf.compat.v1.train.get_or_create_global_step instead.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: ValueError: `generator` yielded an element of shape (185, 189, 3) where an element of shape (380, 356, 3) was expected.
Traceback (most recent call last):

  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
    ret = func(*args)

  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 674, in generator_py_func
    "of shape %s was expected." % (ret_array.shape, expected_shape))

ValueError: `generator` yielded an element of shape (185, 189, 3) where an element of shape (380, 356, 3) was expected.


	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]]
  (1) Invalid argument: ValueError: `generator` yielded an element of shape (185, 189, 3) where an element of shape (380, 356, 3) was expected.
Traceback (most recent call last):

  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
    ret = func(*args)

  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 674, in generator_py_func
    "of shape %s was expected." % (ret_array.shape, expected_shape))

ValueError: `generator` yielded an element of shape (185, 189, 3) where an element of shape (380, 356, 3) was expected.


	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]]
	 [[IteratorGetNext/_5081]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 403, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 397, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 298, in run_experiment
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 217, in train_unet
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 104, in run_training_loop
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1259, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1360, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.6/dist-packages/six.py", line 696, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1345, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1418, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1176, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: ValueError: `generator` yielded an element of shape (185, 189, 3) where an element of shape (380, 356, 3) was expected.
Traceback (most recent call last):

  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
    ret = func(*args)

  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 674, in generator_py_func
    "of shape %s was expected." % (ret_array.shape, expected_shape))

ValueError: `generator` yielded an element of shape (185, 189, 3) where an element of shape (380, 356, 3) was expected.


	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]]
  (1) Invalid argument: ValueError: `generator` yielded an element of shape (185, 189, 3) where an element of shape (380, 356, 3) was expected.
Traceback (most recent call last):

  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
    ret = func(*args)

  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 674, in generator_py_func
    "of shape %s was expected." % (ret_array.shape, expected_shape))

ValueError: `generator` yielded an element of shape (185, 189, 3) where an element of shape (380, 356, 3) was expected.


	 [[{{node PyFunc}}]]
	 [[IteratorGetNext]]
	 [[IteratorGetNext/_5081]]
0 successful operations.
0 derived errors ignored.
Traceback (most recent call last):
  File "/usr/local/bin/unet", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/entrypoint/unet.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-04-23 15:55:32,027 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

PS: I tried the same with a vgg16 backbone and got the same error.

Please check if all of the images and masks are of equal size.

https://docs.nvidia.com/metropolis/TLT/tlt-user-guide/text/open_model_architectures.html#semantic-segmentation

The train tool does not support training on images of multiple resolutions. All of the images and masks must be of equal size. However, image and masks need not be necessarily equal to model input size. The images/masks will be resized to the model input size during training.
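A quick way to verify this on a dataset is to group the files by resolution. The sketch below is only an illustration: it assumes Pillow is installed, and the helper names (`read_size`, `group_by_size`, `scan`) are hypothetical, not part of TLT.

```python
from collections import defaultdict
from pathlib import Path


def read_size(path):
    """Return (width, height) of an image file. Assumes Pillow is installed."""
    from PIL import Image  # lazy import: the grouping helper works without Pillow

    with Image.open(path) as im:
        return im.size


def group_by_size(sizes):
    """Group file names by resolution; `sizes` maps file name -> (width, height)."""
    groups = defaultdict(list)
    for name, size in sizes.items():
        groups[tuple(size)].append(name)
    return dict(groups)


def scan(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Read the size of every image directly inside `folder` and group by size."""
    files = sorted(
        p for p in Path(folder).iterdir() if p.suffix.lower() in extensions
    )
    return group_by_size({p.name: read_size(p) for p in files})
```

Running `scan(...)` on the images folder and on the masks folder should give a single key, and the same key for both; more than one key means the dataset mixes resolutions, as in the (185, 189) versus (380, 356) error above.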

Oh, thanks @Morganh for pointing that out. I misunderstood the doc: I thought it only required each image and its corresponding mask to have the same size (h, w).
Nevertheless, I resized all my images and masks to one fixed size, and now I’m getting the following error:

Matplotlib created a temporary config/cache directory at /tmp/matplotlib-q5qe0toe because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/checkpoint_saver_hook.py:21: The name tf.train.CheckpointSaverHook is deprecated. Please use tf.estimator.CheckpointSaverHook instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py:23: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/hooks/pretrained_restore_hook.py:23: The name tf.logging.WARN is deprecated. Please use tf.compat.v1.logging.WARN instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py:389: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

Loading experiment spec at /workspace/tlt-experiments/specs/resnet18.txt.
2021-04-23 17:01:19,894 [INFO] __main__: Loading experiment spec at /workspace/tlt-experiments/specs/resnet18.txt.
2021-04-23 17:01:19,896 [INFO] iva.unet.spec_handler.spec_loader: Merging specification from /workspace/tlt-experiments/specs/resnet18.txt
2021-04-23 17:01:19,906 [INFO] root: Initializing the pre-trained weights from /output/pretrained_resnet18/tlt_semantic_segmentation_vresnet18/resnet_18.hdf5
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

2021-04-23 17:01:19,912 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

2021-04-23 17:01:19,921 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

2021-04-23 17:01:19,942 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

2021-04-23 17:01:19,948 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

WARNING:tensorflow:From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

2021-04-23 17:01:20,682 [WARNING] tensorflow: From /opt/nvidia/third_party/keras/tensorflow_backend.py:187: The name tf.nn.avg_pool is deprecated. Please use tf.nn.avg_pool2d instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2021-04-23 17:01:20,990 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

2021-04-23 17:01:20,991 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2021-04-23 17:01:21,147 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

2021-04-23 17:01:21,629 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:95: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

INFO:tensorflow:Using config: {'_model_dir': '/output/runs/resnet18_run1', '_tf_random_seed': None, '_save_summary_steps': 5, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 38
gpu_options {
  allow_growth: true
  visible_device_list: "0"
  force_gpu_compatible: true
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa1376380b8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2021-04-23 17:01:21,649 [INFO] tensorflow: Using config: {'_model_dir': '/output/runs/resnet18_run1', '_tf_random_seed': None, '_save_summary_steps': 5, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': intra_op_parallelism_threads: 1
inter_op_parallelism_threads: 38
gpu_options {
  allow_growth: true
  visible_device_list: "0"
  force_gpu_compatible: true
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fa1376380b8>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 403, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 397, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 298, in run_experiment
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/train.py", line 213, in train_unet
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/utils/data_loader.py", line 76, in __init__
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/utils/data_loader.py", line 102, in get_input_shape
IndexError: list index out of range
Traceback (most recent call last):
  File "/usr/local/bin/unet", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/entrypoint/unet.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.
2021-04-23 19:01:23,290 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

The error is IndexError: list index out of range, but because I can’t access the source code, I have no idea which list it refers to.
Any tips for debugging?
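Since the loader source is not available, the following is only a guess: an IndexError raised in get_input_shape could mean the loader found no files in one of the dataset folders, for example because the folder is not visible from inside the container. A minimal sketch to rule that out, run in the same environment as the training job (the function names are hypothetical):

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def count_images(folder):
    """Return the number of image files directly inside `folder`,
    or None if the folder does not exist."""
    path = Path(folder)
    if not path.is_dir():
        return None
    return sum(
        1
        for p in path.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def report(folders):
    """Map each folder to a short status string: missing, empty, or a count."""
    status = {}
    for folder in folders:
        n = count_images(folder)
        if n is None:
            status[folder] = "missing"
        elif n == 0:
            status[folder] = "empty"
        else:
            status[folder] = f"{n} images"
    return status
```

Passing the four paths from the spec's dataset_config to `report(...)` shows at a glance whether any of them is missing or empty from the point of view of the process that runs training.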

Can you share your training spec file? And what is the resolution after “I resized all my images and masks to a fixed size”?

Sorry for the delayed response, I was away for the weekend.

I finally went with the provided container (nvcr.io/nvidia/tlt-streamanalytics:v3.0-dp-py3), as our end goal is to integrate the trained model into a product via containers.
I was able to train a model with unet on my custom data using the following command (I also noticed that the command differs from the one used when running through tlt):

!unet train --gpus=1 \
  -e $SPECS_DIR/unet_train_resnet_lip.txt \
  -r $USER_EXPERIMENT_DIR/runs/resnet18_lip_run1 \
  -m $USER_EXPERIMENT_DIR/pretrained_resnet18/tlt_pretrained_object_detection_vresnet18/resnet_18.hdf5 \
  -n lip_model \
  -k $KEY

The spec file is as follows:

random_seed: 42
model_config {
  num_layers: 18
  all_projections: true
  arch: "resnet"
  use_batch_norm: true
  training_precision {
    backend_floatx: FLOAT32
  }
  model_input_height: 224
  model_input_width: 224
  model_input_channels: 3
}

training_config {
  batch_size: 64
  epochs: 5
  use_xla: true
  log_summary_steps: 10
  checkpoint_interval: 1
  learning_rate: 0.0001
  regularizer {
    type: L1
    weight: 3.00000002618e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993923e-09
      beta1: 0.899999976158
      beta2: 0.999000012875
    }
  }
}

dataset_config {
    dataset: "custom"
    augment: False
    input_image_type: "color"
    train_images_path: "/data/lip/small_train/imgs/"
    train_masks_path: "/data/lip/small_train/orig/"
    val_images_path: "/data/lip/small_val/imgs/"
    val_masks_path: "/data/lip/small_val/orig/"
    
    data_class_config {
         target_classes {
            name: 'Background'
            mapping_class: 'Background'
            label_id: 0
        }
        target_classes {
            name: 'Hat'
            mapping_class: 'Hat'
            label_id: 1
        }
        target_classes {
            name: 'Hair'
            mapping_class: 'Hair'
            label_id: 2
        }
        target_classes {
            name: 'Glove'
            mapping_class: 'Glove'
            label_id: 3
        }
        target_classes {
            name: 'Sunglasses'
            mapping_class: 'Sunglasses'
            label_id: 4
        }
        target_classes {
            name: 'UpperClothes'
            mapping_class: 'UpperClothes'
            label_id: 5
        }
        target_classes {
            name: 'Dress'
            mapping_class: 'Dress'
            label_id: 6
        }
        target_classes {
            name: 'Coat'
            mapping_class: 'Coat'
            label_id: 7
        }
        target_classes {
            name: 'Socks'
            mapping_class: 'Socks'
            label_id: 8
        }
        target_classes {
            name: 'Pants'
            mapping_class: 'Pants'
            label_id: 9
        }
        target_classes {
            name: 'Jumpsuits'
            mapping_class: 'Jumpsuits'
            label_id: 10
        }
        target_classes {
            name: 'Scarf'
            mapping_class: 'Scarf'
            label_id: 11
        }
        target_classes {
            name: 'Skirt'
            mapping_class: 'Skirt'
            label_id: 12
        }
        target_classes {
            name: 'Face'
            mapping_class: 'Face'
            label_id: 13
        }
        target_classes {
            name: 'Left-arm'
            mapping_class: 'Left-arm'
            label_id: 14
        }
        target_classes {
            name: 'Right-arm'
            mapping_class: 'Right-arm'
            label_id: 15
        }
        target_classes {
            name: 'Left-leg'
            mapping_class: 'Left-leg'
            label_id: 16
        }
        target_classes {
            name: 'Right-leg'
            mapping_class: 'Right-leg'
            label_id: 17
        }
        target_classes {
            name: 'Left-shoe'
            mapping_class: 'Left-shoe'
            label_id: 18
        }
        target_classes {
            name: 'Right-shoe'
            mapping_class: 'Right-shoe'
            label_id: 19
        }
    }
}
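The twenty target_classes entries above all follow one pattern, so they can be generated rather than typed by hand. A small sketch, assuming each class maps to itself and label IDs follow list order as in the spec above (the function name is hypothetical):

```python
# Class names in label_id order, copied from the spec above.
LIP_CLASSES = [
    "Background", "Hat", "Hair", "Glove", "Sunglasses",
    "UpperClothes", "Dress", "Coat", "Socks", "Pants",
    "Jumpsuits", "Scarf", "Skirt", "Face", "Left-arm",
    "Right-arm", "Left-leg", "Right-leg", "Left-shoe", "Right-shoe",
]


def target_classes_block(names, indent=8):
    """Render one target_classes entry per name, with label_id = list index."""
    pad = " " * indent
    inner = " " * (indent + 4)
    lines = []
    for label_id, name in enumerate(names):
        lines += [
            f"{pad}target_classes {{",
            f"{inner}name: '{name}'",
            f"{inner}mapping_class: '{name}'",
            f"{inner}label_id: {label_id}",
            f"{pad}}}",
        ]
    return "\n".join(lines)
```

`print(target_classes_block(LIP_CLASSES))` produces text that can be pasted inside data_class_config; generating it this way keeps the label IDs contiguous and avoids copy-paste slips.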

The training ran successfully, but when running validation I got the error below. It reports that a mask image file was not found, although the file exists. I noticed that the script looks for a mask file with a .jpg extension, whereas all my masks have a .png extension. This is surprising, given that the validation images/masks follow the same pattern as the training ones: same file name, with images as .jpg and masks as .png.
The file reported as missing does exist, but with a .png extension instead.
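To confirm the layout of the validation set independently of the tool, the image and mask folders can be compared by file stem. This sketch only checks the data on disk and does not explain why the script looks for .jpg masks; the function name and report keys are hypothetical.

```python
from pathlib import Path


def pair_report(images_dir, masks_dir):
    """Compare file stems in the two folders and list the mask extensions in use."""
    images = {p.stem: p.suffix for p in Path(images_dir).iterdir() if p.is_file()}
    masks = {p.stem: p.suffix for p in Path(masks_dir).iterdir() if p.is_file()}
    shared = images.keys() & masks.keys()
    return {
        "images_without_mask": sorted(images.keys() - masks.keys()),
        "masks_without_image": sorted(masks.keys() - images.keys()),
        "mask_extensions": sorted(set(masks.values())),
        # Pairs that a lookup reusing the image's full file name would miss.
        "same_name_lookup_misses": sorted(
            s for s in shared if images[s] != masks[s]
        ),
    }
```

Running it on both the training and validation folders shows whether the two splits really are laid out the same way; a non-empty same_name_lookup_misses list is expected when images are .jpg and masks are .png.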

Full traceback:

Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/evaluate.py:44: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.

WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/evaluate.py:44: The name tf.logging.INFO is deprecated. Please use tf.compat.v1.logging.INFO instead.

2021-04-26 09:04:06,350 [INFO] __main__: Loading experiment spec at /data/tlt-experiments/segmentation/runs/resnet18_lip_run1/experiment_spec.txt.
2021-04-26 09:04:06,351 [INFO] iva.unet.spec_handler.spec_loader: Merging specification from /data/tlt-experiments/segmentation/runs/resnet18_lip_run1/experiment_spec.txt
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2021-04-26 09:04:06,358 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

INFO:tensorflow:Using config: {'_model_dir': '/data/tlt-experiments/segmentation/runs/resnet18_lip_run1/weights', '_tf_random_seed': None, '_save_summary_steps': 1, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': gpu_options {
}
allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f8afe36c978>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2021-04-26 09:04:06,388 [INFO] tensorflow: Using config: {'_model_dir': '/data/tlt-experiments/segmentation/runs/resnet18_lip_run1/weights', '_tf_random_seed': None, '_save_summary_steps': 1, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': gpu_options {
}
allow_soft_placement: true
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f8afe36c978>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2021-04-26 09:04:06,388 [INFO] iva.unet.model.model_io: Loading weights from /data/tlt-experiments/segmentation/runs/resnet18_lip_run1/weights/lip_model.tlt
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

2021-04-26 09:04:09,335 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:Entity <bound method Dataset._normalize_inputs of <iva.unet.utils.data_loader.Dataset object at 0x7f8afe3ebb38>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset._normalize_inputs of <iva.unet.utils.data_loader.Dataset object at 0x7f8afe3ebb38>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2021-04-26 09:04:09,370 [WARNING] tensorflow: Entity <bound method Dataset._normalize_inputs of <iva.unet.utils.data_loader.Dataset object at 0x7f8afe3ebb38>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Dataset._normalize_inputs of <iva.unet.utils.data_loader.Dataset object at 0x7f8afe3ebb38>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/utils/data_loader.py:266: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

2021-04-26 09:04:09,372 [WARNING] tensorflow: From /home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/utils/data_loader.py:266: The name tf.image.resize_images is deprecated. Please use tf.image.resize instead.

INFO:tensorflow:Calling model_fn.
2021-04-26 09:04:09,385 [INFO] tensorflow: Calling model_fn.
{'exec_mode': 'train', 'model_dir': '/data/tlt-experiments/segmentation/runs/resnet18_lip_run1/weights', 'log_dir': None, 'batch_size': 64, 'learning_rate': 9.999999747378752e-05, 'crossvalidation_idx': None, 'max_steps': None, 'weight_decay': 3.000000026176508e-09, 'log_summary_steps': 10, 'warmup_steps': 0, 'augment': False, 'use_amp': False, 'use_trt': False, 'use_xla': True, 'loss': 'cross_entropy', 'epochs': 5, 'pretrained_weights_file': None, 'unet_model': <iva.unet.model.unet_model.UnetModel object at 0x7f8a6fe44dd8>, 'key': 'bTRybTg2YXJ0ZmludnU5Yzc1Y2dqcXVldDE6YTA4NzdlNzAtYWFjNS00MDk4LWJlNDctZjMwODZmNGIxY2Ew', 'experiment_spec': random_seed: 42
dataset_config {
  dataset: "custom"
  input_image_type: "color"
  train_images_path: "/data/lip/small_train/imgs/"
  train_masks_path: "/data/lip/small_train/orig/"
  val_images_path: "/data/lip/small_val/imgs/"
  val_masks_path: "/data/lip/small_val/orig/"
  data_class_config {
    target_classes {
      name: "Background"
      mapping_class: "Background"
    }
    target_classes {
      name: "Hat"
      label_id: 1
      mapping_class: "Hat"
    }
    target_classes {
      name: "Hair"
      label_id: 2
      mapping_class: "Hair"
    }
    target_classes {
      name: "Glove"
      label_id: 3
      mapping_class: "Glove"
    }
    target_classes {
      name: "Sunglasses"
      label_id: 4
      mapping_class: "Sunglasses"
    }
    target_classes {
      name: "UpperClothes"
      label_id: 5
      mapping_class: "UpperClothes"
    }
    target_classes {
      name: "Dress"
      label_id: 6
      mapping_class: "Dress"
    }
    target_classes {
      name: "Coat"
      label_id: 7
      mapping_class: "Coat"
    }
    target_classes {
      name: "Socks"
      label_id: 8
      mapping_class: "Socks"
    }
    target_classes {
      name: "Pants"
      label_id: 9
      mapping_class: "Pants"
    }
    target_classes {
      name: "Jumpsuits"
      label_id: 10
      mapping_class: "Jumpsuits"
    }
    target_classes {
      name: "Scarf"
      label_id: 11
      mapping_class: "Scarf"
    }
    target_classes {
      name: "Skirt"
      label_id: 12
      mapping_class: "Skirt"
    }
    target_classes {
      name: "Face"
      label_id: 13
      mapping_class: "Face"
    }
    target_classes {
      name: "Left-arm"
      label_id: 14
      mapping_class: "Left-arm"
    }
    target_classes {
      name: "Right-arm"
      label_id: 15
      mapping_class: "Right-arm"
    }
    target_classes {
      name: "Left-leg"
      label_id: 16
      mapping_class: "Left-leg"
    }
    target_classes {
      name: "Right-leg"
      label_id: 17
      mapping_class: "Right-leg"
    }
    target_classes {
      name: "Left-shoe"
      label_id: 18
      mapping_class: "Left-shoe"
    }
    target_classes {
      name: "Right-shoe"
      label_id: 19
      mapping_class: "Right-shoe"
    }
  }
}
model_config {
  num_layers: 18
  use_batch_norm: true
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
  all_projections: true
  model_input_height: 224
  model_input_width: 224
  model_input_channels: 3
}
training_config {
  batch_size: 64
  regularizer {
    type: L1
    weight: 3.000000026176508e-09
  }
  optimizer {
    adam {
      epsilon: 9.99999993922529e-09
      beta1: 0.8999999761581421
      beta2: 0.9990000128746033
    }
  }
  checkpoint_interval: 1
  log_summary_steps: 10
  use_xla: true
  learning_rate: 9.999999747378752e-05
  epochs: 5
}
, 'seed': 42, 'benchmark': False, 'temp_dir': '/tmp/tmp2kt264fe', 'num_classes': 20, 'start_step': 0, 'checkpoint_interval': 1, 'phase': None}
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

2021-04-26 09:04:09,386 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

2021-04-26 09:04:09,387 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

2021-04-26 09:04:09,542 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

2021-04-26 09:04:09,547 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 3, 224, 224)  0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (None, 64, 112, 112) 9472        input_1[0][0]                    
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (None, 64, 112, 112) 256         conv1[0][0]                      
__________________________________________________________________________________________________
activation_1 (Activation)       (None, 64, 112, 112) 0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (None, 64, 56, 56)   36928       activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (None, 64, 56, 56)   256         block_1a_conv_1[0][0]            
__________________________________________________________________________________________________
block_1a_relu_1 (Activation)    (None, 64, 56, 56)   0           block_1a_bn_1[0][0]              
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (None, 64, 56, 56)   36928       block_1a_relu_1[0][0]            
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (None, 64, 56, 56)   4160        activation_1[0][0]               
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (None, 64, 56, 56)   256         block_1a_conv_2[0][0]            
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (None, 64, 56, 56)   256         block_1a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_1 (Add)                     (None, 64, 56, 56)   0           block_1a_bn_2[0][0]              
                                                                 block_1a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_1a_relu (Activation)      (None, 64, 56, 56)   0           add_1[0][0]                      
__________________________________________________________________________________________________
block_1b_conv_1 (Conv2D)        (None, 64, 56, 56)   36928       block_1a_relu[0][0]              
__________________________________________________________________________________________________
block_1b_bn_1 (BatchNormalizati (None, 64, 56, 56)   256         block_1b_conv_1[0][0]            
__________________________________________________________________________________________________
block_1b_relu_1 (Activation)    (None, 64, 56, 56)   0           block_1b_bn_1[0][0]              
__________________________________________________________________________________________________
block_1b_conv_2 (Conv2D)        (None, 64, 56, 56)   36928       block_1b_relu_1[0][0]            
__________________________________________________________________________________________________
block_1b_conv_shortcut (Conv2D) (None, 64, 56, 56)   4160        block_1a_relu[0][0]              
__________________________________________________________________________________________________
block_1b_bn_2 (BatchNormalizati (None, 64, 56, 56)   256         block_1b_conv_2[0][0]            
__________________________________________________________________________________________________
block_1b_bn_shortcut (BatchNorm (None, 64, 56, 56)   256         block_1b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_2 (Add)                     (None, 64, 56, 56)   0           block_1b_bn_2[0][0]              
                                                                 block_1b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_1b_relu (Activation)      (None, 64, 56, 56)   0           add_2[0][0]                      
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (None, 128, 28, 28)  73856       block_1b_relu[0][0]              
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (None, 128, 28, 28)  512         block_2a_conv_1[0][0]            
__________________________________________________________________________________________________
block_2a_relu_1 (Activation)    (None, 128, 28, 28)  0           block_2a_bn_1[0][0]              
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (None, 128, 28, 28)  147584      block_2a_relu_1[0][0]            
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (None, 128, 28, 28)  8320        block_1b_relu[0][0]              
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (None, 128, 28, 28)  512         block_2a_conv_2[0][0]            
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (None, 128, 28, 28)  512         block_2a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_3 (Add)                     (None, 128, 28, 28)  0           block_2a_bn_2[0][0]              
                                                                 block_2a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2a_relu (Activation)      (None, 128, 28, 28)  0           add_3[0][0]                      
__________________________________________________________________________________________________
block_2b_conv_1 (Conv2D)        (None, 128, 28, 28)  147584      block_2a_relu[0][0]              
__________________________________________________________________________________________________
block_2b_bn_1 (BatchNormalizati (None, 128, 28, 28)  512         block_2b_conv_1[0][0]            
__________________________________________________________________________________________________
block_2b_relu_1 (Activation)    (None, 128, 28, 28)  0           block_2b_bn_1[0][0]              
__________________________________________________________________________________________________
block_2b_conv_2 (Conv2D)        (None, 128, 28, 28)  147584      block_2b_relu_1[0][0]            
__________________________________________________________________________________________________
block_2b_conv_shortcut (Conv2D) (None, 128, 28, 28)  16512       block_2a_relu[0][0]              
__________________________________________________________________________________________________
block_2b_bn_2 (BatchNormalizati (None, 128, 28, 28)  512         block_2b_conv_2[0][0]            
__________________________________________________________________________________________________
block_2b_bn_shortcut (BatchNorm (None, 128, 28, 28)  512         block_2b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_4 (Add)                     (None, 128, 28, 28)  0           block_2b_bn_2[0][0]              
                                                                 block_2b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2b_relu (Activation)      (None, 128, 28, 28)  0           add_4[0][0]                      
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (None, 256, 14, 14)  295168      block_2b_relu[0][0]              
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (None, 256, 14, 14)  1024        block_3a_conv_1[0][0]            
__________________________________________________________________________________________________
block_3a_relu_1 (Activation)    (None, 256, 14, 14)  0           block_3a_bn_1[0][0]              
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (None, 256, 14, 14)  590080      block_3a_relu_1[0][0]            
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (None, 256, 14, 14)  33024       block_2b_relu[0][0]              
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (None, 256, 14, 14)  1024        block_3a_conv_2[0][0]            
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (None, 256, 14, 14)  1024        block_3a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_5 (Add)                     (None, 256, 14, 14)  0           block_3a_bn_2[0][0]              
                                                                 block_3a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3a_relu (Activation)      (None, 256, 14, 14)  0           add_5[0][0]                      
__________________________________________________________________________________________________
block_3b_conv_1 (Conv2D)        (None, 256, 14, 14)  590080      block_3a_relu[0][0]              
__________________________________________________________________________________________________
block_3b_bn_1 (BatchNormalizati (None, 256, 14, 14)  1024        block_3b_conv_1[0][0]            
__________________________________________________________________________________________________
block_3b_relu_1 (Activation)    (None, 256, 14, 14)  0           block_3b_bn_1[0][0]              
__________________________________________________________________________________________________
block_3b_conv_2 (Conv2D)        (None, 256, 14, 14)  590080      block_3b_relu_1[0][0]            
__________________________________________________________________________________________________
block_3b_conv_shortcut (Conv2D) (None, 256, 14, 14)  65792       block_3a_relu[0][0]              
__________________________________________________________________________________________________
block_3b_bn_2 (BatchNormalizati (None, 256, 14, 14)  1024        block_3b_conv_2[0][0]            
__________________________________________________________________________________________________
block_3b_bn_shortcut (BatchNorm (None, 256, 14, 14)  1024        block_3b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_6 (Add)                     (None, 256, 14, 14)  0           block_3b_bn_2[0][0]              
                                                                 block_3b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3b_relu (Activation)      (None, 256, 14, 14)  0           add_6[0][0]                      
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (None, 512, 14, 14)  1180160     block_3b_relu[0][0]              
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (None, 512, 14, 14)  2048        block_4a_conv_1[0][0]            
__________________________________________________________________________________________________
block_4a_relu_1 (Activation)    (None, 512, 14, 14)  0           block_4a_bn_1[0][0]              
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (None, 512, 14, 14)  2359808     block_4a_relu_1[0][0]            
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (None, 512, 14, 14)  131584      block_3b_relu[0][0]              
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (None, 512, 14, 14)  2048        block_4a_conv_2[0][0]            
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (None, 512, 14, 14)  2048        block_4a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_7 (Add)                     (None, 512, 14, 14)  0           block_4a_bn_2[0][0]              
                                                                 block_4a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_4a_relu (Activation)      (None, 512, 14, 14)  0           add_7[0][0]                      
__________________________________________________________________________________________________
block_4b_conv_1 (Conv2D)        (None, 512, 14, 14)  2359808     block_4a_relu[0][0]              
__________________________________________________________________________________________________
block_4b_bn_1 (BatchNormalizati (None, 512, 14, 14)  2048        block_4b_conv_1[0][0]            
__________________________________________________________________________________________________
block_4b_relu_1 (Activation)    (None, 512, 14, 14)  0           block_4b_bn_1[0][0]              
__________________________________________________________________________________________________
block_4b_conv_2 (Conv2D)        (None, 512, 14, 14)  2359808     block_4b_relu_1[0][0]            
__________________________________________________________________________________________________
block_4b_conv_shortcut (Conv2D) (None, 512, 14, 14)  262656      block_4a_relu[0][0]              
__________________________________________________________________________________________________
block_4b_bn_2 (BatchNormalizati (None, 512, 14, 14)  2048        block_4b_conv_2[0][0]            
__________________________________________________________________________________________________
block_4b_bn_shortcut (BatchNorm (None, 512, 14, 14)  2048        block_4b_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_8 (Add)                     (None, 512, 14, 14)  0           block_4b_bn_2[0][0]              
                                                                 block_4b_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_4b_relu (Activation)      (None, 512, 14, 14)  0           add_8[0][0]                      
__________________________________________________________________________________________________
conv2d_transpose_1 (Conv2DTrans (None, 256, 28, 28)  2097408     block_4b_relu[0][0]              
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 384, 28, 28)  0           conv2d_transpose_1[0][0]         
                                                                 block_2a_relu[0][0]              
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 256, 28, 28)  884992      concatenate_1[0][0]              
__________________________________________________________________________________________________
conv2d_transpose_2 (Conv2DTrans (None, 128, 56, 56)  524416      conv2d_1[0][0]                   
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (None, 192, 56, 56)  0           conv2d_transpose_2[0][0]         
                                                                 block_1a_relu[0][0]              
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 128, 56, 56)  221312      concatenate_2[0][0]              
__________________________________________________________________________________________________
conv2d_transpose_3 (Conv2DTrans (None, 64, 112, 112) 131136      conv2d_2[0][0]                   
__________________________________________________________________________________________________
concatenate_3 (Concatenate)     (None, 128, 112, 112 0           conv2d_transpose_3[0][0]         
                                                                 bn_conv1[0][0]                   
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, 64, 112, 112) 73792       concatenate_3[0][0]              
__________________________________________________________________________________________________
conv2d_transpose_4 (Conv2DTrans (None, 64, 224, 224) 65600       conv2d_3[0][0]                   
__________________________________________________________________________________________________
concatenate_4 (Concatenate)     (None, 67, 224, 224) 0           conv2d_transpose_4[0][0]         
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
conv2d_4 (Conv2D)               (None, 64, 224, 224) 38656       concatenate_4[0][0]              
__________________________________________________________________________________________________
conv2d_5 (Conv2D)               (None, 20, 224, 224) 11540       conv2d_4[0][0]                   
==================================================================================================
Total params: 15,597,140
Trainable params: 15,585,492
Non-trainable params: 11,648
__________________________________________________________________________________________________
INFO:tensorflow:Done calling model_fn.
2021-04-26 09:04:11,012 [INFO] tensorflow: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2021-04-26 09:04:11,292 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpzbjuu6re/model.ckpt-5
2021-04-26 09:04:11,770 [INFO] tensorflow: Restoring parameters from /tmp/tmpzbjuu6re/model.ckpt-5
INFO:tensorflow:Running local_init_op.
2021-04-26 09:04:12,343 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2021-04-26 09:04:12,381 [INFO] tensorflow: Done running local_init_op.
0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/evaluate.py", line 345, in <module>
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/evaluate.py", line 341, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/evaluate.py", line 249, in run_experiment
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/evaluate.py", line 209, in evaluate_unet
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/evaluate.py", line 148, in run_evaluate_tlt
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/evaluate.py", line 84, in print_compute_metrics
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/scripts/evaluate.py", line 60, in compute_metrics
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/utils/data_loader.py", line 183, in get_mask_arr_from_image
  File "/usr/local/lib/python3.6/dist-packages/PIL/Image.py", line 2766, in open
    fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/data/lip/small_val/orig/10024_490664.jpg'
Traceback (most recent call last):
  File "/usr/local/bin/unet", line 8, in <module>
    sys.exit(main())
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/unet/entrypoint/unet.py", line 12, in main
  File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/216c8b41e526c3295d3b802489ac2034/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/entrypoint/entrypoint.py", line 296, in launch_job
AssertionError: Process run failed.

The evaluation command:

!unet evaluate --gpu_index=$GPU_INDEX -e $SPECS_DIR/unet_train_resnet_lip.txt \
                 -m $USER_EXPERIMENT_DIR/runs/resnet18_lip_run1/weights/lip_model.tlt \
                 -o $USER_EXPERIMENT_DIR/runs/resnet18_lip_run1/ \
                 -k $KEY

Thanks for the details. The internal team will check further.

Evaluation currently cannot handle images and masks that have different file extensions. The next release will support this.
As a workaround, please convert the masks to the same extension as the images.
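
To see which files are affected before running evaluation, a small check like the one below may help. It is only a sketch: it assumes, based on the FileNotFoundError above (the loader looked for `/data/lip/small_val/orig/10024_490664.jpg`), that the mask is looked up under the same file name as the image. The function name `find_missing_masks` is made up for illustration and is not part of TLT.

```python
from pathlib import Path


def find_missing_masks(images_dir, masks_dir):
    """Return image file names that have no identically named file in masks_dir.

    Assumption: the evaluation looks up each mask under the same file name
    (including extension) as its image.
    """
    mask_names = {p.name for p in Path(masks_dir).iterdir() if p.is_file()}
    return sorted(
        p.name
        for p in Path(images_dir).iterdir()
        if p.is_file() and p.name not in mask_names
    )


# Example call with the validation paths from the spec above:
# missing = find_missing_masks("/data/lip/small_val/imgs/", "/data/lip/small_val/orig/")
# print(len(missing), "images without a matching mask file name")
```

If the returned list is not empty, the listed images will most likely trigger the same error during evaluation.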

OK, thanks. I just renamed all mask files from .png to .jpg and it worked. I also tried converting them from PNG to JPEG with various methods (OpenCV, PIL), but all of them failed because they change the pixel values of the masks, which is not desired.
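
For anyone hitting the same issue, here is a minimal sketch of the rename approach. It copies each mask byte for byte under a new extension instead of re-encoding it, so the label values stay untouched (JPEG compression is lossy, which is why a real conversion alters them). The traceback above shows the masks being opened with PIL, which identifies the format from the file contents rather than the extension, so a PNG file named .jpg should still load. The function name and the output directory are made up for illustration.

```python
import shutil
from pathlib import Path


def copy_masks_with_image_extension(masks_dir, out_dir, src_ext=".png", dst_ext=".jpg"):
    """Copy every mask to out_dir under the same stem but with dst_ext.

    The file contents are copied unchanged (no re-encoding), so the pixel
    values of the masks are preserved exactly.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(masks_dir).glob(f"*{src_ext}")):
        dst = out / (src.stem + dst_ext)
        shutil.copyfile(src, dst)  # byte-for-byte copy
        copied.append(dst.name)
    return copied


# Example call (the output directory is hypothetical):
# copy_masks_with_image_extension("/data/lip/small_val/orig/", "/data/lip/small_val/orig_jpg/")
```

Copying into a separate directory keeps the original .png masks intact; val_masks_path in the spec would then need to point to the new directory.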