TAO 5 Mask R-CNN: export converts .tlt to .uff instead of .onnx

Dear @Morganh,

I am training a Mask R-CNN model. After training, I am trying to export the .tlt file to .onnx, but the export generates a .uff file instead. I also tried passing --onnx_route tf2onnx in the export command, but that didn't work either. Could you please suggest how to get the .onnx file directly? I have also been unable to convert the .uff file to .onnx.

I am using TAO version 5.5 on a machine with an NVIDIA 2080 Ti GPU and NVIDIA driver version 535.183.01.
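
One thing worth checking: the launcher log below reports that the command runs in the tao-toolkit:5.0.0-tf1.15.5 container, which may not match the 5.5 version mentioned above. Here is a minimal sketch for pulling the container tag out of a saved export log, assuming the "Running command in container" line format shown in the output below. The helper name container_tag is hypothetical and not part of TAO.

```python
import re

# Launcher log line copied from the export output in this post.
LOG_LINE = (
    "2025-01-09 18:16:56,978 [TAO Toolkit] [INFO] "
    "nvidia_tao_cli.components.instance_handler.local_instance 361: "
    "Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5"
)


def container_tag(log_text):
    """Return the image tag from the 'Running command in container' line, or None.

    Hypothetical helper: it only assumes the log format seen in this post.
    """
    match = re.search(r"Running command in container:\s*\S+:(\S+)", log_text)
    return match.group(1) if match else None


print(container_tag(LOG_LINE))
```

Running this on the full captured log instead of the single line should give the same tag, since only the launcher line matches the pattern.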

# tao <task> export fails if the .onnx file already exists, so we clear the export folder before running tao <task> export
#!rm -rf $LOCAL_EXPERIMENT_DIR/export
#!mkdir -p $LOCAL_EXPERIMENT_DIR/export 

# Generate .onnx file using tao container
!tao model mask_rcnn export -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned/model.epoch-5.tlt \
                      -e $SPECS_DIR/maskrcnn_train_resnet10.txt \
                      --gen_ds_config
2025-01-09 18:16:56,921 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-09 18:16:56,978 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
2025-01-09 18:16:57,006 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
2025-01-09 12:46:57.547332: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2025-01-09 12:46:57,581 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
2025-01-09 12:46:58.734242: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
Using TensorFlow backend.
2025-01-09 12:46:58,826 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-09 12:46:58,848 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-09 12:46:58,851 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-09 12:46:59,067 [TAO Toolkit] [WARNING] matplotlib 500: Matplotlib created a temporary config/cache directory at /tmp/matplotlib-ce77zrex because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
2025-01-09 12:46:59,200 [TAO Toolkit] [INFO] matplotlib.font_manager 1633: generated new fontManager
2025-01-09 12:46:59.746133: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libnvinfer.so.8
2025-01-09 12:46:59.757967: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcuda.so.1
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-09 12:47:01,131 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-09 12:47:01,153 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
WARNING:tensorflow:TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-09 12:47:01,156 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
2025-01-09 12:47:01,473 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.common.export.app 264: Saving exported model to /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.epoch-5.uff
2025-01-09 12:47:01,473 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.mask_rcnn.utils.spec_loader 47: Loading specification from /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_train_resnet10.txt
2025-01-09 12:47:01,474 [TAO Toolkit] [INFO] root 2082: Loading weights from /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.epoch-5.tlt
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpyu1yaprq', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
  allow_growth: true
  force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: TWO
  }
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7cfe22221610>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2025-01-09 12:47:01,705 [TAO Toolkit] [INFO] tensorflow 212: Using config: {'_model_dir': '/tmp/tmpyu1yaprq', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
  allow_growth: true
  force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: TWO
  }
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7cfe22221610>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
INFO:tensorflow:Create CheckpointSaverHook.
2025-01-09 12:47:01,705 [TAO Toolkit] [INFO] tensorflow 541: Create CheckpointSaverHook.
[MaskRCNN] INFO    : [*] Limiting the amount of sample to: 5000
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/third_party/keras/tensorflow_backend.py:361: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2025-01-09 12:47:01,745 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/third_party/keras/tensorflow_backend.py:361: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

2025-01-09 12:47:01,752 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
2025-01-09 12:47:02,434 [TAO Toolkit] [WARNING] tensorflow 1776: The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
2025-01-09 12:47:02,436 [TAO Toolkit] [WARNING] tensorflow 1776: The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
2025-01-09 12:47:02,439 [TAO Toolkit] [WARNING] tensorflow 1776: The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
WARNING:tensorflow:The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
2025-01-09 12:47:02,442 [TAO Toolkit] [WARNING] tensorflow 1776: The operation `tf.image.convert_image_dtype` will be skipped since the input and output dtypes are identical.
INFO:tensorflow:Calling model_fn.
2025-01-09 12:47:03,051 [TAO Toolkit] [INFO] tensorflow 1148: Calling model_fn.
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : Loading model graph...
[MaskRCNN] INFO    : ***********************
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_2/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_3/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_4/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_5/
[MaskRCNN] INFO    : [ROI OPs] Using Batched NMS... Scope: MLP/multilevel_propose_rois/level_6/
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
image_input (ImageInput)        [(8, 3, 640, 640)]   0                                            
__________________________________________________________________________________________________
conv1 (Conv2D)                  (8, 64, 320, 320)    9408        image_input[0][0]                
__________________________________________________________________________________________________
bn_conv1 (BatchNormalization)   (8, 64, 320, 320)    256         conv1[0][0]                      
__________________________________________________________________________________________________
activation (Activation)         (8, 64, 320, 320)    0           bn_conv1[0][0]                   
__________________________________________________________________________________________________
max_pooling2d (MaxPooling2D)    (8, 64, 160, 160)    0           activation[0][0]                 
__________________________________________________________________________________________________
block_1a_conv_1 (Conv2D)        (8, 64, 160, 160)    36864       max_pooling2d[0][0]              
__________________________________________________________________________________________________
block_1a_bn_1 (BatchNormalizati (8, 64, 160, 160)    256         block_1a_conv_1[0][0]            
__________________________________________________________________________________________________
block_1a_relu_1 (Activation)    (8, 64, 160, 160)    0           block_1a_bn_1[0][0]              
__________________________________________________________________________________________________
block_1a_conv_2 (Conv2D)        (8, 64, 160, 160)    36864       block_1a_relu_1[0][0]            
__________________________________________________________________________________________________
block_1a_conv_shortcut (Conv2D) (8, 64, 160, 160)    4096        max_pooling2d[0][0]              
__________________________________________________________________________________________________
block_1a_bn_2 (BatchNormalizati (8, 64, 160, 160)    256         block_1a_conv_2[0][0]            
__________________________________________________________________________________________________
block_1a_bn_shortcut (BatchNorm (8, 64, 160, 160)    256         block_1a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add (Add)                       (8, 64, 160, 160)    0           block_1a_bn_2[0][0]              
                                                                 block_1a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_1a_relu (Activation)      (8, 64, 160, 160)    0           add[0][0]                        
__________________________________________________________________________________________________
block_2a_conv_1 (Conv2D)        (8, 128, 80, 80)     73728       block_1a_relu[0][0]              
__________________________________________________________________________________________________
block_2a_bn_1 (BatchNormalizati (8, 128, 80, 80)     512         block_2a_conv_1[0][0]            
__________________________________________________________________________________________________
block_2a_relu_1 (Activation)    (8, 128, 80, 80)     0           block_2a_bn_1[0][0]              
__________________________________________________________________________________________________
block_2a_conv_2 (Conv2D)        (8, 128, 80, 80)     147456      block_2a_relu_1[0][0]            
__________________________________________________________________________________________________
block_2a_conv_shortcut (Conv2D) (8, 128, 80, 80)     8192        block_1a_relu[0][0]              
__________________________________________________________________________________________________
block_2a_bn_2 (BatchNormalizati (8, 128, 80, 80)     512         block_2a_conv_2[0][0]            
__________________________________________________________________________________________________
block_2a_bn_shortcut (BatchNorm (8, 128, 80, 80)     512         block_2a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_1 (Add)                     (8, 128, 80, 80)     0           block_2a_bn_2[0][0]              
                                                                 block_2a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_2a_relu (Activation)      (8, 128, 80, 80)     0           add_1[0][0]                      
__________________________________________________________________________________________________
block_3a_conv_1 (Conv2D)        (8, 256, 40, 40)     294912      block_2a_relu[0][0]              
__________________________________________________________________________________________________
block_3a_bn_1 (BatchNormalizati (8, 256, 40, 40)     1024        block_3a_conv_1[0][0]            
__________________________________________________________________________________________________
block_3a_relu_1 (Activation)    (8, 256, 40, 40)     0           block_3a_bn_1[0][0]              
__________________________________________________________________________________________________
block_3a_conv_2 (Conv2D)        (8, 256, 40, 40)     589824      block_3a_relu_1[0][0]            
__________________________________________________________________________________________________
block_3a_conv_shortcut (Conv2D) (8, 256, 40, 40)     32768       block_2a_relu[0][0]              
__________________________________________________________________________________________________
block_3a_bn_2 (BatchNormalizati (8, 256, 40, 40)     1024        block_3a_conv_2[0][0]            
__________________________________________________________________________________________________
block_3a_bn_shortcut (BatchNorm (8, 256, 40, 40)     1024        block_3a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_2 (Add)                     (8, 256, 40, 40)     0           block_3a_bn_2[0][0]              
                                                                 block_3a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_3a_relu (Activation)      (8, 256, 40, 40)     0           add_2[0][0]                      
__________________________________________________________________________________________________
block_4a_conv_1 (Conv2D)        (8, 512, 20, 20)     1179648     block_3a_relu[0][0]              
__________________________________________________________________________________________________
block_4a_bn_1 (BatchNormalizati (8, 512, 20, 20)     2048        block_4a_conv_1[0][0]            
__________________________________________________________________________________________________
block_4a_relu_1 (Activation)    (8, 512, 20, 20)     0           block_4a_bn_1[0][0]              
__________________________________________________________________________________________________
block_4a_conv_2 (Conv2D)        (8, 512, 20, 20)     2359296     block_4a_relu_1[0][0]            
__________________________________________________________________________________________________
block_4a_conv_shortcut (Conv2D) (8, 512, 20, 20)     131072      block_3a_relu[0][0]              
__________________________________________________________________________________________________
block_4a_bn_2 (BatchNormalizati (8, 512, 20, 20)     2048        block_4a_conv_2[0][0]            
__________________________________________________________________________________________________
block_4a_bn_shortcut (BatchNorm (8, 512, 20, 20)     2048        block_4a_conv_shortcut[0][0]     
__________________________________________________________________________________________________
add_3 (Add)                     (8, 512, 20, 20)     0           block_4a_bn_2[0][0]              
                                                                 block_4a_bn_shortcut[0][0]       
__________________________________________________________________________________________________
block_4a_relu (Activation)      (8, 512, 20, 20)     0           add_3[0][0]                      
__________________________________________________________________________________________________
l5 (Conv2D)                     (8, 256, 20, 20)     131328      block_4a_relu[0][0]              
__________________________________________________________________________________________________
l4 (Conv2D)                     (8, 256, 40, 40)     65792       block_3a_relu[0][0]              
__________________________________________________________________________________________________
FPN_up_4 (UpSampling2D)         (8, 256, 40, 40)     0           l5[0][0]                         
__________________________________________________________________________________________________
FPN_add_4 (Add)                 (8, 256, 40, 40)     0           l4[0][0]                         
                                                                 FPN_up_4[0][0]                   
__________________________________________________________________________________________________
l3 (Conv2D)                     (8, 256, 80, 80)     33024       block_2a_relu[0][0]              
__________________________________________________________________________________________________
FPN_up_3 (UpSampling2D)         (8, 256, 80, 80)     0           FPN_add_4[0][0]                  
__________________________________________________________________________________________________
FPN_add_3 (Add)                 (8, 256, 80, 80)     0           l3[0][0]                         
                                                                 FPN_up_3[0][0]                   
__________________________________________________________________________________________________
l2 (Conv2D)                     (8, 256, 160, 160)   16640       block_1a_relu[0][0]              
__________________________________________________________________________________________________
FPN_up_2 (UpSampling2D)         (8, 256, 160, 160)   0           FPN_add_3[0][0]                  
__________________________________________________________________________________________________
FPN_add_2 (Add)                 (8, 256, 160, 160)   0           l2[0][0]                         
                                                                 FPN_up_2[0][0]                   
__________________________________________________________________________________________________
post_hoc_d5 (Conv2D)            (8, 256, 20, 20)     590080      l5[0][0]                         
__________________________________________________________________________________________________
post_hoc_d2 (Conv2D)            (8, 256, 160, 160)   590080      FPN_add_2[0][0]                  
__________________________________________________________________________________________________
post_hoc_d3 (Conv2D)            (8, 256, 80, 80)     590080      FPN_add_3[0][0]                  
__________________________________________________________________________________________________
post_hoc_d4 (Conv2D)            (8, 256, 40, 40)     590080      FPN_add_4[0][0]                  
__________________________________________________________________________________________________
p6 (MaxPooling2D)               (8, 256, 10, 10)     0           post_hoc_d5[0][0]                
__________________________________________________________________________________________________
rpn (Conv2D)                    multiple             590080      post_hoc_d2[0][0]                
                                                                 post_hoc_d3[0][0]                
                                                                 post_hoc_d4[0][0]                
                                                                 post_hoc_d5[0][0]                
                                                                 p6[0][0]                         
__________________________________________________________________________________________________
rpn-class (Conv2D)              multiple             771         rpn[0][0]                        
                                                                 rpn[1][0]                        
                                                                 rpn[2][0]                        
                                                                 rpn[3][0]                        
                                                                 rpn[4][0]                        
__________________________________________________________________________________________________
rpn-box (Conv2D)                multiple             3084        rpn[0][0]                        
                                                                 rpn[1][0]                        
                                                                 rpn[2][0]                        
                                                                 rpn[3][0]                        
                                                                 rpn[4][0]                        
__________________________________________________________________________________________________
permute (Permute)               (8, 160, 160, 3)     0           rpn-class[0][0]                  
__________________________________________________________________________________________________
permute_2 (Permute)             (8, 80, 80, 3)       0           rpn-class[1][0]                  
__________________________________________________________________________________________________
permute_4 (Permute)             (8, 40, 40, 3)       0           rpn-class[2][0]                  
__________________________________________________________________________________________________
permute_6 (Permute)             (8, 20, 20, 3)       0           rpn-class[3][0]                  
__________________________________________________________________________________________________
permute_8 (Permute)             (8, 10, 10, 3)       0           rpn-class[4][0]                  
__________________________________________________________________________________________________
permute_1 (Permute)             (8, 160, 160, 12)    0           rpn-box[0][0]                    
__________________________________________________________________________________________________
permute_3 (Permute)             (8, 80, 80, 12)      0           rpn-box[1][0]                    
__________________________________________________________________________________________________
permute_5 (Permute)             (8, 40, 40, 12)      0           rpn-box[2][0]                    
__________________________________________________________________________________________________
permute_7 (Permute)             (8, 20, 20, 12)      0           rpn-box[3][0]                    
__________________________________________________________________________________________________
permute_9 (Permute)             (8, 10, 10, 12)      0           rpn-box[4][0]                    
__________________________________________________________________________________________________
anchor_layer (AnchorLayer)      OrderedDict([(2, (16 0           image_input[0][0]                
__________________________________________________________________________________________________
info_input (InfoInput)          [(8, 5)]             0                                            
__________________________________________________________________________________________________
MLP (MultilevelProposal)        ((8, 1000), (8, 1000 0           permute[0][0]                    
                                                                 permute_2[0][0]                  
                                                                 permute_4[0][0]                  
                                                                 permute_6[0][0]                  
                                                                 permute_8[0][0]                  
                                                                 permute_1[0][0]                  
                                                                 permute_3[0][0]                  
                                                                 permute_5[0][0]                  
                                                                 permute_7[0][0]                  
                                                                 permute_9[0][0]                  
                                                                 anchor_layer[0][0]               
                                                                 anchor_layer[0][1]               
                                                                 anchor_layer[0][2]               
                                                                 anchor_layer[0][3]               
                                                                 anchor_layer[0][4]               
                                                                 info_input[0][0]                 
__________________________________________________________________________________________________
multilevel_crop_resize (Multile (8, 1000, 256, 7, 7) 0           post_hoc_d2[0][0]                
                                                                 post_hoc_d3[0][0]                
                                                                 post_hoc_d4[0][0]                
                                                                 post_hoc_d5[0][0]                
                                                                 p6[0][0]                         
                                                                 MLP[0][1]                        
__________________________________________________________________________________________________
box_head_reshape1 (ReshapeLayer (8000, 12544)        0           multilevel_crop_resize[0][0]     
__________________________________________________________________________________________________
fc6 (Dense)                     (8000, 1024)         12846080    box_head_reshape1[0][0]          
__________________________________________________________________________________________________
fc7 (Dense)                     (8000, 1024)         1049600     fc6[0][0]                        
__________________________________________________________________________________________________
class-predict (Dense)           (8000, 2)            2050        fc7[0][0]                        
__________________________________________________________________________________________________
box-predict (Dense)             (8000, 8)            8200        fc7[0][0]                        
__________________________________________________________________________________________________
box_head_reshape2 (ReshapeLayer (8, 1000, 2)         0           class-predict[0][0]              
__________________________________________________________________________________________________
box_head_reshape3 (ReshapeLayer (8, 1000, 8)         0           box-predict[0][0]                
__________________________________________________________________________________________________
gpu_detections (GPUDetections)  ((8,), (8, 100, 4),  0           box_head_reshape2[0][0]          
                                                                 box_head_reshape3[0][0]          
                                                                 MLP[0][1]                        
                                                                 info_input[0][0]                 
__________________________________________________________________________________________________
multilevel_crop_resize_1 (Multi (8, 100, 256, 14, 14 0           post_hoc_d2[0][0]                
                                                                 post_hoc_d3[0][0]                
                                                                 post_hoc_d4[0][0]                
                                                                 post_hoc_d5[0][0]                
                                                                 p6[0][0]                         
                                                                 gpu_detections[0][1]             
__________________________________________________________________________________________________
mask_head_reshape_1 (ReshapeLay (800, 256, 14, 14)   0           multilevel_crop_resize_1[0][0]   
__________________________________________________________________________________________________
mask-conv-l0 (Conv2D)           (800, 256, 14, 14)   590080      mask_head_reshape_1[0][0]        
__________________________________________________________________________________________________
mask-conv-l1 (Conv2D)           (800, 256, 14, 14)   590080      mask-conv-l0[0][0]               
__________________________________________________________________________________________________
mask-conv-l2 (Conv2D)           (800, 256, 14, 14)   590080      mask-conv-l1[0][0]               
__________________________________________________________________________________________________
mask-conv-l3 (Conv2D)           (800, 256, 14, 14)   590080      mask-conv-l2[0][0]               
__________________________________________________________________________________________________
conv5-mask (Conv2DTranspose)    (800, 256, 28, 28)   262400      mask-conv-l3[0][0]               
__________________________________________________________________________________________________
mask_fcn_logits (Conv2D)        (800, 2, 28, 28)     514         conv5-mask[0][0]                 
__________________________________________________________________________________________________
mask_postprocess (MaskPostproce (8, 100, 28, 28)     0           mask_fcn_logits[0][0]            
                                                                 gpu_detections[0][2]             
__________________________________________________________________________________________________
mask_sigmoid (Activation)       (8, 100, 28, 28)     0           mask_postprocess[0][0]           
==================================================================================================
Total params: 24,646,107
Trainable params: 4,822,784
Non-trainable params: 19,823,323
__________________________________________________________________________________________________
INFO:tensorflow:Done calling model_fn.
2025-01-09 12:47:07,435 [TAO Toolkit] [INFO] tensorflow 1150: Done calling model_fn.
INFO:tensorflow:Graph was finalized.
2025-01-09 12:47:07,729 [TAO Toolkit] [INFO] tensorflow 240: Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpzvprwraj/model.ckpt-175000
2025-01-09 12:47:07,731 [TAO Toolkit] [INFO] tensorflow 1284: Restoring parameters from /tmp/tmpzvprwraj/model.ckpt-175000
INFO:tensorflow:Running local_init_op.
2025-01-09 12:47:08,006 [TAO Toolkit] [INFO] tensorflow 500: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2025-01-09 12:47:08,034 [TAO Toolkit] [INFO] tensorflow 502: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 175000 into /tmp/tmp2p82l2u2/model.ckpt.
2025-01-09 12:47:08,895 [TAO Toolkit] [INFO] tensorflow 606: Saving checkpoints for 175000 into /tmp/tmp2p82l2u2/model.ckpt.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/export/exporter.py:244: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2025-01-09 12:47:19,956 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/export/exporter.py:244: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

INFO:tensorflow:Restoring parameters from /tmp/tmp2p82l2u2/model.ckpt-175000
2025-01-09 12:47:20,163 [TAO Toolkit] [INFO] tensorflow 1284: Restoring parameters from /tmp/tmp2p82l2u2/model.ckpt-175000
INFO:tensorflow:Froze 107 variables.
2025-01-09 12:47:20,457 [TAO Toolkit] [INFO] tensorflow 334: Froze 107 variables.
INFO:tensorflow:Converted 107 variables to const ops.
2025-01-09 12:47:20,540 [TAO Toolkit] [INFO] tensorflow 394: Converted 107 variables to const ops.
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/export/exporter.py:287: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

2025-01-09 12:47:20,732 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/export/exporter.py:287: The name tf.reset_default_graph is deprecated. Please use tf.compat.v1.reset_default_graph instead.

2025-01-09 12:47:20,733 [TAO Toolkit] [INFO] numba.cuda.cudadrv.driver 266: init
NOTE: UFF has been tested with TensorFlow 1.15.0.
WARNING: The version of TensorFlow installed on this system is not guaranteed to work with UFF.
Warning: No conversion function registered for layer: MultilevelCropAndResize_TRT yet.
Converting pyramid_crop_and_resize_mask as custom op: MultilevelCropAndResize_TRT
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/uff/converters/tensorflow/converter.py:226: The name tf.AttrValue is deprecated. Please use tf.compat.v1.AttrValue instead.

2025-01-09 12:47:20,995 [TAO Toolkit] [WARNING] tensorflow 137: From /usr/local/lib/python3.8/dist-packages/uff/converters/tensorflow/converter.py:226: The name tf.AttrValue is deprecated. Please use tf.compat.v1.AttrValue instead.

Warning: No conversion function registered for layer: ResizeNearest_TRT yet.
Converting nearest_upsampling_2 as custom op: ResizeNearest_TRT
Warning: No conversion function registered for layer: ResizeNearest_TRT yet.
Converting nearest_upsampling_1 as custom op: ResizeNearest_TRT
Warning: No conversion function registered for layer: ResizeNearest_TRT yet.
Converting nearest_upsampling as custom op: ResizeNearest_TRT
Warning: No conversion function registered for layer: SpecialSlice_TRT yet.
Converting mrcnn_detection_bboxes as custom op: SpecialSlice_TRT
Warning: No conversion function registered for layer: GenerateDetection_TRT yet.
Converting generate_detections as custom op: GenerateDetection_TRT
Warning: No conversion function registered for layer: MultilevelProposeROI_TRT yet.
Converting multilevel_propose_rois as custom op: MultilevelProposeROI_TRT
Warning: No conversion function registered for layer: MultilevelCropAndResize_TRT yet.
Converting pyramid_crop_and_resize_box as custom op: MultilevelCropAndResize_TRT
DEBUG [/usr/local/lib/python3.8/dist-packages/uff/converters/tensorflow/converter.py:143] Marking ['generate_detections', 'mask_fcn_logits/BiasAdd'] as outputs
2025-01-09 12:47:21,217 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.mask_rcnn.export.exporter 300: **Converted model was saved into /workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.epoch-5.uff**
loading annotations into memory...
Done (t=1.15s)
creating index...
index created!
[01/09/2025-12:47:22] [TRT] [I] [MemUsageChange] Init CUDA: CPU +12, GPU +0, now: CPU 654, GPU 848 (MiB)
[01/09/2025-12:47:23] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +546, GPU +118, now: CPU 1254, GPU 966 (MiB)
[01/09/2025-12:47:24] [TRT] [W] The implicit batch dimension mode has been deprecated. Please create the network with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag whenever possible.
[01/09/2025-12:47:24] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +1, GPU +10, now: CPU 1477, GPU 974 (MiB)
[01/09/2025-12:47:24] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 1478, GPU 984 (MiB)
[01/09/2025-12:47:24] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[01/09/2025-12:47:36] [TRT] [I] Some tactics do not have sufficient workspace memory to run. Increasing workspace size will enable more tactics, please check verbose output for requested sizes.
[01/09/2025-12:49:33] [TRT] [I] [GraphReduction] The approximate region cut reduction algorithm is called.
[01/09/2025-12:49:33] [TRT] [I] Total Activation Memory: 10870064128
[01/09/2025-12:49:33] [TRT] [I] Detected 1 inputs and 2 output network tensors.
[01/09/2025-12:49:33] [TRT] [I] Total Host Persistent Memory: 125232
[01/09/2025-12:49:33] [TRT] [I] Total Device Persistent Memory: 5828608
[01/09/2025-12:49:33] [TRT] [I] Total Scratch Memory: 854353408
[01/09/2025-12:49:33] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 67 MiB, GPU 3069 MiB
[01/09/2025-12:49:33] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 118 steps to complete.
[01/09/2025-12:49:33] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 10.272ms to assign 20 blocks to 118 nodes requiring 2348518400 bytes.
[01/09/2025-12:49:33] [TRT] [I] Total Activation Memory: 2348518400
[01/09/2025-12:49:34] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1894, GPU 1094 (MiB)
[01/09/2025-12:49:34] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 1895, GPU 1104 (MiB)
[01/09/2025-12:49:34] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +19, GPU +100, now: CPU 19, GPU 100 (MiB)
Execution status: PASS
2025-01-09 18:19:44,045 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Please help.

Thanks.

Mask R-CNN in TAO does not currently support export to ONNX, which is why the export produces a .uff file. I suggest using trtexec to generate a TensorRT engine for running inference on GPU.
Please refer to Converting TAO-trained MaskRCNN models to ONNX for CPU inference - #4 by telo.
Mask2Former is also available as an alternative. Please refer to the latest TAO 5.5 user guide or notebook.
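
As a rough sketch of the trtexec route, the snippet below only assembles the command line so it can be reviewed before running. It assumes a TensorRT 8.x install whose trtexec still accepts UFF models (the UFF parser was removed in TensorRT 10). The output names are taken from the "Marking [...] as outputs" line in the export log. The input tensor name `Input`, the input dimensions `3,832,1344`, and the engine file name are assumptions, so replace them with the values from your training spec. The helper `build_trtexec_command` is a hypothetical name used only for this illustration.

```python
import shlex


def build_trtexec_command(uff_path, engine_path, input_name, input_chw,
                          output_names, fp16=True):
    """Assemble a trtexec command line for a UFF model (TensorRT 8.x flags)."""
    c, h, w = input_chw
    cmd = [
        "trtexec",
        f"--uff={uff_path}",
        # UFF inputs are declared as name,C,H,W
        f"--uffInput={input_name},{c},{h},{w}",
    ]
    # --output may be given once per output tensor
    cmd += [f"--output={name}" for name in output_names]
    cmd.append(f"--saveEngine={engine_path}")
    if fp16:
        cmd.append("--fp16")
    return cmd


cmd = build_trtexec_command(
    # Path reported by the exporter in the log above
    uff_path="/workspace/tao-experiments/mask_rcnn/experiment_dir_unpruned/model.epoch-5.uff",
    engine_path="model.epoch-5.engine",   # assumption: any writable path
    input_name="Input",                   # assumption: check your model
    input_chw=(3, 832, 1344),             # assumption: must match image_size in the spec
    output_names=["generate_detections", "mask_fcn_logits/BiasAdd"],
)
print(shlex.join(cmd))
```

The printed command can then be run in an environment where trtexec and the TensorRT plugin library are available, since the model relies on custom ops such as MultilevelCropAndResize_TRT.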

Dear @Morganh ,

We are trying to train the Mask2Former_inst model, but training crashes automatically after 1 epoch.

Below is the configuration.

results_dir: /results_inst/
dataset:
  contiguous_id: True
  label_map: /specs/labelmap_inst.json
  train:
    type: 'coco'
    name: "coco_2017_train"
    instance_json: "/data/raw-data/annotations/coco_annotations_train_fixed_largeset.json"
    img_dir: "/data/raw-data/train"
    batch_size: 8
    num_workers: 2
  val:
    type: 'coco'
    name: "coco_2017_val"
    instance_json: "/data/raw-data/annotations/coco_annotations_val_fixed_largeset.json"
    img_dir: "/data/raw-data/val"
    batch_size: 1
    num_workers: 2
  test:
    img_dir: /data/raw-data/val
    batch_size: 1
  augmentation:
    train_min_size: [640]
    train_max_size: 640
    train_crop_size: [640, 640]
    test_min_size: 640
    test_max_size: 640
train:
  precision: 'fp16'
  num_gpus: 1
  checkpoint_interval: 1
  validation_interval: 1
  num_epochs: 50
  optim:
    lr_scheduler: "MultiStep"
    milestones: [44, 48]
    type: "AdamW"
    lr: 0.0001
    weight_decay: 0.05
model:
  object_mask_threshold: 0.1
  overlap_threshold: 0.8
  mode: "instance"
  backbone:
    pretrained_weights: "/workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth"
    type: "swin"
    swin:
      type: "tiny"
      window_size: 7
      ape: False
      pretrain_img_size: 224
  mask_former:
    num_object_queries: 100
  sem_seg_head:
    norm: "GN"
    num_classes: 80
export:
  input_channel: 3
  input_width: 640
  input_height: 640
  opset_version: 17
  batch_size: -1  # dynamic batch size
  on_cpu: False
gen_trt_engine:
  gpu_id: 0
  input_channel: 3
  input_width: 640
  input_height: 640
  tensorrt:
    data_type: fp16
    workspace_size: 4096
    min_batch_size: 1
    opt_batch_size: 1
    max_batch_size: 1

Training Section:

print("For multi-GPU, set NUM_TRAIN_GPUS based on your machine.")
os.environ["NUM_TRAIN_GPUS"] = "1"
os.environ["HYDRA_FULL_ERROR"] = "1"
!tao model mask2former train -e $SPECS_DIR/spec_inst1.yaml \
           train.num_gpus=$NUM_TRAIN_GPUS \
           results_dir=$RESULTS_DIR

Training logs:

/usr/local/lib/python3.6/pty.py:84: ResourceWarning: Unclosed socket <zmq.Socket(zmq.PUSH) at 0x782256094648>
  pid, fd = os.forkpty()
For multi-GPU, set NUM_TRAIN_GPUS based on your machine.
2025-01-13 12:04:17,530 [TAO Toolkit] [INFO] root 160: Registry: ['nvcr.io']
2025-01-13 12:04:17,581 [TAO Toolkit] [INFO] nvidia_tao_cli.components.instance_handler.local_instance 361: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt
2025-01-13 12:04:17,603 [TAO Toolkit] [WARNING] nvidia_tao_cli.components.docker_handler.docker_handler 293: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/smarg/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2025-01-13 12:04:17,603 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 301: Printing tty value True
[2025-01-13 06:34:21,081 - TAO Toolkit - matplotlib.font_manager - INFO] generated new fontManager
sys:1: UserWarning: 
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning: 
'spec_inst1.yaml' is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
  _run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
  ret = run_job(
Train results will be saved at: /results_inst/train
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /results_inst/train/status.json
  rank_zero_warn(
Seed set to 1234
loading annotations into memory...
Done (t=5.39s)
creating index...
index created!
/usr/local/lib/python3.10/dist-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3553.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Loading backbone weights from: /workspace/tao-experiments/mask2former/swin_tiny_patch4_window7_224_22k.pth
The backbone weights were loaded successfuly.
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:652: Checkpoint directory /results_inst/train exists and is not empty.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  | Name      | Type            | Params
----------------------------------------------
0 | model     | MaskFormerModel | 47.4 M
1 | criterion | SetCriterion    | 0     
----------------------------------------------
47.4 M    Trainable params
0         Non-trainable params
47.4 M    Total params
189.687   Total estimated model params size (MB)

Sanity Checking: |          | 0/? [00:00<?, ?it/s]
loading annotations into memory...
Done (t=0.88s)
creating index...
index created!

Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00,  2.10it/s]
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/mask2former/model/pl_model.py:443: RuntimeWarning: invalid value encountered in divide
  iou = total_area_intersect / total_area_union

                                                                           
loading annotations into memory...
Done (t=5.51s)
creating index...
index created!

Epoch 0: 100%|██████████| 6250/6250 [1:25:22<00:00,  1.22it/s, v_num=1, train_loss=6.460, lr=0.0001]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/7927 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/7927 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 1/7927 [00:00<17:36,  7.50it/s]
Validation DataLoader 0:   0%|          | 2/7927 [00:00<16:02,  8.23it/s]
Validation DataLoader 0:   0%|          | 3/7927 [00:00<15:23,  8.58it/s]
Validation DataLoader 0:   0%|          | 4/7927 [00:00<14:18,  9.23it/s]
Validation DataLoader 0:   0%|          | 5/7927 [00:00<13:48,  9.56it/s]
Validation DataLoader 0:   0%|          | 6/7927 [00:00<12:56, 10.20it/s]
Validation DataLoader 0:   0%|          | 7/7927 [00:00<12:38, 10.44it/s]
Validation DataLoader 0:   0%|          | 8/7927 [00:00<12:48, 10.31it/s]
Validation DataLoader 0:   0%|          | 9/7927 [00:00<12:54, 10.22it/s]
Validation DataLoader 0:   0%|          | 10/7927 [00:00<13:02, 10.12it/s]
Validation DataLoader 0:   0%|          | 11/7927 [00:01<12:37, 10.45it/s]
Validation DataLoader 0:   0%|          | 12/7927 [00:01<12:17, 10.73it/s]
Validation DataLoader 0:   0%|          | 13/7927 [00:01<11:59, 10.99it/s]
Validation DataLoader 0:   0%|          | 14/7927 [00:01<11:54, 11.08it/s]
Validation DataLoader 0:   0%|          | 15/7927 [00:01<12:01, 10.97it/s]
Validation DataLoader 0:   0%|          | 16/7927 [00:01<12:10, 10.83it/s]
Validation DataLoader 0:   0%|          | 17/7927 [00:01<12:16, 10.74it/s]
Validation DataLoader 0:   0%|          | 18/7927 [00:01<12:25, 10.61it/s]
Validation DataLoader 0:   0%|          | 19/7927 [00:01<12:30, 10.54it/s]
Validation DataLoader 0:   0%|          | 20/7927 [00:01<12:25, 10.61it/s]
Validation DataLoader 0:   0%|          | 21/7927 [00:01<12:28, 10.56it/s]
Validation DataLoader 0:   0%|          | 22/7927 [00:02<12:24, 10.62it/s]
Validation DataLoader 0:   0%|          | 23/7927 [00:02<12:28, 10.57it/s]
Validation DataLoader 0:   0%|          | 24/7927 [00:02<12:32, 10.50it/s]
Validation DataLoader 0:   0%|          | 25/7927 [00:02<12:36, 10.45it/s]
Validation DataLoader 0:   0%|          | 26/7927 [00:02<12:39, 10.41it/s]
Validation DataLoader 0:   0%|          | 27/7927 [00:02<12:42, 10.36it/s]
Validation DataLoader 0:   0%|          | 28/7927 [00:02<12:44, 10.33it/s]
Validation DataLoader 0:   0%|          | 29/7927 [00:02<12:40, 10.38it/s]
Validation DataLoader 0:   0%|          | 30/7927 [00:02<12:38, 10.41it/s]
Validation DataLoader 0:   0%|          | 31/7927 [00:02<12:41, 10.37it/s]
Validation DataLoader 0:   0%|          | 32/7927 [00:03<12:43, 10.34it/s]
Validation DataLoader 0:   0%|          | 33/7927 [00:03<12:45, 10.31it/s]
Validation DataLoader 0:   0%|          | 34/7927 [00:03<12:47, 10.28it/s]
Validation DataLoader 0:   0%|          | 35/7927 [00:03<12:44, 10.32it/s]
Validation DataLoader 0:   0%|          | 36/7927 [00:03<12:46, 10.30it/s]
Validation DataLoader 0:   0%|          | 37/7927 [00:03<12:43, 10.34it/s]
Validation DataLoader 0:   0%|          | 38/7927 [00:03<12:37, 10.42it/s]
Validation DataLoader 0:   0%|          | 39/7927 [00:03<12:35, 10.45it/s]
Validation DataLoader 0:   1%|          | 40/7927 [00:03<12:38, 10.40it/s]
Validation DataLoader 0:   1%|          | 41/7927 [00:03<12:37, 10.41it/s]
Validation DataLoader 0:   1%|          | 42/7927 [00:04<12:36, 10.42it/s]
Validation DataLoader 0:   1%|          | 43/7927 [00:04<12:38, 10.40it/s]
.
.
.
.
.
.
.

Validation DataLoader 0: 100%|█████████▉| 7914/7927 [12:50<00:01, 10.28it/s]
Validation DataLoader 0: 100%|█████████▉| 7915/7927 [12:50<00:01, 10.28it/s]
Validation DataLoader 0: 100%|█████████▉| 7916/7927 [12:50<00:01, 10.28it/s]
Validation DataLoader 0: 100%|█████████▉| 7917/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7918/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7919/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7920/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7921/7927 [12:50<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7922/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7923/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7924/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7925/7927 [12:51<00:00, 10.27it/s]
Validation DataLoader 0: 100%|█████████▉| 7926/7927 [12:51<00:00, 10.28it/s]
Validation DataLoader 0: 100%|██████████| 7927/7927 [12:51<00:00, 10.28it/s]
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/mask2former/model/pl_model.py:443: RuntimeWarning: invalid value encountered in divide
  iou = total_area_intersect / total_area_union


                                                                            
Epoch 0: 100%|██████████| 6250/6250 [1:38:14<00:00,  1.06it/s, v_num=1, train_loss=6.460, lr=0.0001, val_loss=11.20, mIoU=1.000, all_acc=1.000]
[2025-01-13 08:15:15,069 - TAO Toolkit - root - INFO] Sending telemetry data.
[2025-01-13 08:15:15,082 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2025-01-13 08:15:15,085 - TAO Toolkit - root - INFO] Sending {'version': '5.5.0', 'action': 'train', 'network': 'mask2former', 'gpu': ['NVIDIA-RTX-A4000'], 'success': False, 'time_lapsed': 6053} to https://api.tao.ngc.nvidia.com.
[2025-01-13 08:15:16,813 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2025-01-13 08:15:16,814 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2025-01-13 08:15:16,814 - TAO Toolkit - root - WARNING] Execution status: FAIL
2025-01-13 13:45:20,751 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Where are we making a mistake? Please help.

Thanks.

Could you please create a new forum topic? Thanks!

Yes, we have created one: Mask2Former_inst model training crashed after 1 epoch