PeopleSegNet cannot be used to do inference

Hello. I’ve tried running inference with three pretrained models from the official PeopleSegNet page on NGC: peoplesegnet_resnet50.etlt, peoplesegnet_resnet50.tlt, and peoplesegnet_resnet50.step-20000.tlt.

However, I got a different error for each of them:

  1. peoplesegnet_resnet50.etlt: ValueError: Model extension needs to be either .engine or .tlt.

  2. peoplesegnet_resnet50.tlt: AssertionError: The pruned model must be retrained first.

  3. peoplesegnet_resnet50.step-20000.tlt:

WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
2023-03-28 09:38:42,813 [INFO] root: Starting MaskRCNN inference.
Label file does not exist. Skipping...
2023-03-28 09:38:42,813 [INFO] iva.mask_rcnn.utils.spec_loader: Loading specification from /workspace/tao-experiments/maskrcnn_retrain_resnet50.txt
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmpuk9wazjk', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
  allow_growth: true
  force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: TWO
  }
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f89c34f7080>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
2023-03-28 09:38:42,816 [INFO] tensorflow: Using config: {'_model_dir': '/tmp/tmpuk9wazjk', '_tf_random_seed': 123, '_save_summary_steps': None, '_save_checkpoints_steps': None, '_save_checkpoints_secs': None, '_session_config': gpu_options {
  allow_growth: true
  force_gpu_compatible: true
}
allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: TWO
  }
}
, '_keep_checkpoint_max': 20, '_keep_checkpoint_every_n_hours': None, '_log_step_count_steps': None, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f89c34f7080>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
[MaskRCNN] INFO    : Running inference...
[MaskRCNN] INFO    : Loading weights from /workspace/tao-experiments/peoplesegnet_resnet50.step-20000.tlt
2023-03-28 09:38:45,094 [INFO] root: The last checkpoint file is not saved properly.                 Please delete it and rerun the script.
Traceback (most recent call last):
  File "<frozen iva.mask_rcnn.executer.distributed_executer>", line 352, in extract_ckpt
  File "/usr/lib/python3.6/zipfile.py", line 1131, in __init__
    self._RealGetContents()
  File "/usr/lib/python3.6/zipfile.py", line 1198, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/mask_rcnn/scripts/inference.py>", line 3, in <module>
  File "<frozen iva.mask_rcnn.scripts.inference>", line 390, in <module>
  File "<frozen iva.mask_rcnn.scripts.inference>", line 378, in <module>
  File "<frozen iva.mask_rcnn.scripts.inference>", line 365, in main
  File "<frozen iva.mask_rcnn.scripts.inference>", line 311, in infer
  File "<frozen iva.mask_rcnn.executer.distributed_executer>", line 503, in infer
  File "<frozen iva.mask_rcnn.executer.distributed_executer>", line 357, in extract_ckpt
OSError: The last checkpoint file is not saved properly.                 Please delete it and rerun the script.
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL

This is my command:

docker run -it --rm -v /home/ubuntu/tao_test_2023/tensorflow_train/model_zoo/peoplesegnet:/workspace/tao-experiments \
                    nvcr.io/nvidia/tao/tao-toolkit:4.0.1-tf1.15.5 \
                    mask_rcnn inference \
                    -i /workspace/tao-experiments/data -o /workspace/tao-experiments/ \
                    -e /workspace/tao-experiments/maskrcnn_retrain_resnet50.txt \
                    -m /workspace/tao-experiments/peoplesegnet_resnet50.tlt \
                    -l /workspace/tao-experiments/coco_labels.txt -t 0.5 \
                    -k nvidia_tao \
                    --include_mask

How can I generate inference results correctly?

When running inference with “mask_rcnn inference”, pass either a .tlt file or an .engine file as the model.
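A minimal sketch (not from the original thread): the .etlt file on NGC is a deployable, exported model, so it cannot be passed to “-m” directly (that is why the .etlt attempt failed with the extension error); it would first need to be converted into a TensorRT .engine with tao-converter. The key, input dimensions, and output node names below are assumptions and should be verified against the PeopleSegNet model card before use:

# Sketch only: key, dimensions and output node names are assumptions; confirm them on NGC.
tao-converter -k <encryption key from the model card> \
              -d 3,576,960 \
              -o generate_detections,mask_fpn_reshape \
              -t fp16 \
              -e /workspace/tao-experiments/peoplesegnet_resnet50.engine \
              /workspace/tao-experiments/peoplesegnet_resnet50.etlt

The resulting .engine file could then be supplied to “-m” in the inference command.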

Can you share your training spec file and command?

Here is the spec file I used for inference. Which parts should I edit to fix the error “OSError: The last checkpoint file is not saved properly”?

seed: 123
use_amp: False
warmup_steps: 10000
learning_rate_steps: "[100000, 150000, 200000]"
learning_rate_decay_levels: "[0.1, 0.02, 0.01]"
total_steps: 250000
train_batch_size: 2
eval_batch_size: 4
num_steps_per_eval: 5000
momentum: 0.9
l2_weight_decay: 0.00004
warmup_learning_rate: 0.0001
init_learning_rate: 0.005
num_examples_per_epoch: 118288
pruned_model_path: "/workspace/tao-experiments/mask_rcnn/peoplesegnet_resnet50.step-20000.tlt"

data_config{
    image_size: "(576, 960)"
    augment_input_data: True
    eval_samples: 500
    training_file_pattern: "/workspace/tao-experiments/data/train*.tfrecord"
    validation_file_pattern: "/workspace/tao-experiments/data/val*.tfrecord"
    val_json_file: "/workspace/tao-experiments/data/raw-data/annotations/instances_val2017.json"

    # dataset specific parameters
    num_classes: 91
    skip_crowd_during_training: True
}

maskrcnn_config {
    nlayers: 50
    arch: "resnet"
    freeze_bn: True
    freeze_blocks: "[0,1]"
    gt_mask_size: 112
        
    # Region Proposal Network
    rpn_positive_overlap: 0.7
    rpn_negative_overlap: 0.3
    rpn_batch_size_per_im: 256
    rpn_fg_fraction: 0.5
    rpn_min_size: 0.

    # Proposal layer.
    batch_size_per_im: 512
    fg_fraction: 0.25
    fg_thresh: 0.5
    bg_thresh_hi: 0.5
    bg_thresh_lo: 0.

    # Faster-RCNN heads.
    fast_rcnn_mlp_head_dim: 1024
    bbox_reg_weights: "(10., 10., 5., 5.)"

    # Mask-RCNN heads.
    include_mask: True
    mrcnn_resolution: 28

    # training
    train_rpn_pre_nms_topn: 2000
    train_rpn_post_nms_topn: 1000
    train_rpn_nms_threshold: 0.7

    # evaluation
    test_detections_per_image: 100
    test_nms: 0.5
    test_rpn_pre_nms_topn: 1000
    test_rpn_post_nms_topn: 1000
    test_rpn_nms_thresh: 0.7

    # model architecture
    min_level: 2
    max_level: 6
    num_scales: 1
    aspect_ratios: "[(1.0, 1.0), (1.4, 0.7), (0.7, 1.4)]"
    anchor_scale: 8

    # localization loss
    rpn_box_loss_weight: 1.0
    fast_rcnn_box_loss_weight: 1.0
    mrcnn_weight_loss_mask: 1.0
}

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Please refer to the command described in MaskRCNN - NVIDIA Docs.
You can set the model with “-m”.

tao mask_rcnn inference [-h] -i <input directory>
                             -o <output annotated image directory>
                             -e <experiment spec file>
                             -m <model file>
                             -k <key>
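
For illustration only, here is a filled-in version of that usage reusing the paths from the command earlier in this thread (the model path and key are placeholders; the key passed with -k must match the one used to train or export that particular model):

tao mask_rcnn inference -i /workspace/tao-experiments/data \
                        -o /workspace/tao-experiments/ \
                        -e /workspace/tao-experiments/maskrcnn_retrain_resnet50.txt \
                        -m /workspace/tao-experiments/peoplesegnet_resnet50.tlt \
                        -k <key matching the model>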
