TAO MaskRCNN inference output problem

• Network Type (Mask_rcnn)
• TLT Version: 5.0.0-deploy
• How to reproduce the issue ?

I am trying to run inference on a MaskRCNN task and export the predictions as COCO annotations in any machine-readable format (txt, JSON, etc.).
The documentation lists an --out_label_path flag for specifying the output path, but this flag is not available in our installation:

usage: mask_rcnn inference [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS] [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp] [--log_file LOG_FILE] -m MODEL_PATH -i IMAGE_DIR [-k KEY]
                           [-c CLASS_MAP] [-t THRESHOLD] [--include_mask] -e EXPERIMENT_SPEC [-r RESULTS_DIR]
                           {train,prune,inference_trt,inference,export,evaluate,dataset_convert} ...

optional arguments:
  -h, --help            show this help message and exit
  --num_processes NUM_PROCESSES, -np NUM_PROCESSES
                        The number of horovod child processes to be spawned. Default is -1(equal to --gpus).
  --gpus GPUS           The number of GPUs to be used for the job.
  --gpu_index GPU_INDEX [GPU_INDEX ...]
                        The indices of the GPU's to be used.
  --use_amp             Flag to enable Auto Mixed Precision.
  --log_file LOG_FILE   Path to the output log file.
  -m MODEL_PATH, --model_path MODEL_PATH
                        Path to a MaskRCNN model.
  -i IMAGE_DIR, --image_dir IMAGE_DIR
                        Path to the input image directory.
  -k KEY, --key KEY     Encryption key.
  -c CLASS_MAP, --class_map CLASS_MAP
                        Path to the label file.
  -t THRESHOLD, --threshold THRESHOLD
                        Bbox confidence threshold.
  --include_mask        Whether to draw masks.
  -e EXPERIMENT_SPEC, --experiment_spec EXPERIMENT_SPEC
                        Path to spec file. Absolute path or relative to working directory. If not specified, default spec from spec_loader.py is used.
  -r RESULTS_DIR, --results_dir RESULTS_DIR
                        Output directory where the status log is saved.

I checked the versions and TAO is at version 5.0.0. I am using the exact command as specified in the Getting Started notebook:

tao model mask_rcnn inference -i $DATA_DOWNLOAD_DIR/infer_samples \
                        -e $SPECS_DIR/maskrcnn_retrain_resnet50.txt \
                        -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/model.epoch-$NUM_EPOCH.tlt \
                        -c $SPECS_DIR/coco_labels.txt \
                        -r $INFERENCE_OUTPUT_DATA_DIR \
                        -t 0.5 \
                        --include_mask

The images are exported correctly, but that is not a usable format for our purposes.
Any help is appreciated.

Best,
PA

The output label folder will be auto generated according to https://github.com/NVIDIA/tao_tensorflow1_backend/blob/c7a3926ddddf3911842e057620bceb45bb5303cc/nvidia_tao_tf1/cv/mask_rcnn/scripts/inference.py#L321.

First of all, thank you for the quick reply.

According to this, depending on whether the model is TLT or TRTengine, either infer() or infer_trt() will run.

infer_trt() does indeed create an out_label_path, which I assume stores the output labels.

However, infer() does not seem to create such an output folder. Could you please check if my finding is correct?

I am trying to run inference on TLT models, and would like to avoid having to compile them for TRT.

Kind regards,
PA

I followed the call chain all the way to evaluation.py:infer() and, from what I’m seeing, label txt files are only written for KITTI labels (line 315), whereas images with drawn annotations are always saved (line 311).

Yes, correct. This flag is only supported with the TensorRT engine.
Refer to MaskRCNN - NVIDIA Docs

  • -l, --out_label_path: The directory for predicted labels in COCO format. This argument is only supported with the TensorRT engine.

To generate a TensorRT engine, there are at least two ways.

  1. Using trtexec, refer to TRTEXEC with Mask RCNN - NVIDIA Docs.
  2. Using tao deploy docker. Run gen_trt_engine. Refer to Mask RCNN with TAO Deploy - NVIDIA Docs
    Source code: https://github.com/NVIDIA/tao_deploy/tree/main/nvidia_tao_deploy/cv/mask_rcnn/scripts

So it is not possible to export the labels when using a TLT model, but only the labeled images?
This would seem like an oversight. It would be really helpful when troubleshooting the model if we could run inference and export the labels without having to wait extra time to compile.

For now, I guess I will have to export to TRT, but that seems like an unnecessary step. Am I missing something here?

Thanks,
PA

Hi,
Actually, when running inference against a .tlt model, the output images are already annotated with bounding boxes, which can help with troubleshooting.
You can also modify https://github.com/NVIDIA/tao_tensorflow1_backend/blob/c7a3926ddddf3911842e057620bceb45bb5303cc/nvidia_tao_tf1/cv/mask_rcnn/utils/evaluation.py#L313-L327 to get the label files. Steps: log in to the Docker container and modify /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/utils/evaluation.py, then run the inference command inside the container.
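
As a rough illustration of the kind of change suggested above, the sketch below turns one image's detections into COCO-style result records and writes them to a JSON file. It is not the code in evaluation.py. The function names (to_coco_records, write_coco_json) are hypothetical, and the input layout (boxes as [y1, x1, y2, x2] in pixels, with parallel lists of scores and class ids) is an assumption that has to be adapted to the variables actually available inside infer(). Masks are left out; they would additionally need to be encoded, for example as RLE.

```python
import json


def to_coco_records(image_id, boxes, scores, class_ids, threshold=0.5):
    """Convert one image's detections to COCO-style result dicts.

    Assumption: each box is [y1, x1, y2, x2] in pixels. Swap the
    unpacking below if the predictions use a different order.
    """
    records = []
    for box, score, class_id in zip(boxes, scores, class_ids):
        if score < threshold:
            continue  # drop low-confidence detections, similar to -t
        y1, x1, y2, x2 = box
        records.append({
            "image_id": image_id,
            "category_id": int(class_id),
            # COCO stores boxes as [x, y, width, height].
            "bbox": [float(x1), float(y1), float(x2 - x1), float(y2 - y1)],
            "score": float(score),
        })
    return records


def write_coco_json(records, path):
    """Write all collected records as a single JSON list."""
    with open(path, "w") as f:
        json.dump(records, f)
```

Calling to_coco_records once per image and extending a shared list, then calling write_coco_json at the end of the run, would produce one results file for the whole image directory.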

Hi, thank you for the assistance.
We are running it for now after compiling to TRT, but we are getting two errors.

1. If we run inference()

tao model mask_rcnn inference -i $DATA_DOWNLOAD_DIR/infer_samples \
                            -e $SPECS_DIR/maskrcnn_retrain_resnet50.txt \
                            -m $USER_EXPERIMENT_DIR/export/model.epoch-$NUM_EPOCH.engine \
                            -o $INFERENCE_OUTPUT_DATA_DIR \
                            -t 0.5 \
                            --include_mask
                            #-l $INFERENCE_OUTPUT_DATA_DIR \

we get an error that our flags are incorrect:

usage: mask_rcnn inference [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS] [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp] [--log_file LOG_FILE] -m MODEL_PATH -i IMAGE_DIR [-k KEY]
                           [-c CLASS_MAP] [-t THRESHOLD] [--include_mask] -e EXPERIMENT_SPEC [-r RESULTS_DIR]
                           {train,prune,inference_trt,inference,export,evaluate,dataset_convert} ...
mask_rcnn inference: error: argument /tasks: invalid choice: '/workspace/tao-experiments/mask_rcnn/inference' (choose from 'train', 'prune', 'inference_trt', 'inference', 'export', 'evaluate', 'dataset_convert')

2. If we run inference_trt()

tao model mask_rcnn inference_trt -i $DATA_DOWNLOAD_DIR/infer_samples \
                            -e $SPECS_DIR/maskrcnn_retrain_resnet50.txt \
                            -m $USER_EXPERIMENT_DIR/export/model.epoch-$NUM_EPOCH.engine \
                            -o $INFERENCE_OUTPUT_DATA_DIR \
                            -t 0.5 \
                            --include_mask
                            #-l $INFERENCE_OUTPUT_DATA_DIR \

we get the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/inference_trt.py", line 416, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/inference_trt.py", line 409, in main
    inferencer.infer(arguments.in_image_path,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/inference_trt.py", line 318, in infer
    self._inference_folder(img_in_path, img_out_path, label_out_path, draw_mask)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/inference_trt.py", line 288, in _inference_folder
    y_pred_decoded = self._predict_batch(inf_inputs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/inference_trt.py", line 208, in _predict_batch
    y_pred = self.pred_fn(np.array(inf_inputs))
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/inferencer/trt_inferencer.py", line 126, in infer_batch
    results = do_inference(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/inferencer/engine.py", line 45, in do_inference
    stream.synchronize()
pycuda._driver.LogicError: cuStreamSynchronize failed: an illegal memory access was encountered
[10/23/2023-12:41:58] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::42] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[10/23/2023-12:41:58] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::42] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[10/23/2023-12:41:58] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaStream::47] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
- - - x10 - - -
[10/23/2023-12:41:59] [TRT] [E] /workspace/trt_oss_src/TensorRT/plugin/common/plugin.h (134) - Cuda Error in ~CudaBind: 46 (CUDA-capable device(s) is/are busy or unavailable)
terminate called after throwing an instance of 'nvinfer1::plugin::CudaError'
  what():  std::exception
Execution status: FAIL
2023-10-23 15:42:04,895 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.

Can you open a new terminal and run the commands below to double-check?
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
Then, inside the container:
# mask_rcnn inference xxx
# mask_rcnn inference_trt xxx

Hi Morgan,

I tried it and it worked for most images. I got the same problem only for a specific subset of images that are probably corrupt in some way. There is no clear indication of what is wrong: they open fine with the Linux image viewer and OpenCV, and TAO gives no clear error.

Thank you for all your help, our pipeline is working. However, it would be much appreciated if the documentation page were updated with correct information on the inference() and inference_trt() functions and their arguments. It has been a confusing solution to a simple problem.

Kind regards,
PA

May I know what is different about these specific images? Are they of higher resolution, or different in some other way?

Got it. We will improve the documentation. Thanks for catching this.

We are still looking into it.
All images in the set come from a single video, so they are similar in every way. We used FFmpeg and OpenCV to re-split the video, and we still have the same problem. We used OpenCV, ImageMagick and the Linux image viewer to check the images for corruption or other differences and found none.

We are also thinking about the corner case where there may be predicted masks that overflow from the image’s borders.
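
One way to probe that corner case is to test whether any predicted box extends past the image and, if so, clip it. The sketch below is purely illustrative and is not taken from TAO. box_overflows and clip_box are hypothetical helpers, and boxes are assumed to be [y1, x1, y2, x2] in pixels.

```python
def box_overflows(box, width, height):
    """Return True if any edge of the box lies outside the image."""
    y1, x1, y2, x2 = box
    return y1 < 0 or x1 < 0 or y2 > height or x2 > width


def clip_box(box, width, height):
    """Clamp every coordinate of the box to the image bounds."""
    y1, x1, y2, x2 = box
    return [
        min(max(y1, 0), height),
        min(max(x1, 0), width),
        min(max(y2, 0), height),
        min(max(x2, 0), width),
    ]
```

Logging the images for which box_overflows returns True would show whether the failing subset matches the images with out-of-bounds predictions.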

Please set a larger max_num_instances and retry.
Default is max_num_instances: 200.

Thank you,

We will be able to check this out hopefully within the next couple of weeks.

Hello,

We tried to rerun with a different dataset (same pipeline source), set max_num_instances: 400 and ran into the same exact problem: some images run through the inference process, others break it.
We are looking into it and have seen that there is at least another post with the same problem, but no solution. Inference with TAO is important to us, so we will keep exploring to find any solution, but would also appreciate any help from here.

Thank you
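
Since some images pass and others break the run, one way to narrow it down is to run the command once per image, each time against a directory that contains only that image, and record which runs exit with a non-zero status. The sketch below assumes this is done from a shell where the inference command can be called directly (for example inside the container). find_failing_images and build_cmd are hypothetical names, not part of TAO.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def find_failing_images(image_paths, build_cmd, timeout=None):
    """Run a command once per image and return the paths that fail.

    build_cmd is a caller-supplied function that receives the temporary
    input directory (str) and returns the argument list to execute.
    """
    failing = []
    for image_path in map(Path, image_paths):
        with tempfile.TemporaryDirectory() as tmp_dir:
            # Each run sees a directory containing exactly one image.
            shutil.copy(image_path, tmp_dir)
            result = subprocess.run(
                build_cmd(tmp_dir), capture_output=True, timeout=timeout
            )
            if result.returncode != 0:
                failing.append(image_path)
    return failing
```

Here build_cmd would return the same mask_rcnn inference_trt argument list used earlier in this thread, with the value of -i replaced by the temporary directory. Loading the engine for every image is slow, so this is only meant as a one-off diagnostic.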

Could you please upload the latest full log? Thanks.

Hi, attached you will find the logfile.
Inference attempt on a single image; with multiple images it will run until it breaks at some point.
inference_trt.log (27.8 KB)

Could you run the commands below and share the result?

$ nvidia-smi

$ docker run --runtime=nvidia -it -v local_folder:/workspace --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy /bin/bash
then,

mask_rcnn inference_trt -i /workspace/tao-experiments/data/infer_samples -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_retrain_resnet50.txt -m /workspace/tao-experiments/mask_rcnn/export/model.epoch-483.engine -o /workspace/tao-experiments/mask_rcnn/inference -l /workspace/tao-experiments/mask_rcnn/inference -t 0.5 --include_mask
(env_tao) minibeast@miniBeast:/mnt/NVME_DATA/env_Projects/viewer_ws/viewer_tao_segmentation$ nvidia-smi
Fri Nov 17 17:29:37 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:01:00.0  On |                  N/A |
|  0%   39C    P8              15W / 300W |    279MiB / 11264MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1248      G   /usr/lib/xorg/Xorg                           35MiB |
|    0   N/A  N/A      2239      G   /usr/lib/xorg/Xorg                          112MiB |
|    0   N/A  N/A      2435      G   /opt/teamviewer/tv_bin/TeamViewer             2MiB |
|    0   N/A  N/A      2459      G   /usr/bin/gnome-shell                         79MiB |
|    0   N/A  N/A     32719      G   ...sion,SpareRendererForSitePerProcess       37MiB |
(env_tao) minibeast@miniBeast:/mnt/NVME_DATA/env_Projects/viewer_ws/viewer_tao_segmentation$ docker run --runtime=nvidia -it -v /mnt/NVME_DATA/Training_Sessions/TAO_experiments:/workspace --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy /bin/bash

=======================
=== TAO Toolkit Deploy ===
=======================

NVIDIA Release 5.0.0-Deploy (build 52693241)
TAO Toolkit Version 5.0.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement

root@51688c9b47d0:/opt/nvidia# mask_rcnn inference_trt -i /workspace/tao_lleida_canopy/data/infer_samples -e /workspace/tao_lleida_canopy/mask_rcnn/specs/maskrcnn_retrain_resnet50.txt -m /workspace/tao_lleida_canopy/mask_rcnn/export/model.epoch-483.engine -o /workspace/tao_lleida_canopy/mask_rcnn/inference -l /workspace/tao_lleida_canopy/mask_rcnn/inference -t 0.5 --include_mask
2023-11-17 15:32:14,625 [INFO] matplotlib.font_manager: generated new fontManager
Loading uff directly from the package source code
usage: mask_rcnn [-h] [--gpu_index GPU_INDEX] [--log_file LOG_FILE] {evaluate,gen_trt_engine,inference} ...
mask_rcnn: error: invalid choice: 'inference_trt' (choose from 'evaluate', 'gen_trt_engine', 'inference')
root@51688c9b47d0:/opt/nvidia#
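
The two usage messages quoted in this thread list different subcommands, which is what the last error reflects. The sketch below records both lists and checks a subcommand before launching. The lists are copied from the output above; the keys "tf1" and "deploy" and the helper check_subcommand are hypothetical, and it is an assumption that the first usage message comes from the TF1 container. The flags each subcommand accepts are not covered here.

```python
# Subcommands as printed in the usage messages quoted in this thread.
SUBCOMMANDS = {
    "tf1": ["train", "prune", "inference_trt", "inference",
            "export", "evaluate", "dataset_convert"],
    "deploy": ["evaluate", "gen_trt_engine", "inference"],
}


def check_subcommand(container, subcommand):
    """Return (ok, choices) for a mask_rcnn subcommand in a container."""
    choices = SUBCOMMANDS[container]
    return subcommand in choices, choices


if __name__ == "__main__":
    ok, choices = check_subcommand("deploy", "inference_trt")
    if not ok:
        print("not available here; choose from:", ", ".join(choices))
```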