TAO MaskRCNN inference output problem

user101417 · October 16, 2023, 9:35am

Please provide the following information when requesting support.

• Network Type (Mask_rcnn)
• TLT Version: 5.0.0-deploy
• How to reproduce the issue ?

I am trying to run inference on a MaskRCNN task and extract the COCO annotations in txt/json/whatever format.
The documentation lists an --out_label_path flag for specifying the output path, but that flag is not available in our installation:

usage: mask_rcnn inference [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS] [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp] [--log_file LOG_FILE] -m MODEL_PATH -i IMAGE_DIR [-k KEY]
                           [-c CLASS_MAP] [-t THRESHOLD] [--include_mask] -e EXPERIMENT_SPEC [-r RESULTS_DIR]
                           {train,prune,inference_trt,inference,export,evaluate,dataset_convert} ...

optional arguments:
  -h, --help            show this help message and exit
  --num_processes NUM_PROCESSES, -np NUM_PROCESSES
                        The number of horovod child processes to be spawned. Default is -1(equal to --gpus).
  --gpus GPUS           The number of GPUs to be used for the job.
  --gpu_index GPU_INDEX [GPU_INDEX ...]
                        The indices of the GPU's to be used.
  --use_amp             Flag to enable Auto Mixed Precision.
  --log_file LOG_FILE   Path to the output log file.
  -m MODEL_PATH, --model_path MODEL_PATH
                        Path to a MaskRCNN model.
  -i IMAGE_DIR, --image_dir IMAGE_DIR
                        Path to the input image directory.
  -k KEY, --key KEY     Encryption key.
  -c CLASS_MAP, --class_map CLASS_MAP
                        Path to the label file.
  -t THRESHOLD, --threshold THRESHOLD
                        Bbox confidence threshold.
  --include_mask        Whether to draw masks.
  -e EXPERIMENT_SPEC, --experiment_spec EXPERIMENT_SPEC
                        Path to spec file. Absolute path or relative to working directory. If not specified, default spec from spec_loader.py is used.
  -r RESULTS_DIR, --results_dir RESULTS_DIR
                        Output directory where the status log is saved.

I checked the versions and TAO is at version 5.0.0. I am using the exact command as specified in the Getting Started notebook:

tao model mask_rcnn inference -i $DATA_DOWNLOAD_DIR/infer_samples \
                        -e $SPECS_DIR/maskrcnn_retrain_resnet50.txt \
                        -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/model.epoch-$NUM_EPOCH.tlt \
                        -c $SPECS_DIR/coco_labels.txt \
                        -r $INFERENCE_OUTPUT_DATA_DIR \
                        -t 0.5 \
                        --include_mask

The images are exported correctly, but that is not a usable format for our purposes.
Any help is appreciated.

Best,
PA

Morganh · October 16, 2023, 4:23pm

The output label folder will be auto generated according to https://github.com/NVIDIA/tao_tensorflow1_backend/blob/c7a3926ddddf3911842e057620bceb45bb5303cc/nvidia_tao_tf1/cv/mask_rcnn/scripts/inference.py#L321.

user101417 · October 17, 2023, 2:08pm

First of all, thank you for the quick reply.

According to this, either infer() or infer_trt() runs, depending on whether the model is a TLT model or a TRT engine.

infer_trt() does indeed create an out_label_path, which I assume stores the output labels.

However, infer() does not seem to create such an output folder. Could you please check if my finding is correct?

I am trying to run inference on TLT models, and would like to avoid having to compile them for TRT.

Kind regards,
PA

user101417 · October 17, 2023, 2:32pm

I followed the method calls all the way to evaluation.py:infer() and, from what I'm seeing, label txt files are only written for KITTI labels (line 315), whereas images with drawn annotations are always saved (line 311).

Morganh · October 18, 2023, 8:33am

Yes, correct. This flag is only supported with the TensorRT engine.
Refer to MaskRCNN - NVIDIA Docs

-l, --out_label_path: The directory for predicted labels in COCO format. This argument is only supported with the TensorRT engine.

Morganh · October 18, 2023, 8:37am

There are at least two ways to generate a TensorRT engine:

1. Using trtexec. Refer to TRTEXEC with Mask RCNN - NVIDIA Docs.
2. Using the tao deploy docker: run gen_trt_engine. Refer to Mask RCNN with TAO Deploy - NVIDIA Docs.
   Source code: https://github.com/NVIDIA/tao_deploy/tree/main/nvidia_tao_deploy/cv/mask_rcnn/scripts

user101417 · October 18, 2023, 1:42pm

So it is not possible to export the labels when using a TLT model, but only the labeled images?
This seems like an oversight. When troubleshooting the model, it would be really helpful to run inference and export the labels without having to wait for a TRT engine to compile.

For now, I guess I will have to export to TRT, but that seems like an unnecessary step. Am I missing something here?

Thanks,
PA

Morganh · October 23, 2023, 3:43am

Hi,
Actually, when running inference against a .tlt model, the output images are already annotated with bounding boxes, which can help with troubleshooting.
You can also modify https://github.com/NVIDIA/tao_tensorflow1_backend/blob/c7a3926ddddf3911842e057620bceb45bb5303cc/nvidia_tao_tf1/cv/mask_rcnn/utils/evaluation.py#L313-L327 to get the label files. Steps: log in to the docker container and modify /usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/utils/evaluation.py, then run the inference command inside the container.
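
As a rough illustration of the kind of change such an edit to evaluation.py might involve, the sketch below turns one image's detections into COCO-style result entries and writes them to JSON. It is not code from the TAO repository: the function names, the [x1, y1, x2, y2] pixel box layout, and the output file name are all assumptions, and masks are left out because they would additionally need RLE encoding.

```python
import json
import os


def detections_to_coco(image_id, boxes, scores, classes, threshold=0.5):
    """Convert one image's detections into COCO-style result dicts.

    Assumption: boxes are [x1, y1, x2, y2] in pixels. Adjust the unpacking
    if the arrays available in evaluation.py use a different layout.
    """
    results = []
    for box, score, cls in zip(boxes, scores, classes):
        if score < threshold:
            continue  # drop low-confidence detections
        x1, y1, x2, y2 = (float(v) for v in box)
        results.append({
            "image_id": image_id,
            "category_id": int(cls),
            # COCO stores boxes as [x, y, width, height]
            "bbox": [x1, y1, x2 - x1, y2 - y1],
            "score": float(score),
        })
    return results


def write_coco_labels(results, out_dir, name="predictions.json"):
    """Write the collected result dicts to out_dir/name and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, name)
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
    return path
```

Collecting the entries for all images into one list and calling write_coco_labels once at the end gives a single file in the usual COCO detection-results layout.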

user101417 · October 23, 2023, 12:56pm

Hi, thank you for the assistance.
We are running it for now after compiling to TRT, but we are getting two errors.

1. If we run inference()

tao model mask_rcnn inference -i $DATA_DOWNLOAD_DIR/infer_samples \
                            -e $SPECS_DIR/maskrcnn_retrain_resnet50.txt \
                            -m $USER_EXPERIMENT_DIR/export/model.epoch-$NUM_EPOCH.engine \
                            -o $INFERENCE_OUTPUT_DATA_DIR \
                            -t 0.5 \
                            --include_mask
                            #-l $INFERENCE_OUTPUT_DATA_DIR \

we get an error that our flags are incorrect:

usage: mask_rcnn inference [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS] [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp] [--log_file LOG_FILE] -m MODEL_PATH -i IMAGE_DIR [-k KEY]
                           [-c CLASS_MAP] [-t THRESHOLD] [--include_mask] -e EXPERIMENT_SPEC [-r RESULTS_DIR]
                           {train,prune,inference_trt,inference,export,evaluate,dataset_convert} ...
mask_rcnn inference: error: argument /tasks: invalid choice: '/workspace/tao-experiments/mask_rcnn/inference' (choose from 'train', 'prune', 'inference_trt', 'inference', 'export', 'evaluate', 'dataset_convert')

2. If we run inference_trt()

tao model mask_rcnn inference_trt -i $DATA_DOWNLOAD_DIR/infer_samples \
                            -e $SPECS_DIR/maskrcnn_retrain_resnet50.txt \
                            -m $USER_EXPERIMENT_DIR/export/model.epoch-$NUM_EPOCH.engine \
                            -o $INFERENCE_OUTPUT_DATA_DIR \
                            -t 0.5 \
                            --include_mask
                            #-l $INFERENCE_OUTPUT_DATA_DIR \

we get the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/inference_trt.py", line 416, in <module>
    main()
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/inference_trt.py", line 409, in main
    inferencer.infer(arguments.in_image_path,
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/inference_trt.py", line 318, in infer
    self._inference_folder(img_in_path, img_out_path, label_out_path, draw_mask)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/inference_trt.py", line 288, in _inference_folder
    y_pred_decoded = self._predict_batch(inf_inputs)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/mask_rcnn/scripts/inference_trt.py", line 208, in _predict_batch
    y_pred = self.pred_fn(np.array(inf_inputs))
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/inferencer/trt_inferencer.py", line 126, in infer_batch
    results = do_inference(
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/inferencer/engine.py", line 45, in do_inference
    stream.synchronize()
pycuda._driver.LogicError: cuStreamSynchronize failed: an illegal memory access was encountered
[10/23/2023-12:41:58] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::42] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[10/23/2023-12:41:58] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::42] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[10/23/2023-12:41:58] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaStream::47] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
- - - x10 - - -
[10/23/2023-12:41:59] [TRT] [E] /workspace/trt_oss_src/TensorRT/plugin/common/plugin.h (134) - Cuda Error in ~CudaBind: 46 (CUDA-capable device(s) is/are busy or unavailable)
terminate called after throwing an instance of 'nvinfer1::plugin::CudaError'
  what():  std::exception
Execution status: FAIL
2023-10-23 15:42:04,895 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.

Morganh · October 23, 2023, 4:29pm

Can you open a new terminal and run the commands below to double check?
$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 /bin/bash
Then, inside the docker:
# mask_rcnn inference xxx
# mask_rcnn inference_trt xxx

user101417 · October 24, 2023, 12:15pm

Hi Morgan,

I tried it and it worked for most images. I got the same problem only for a specific subset of images that are probably corrupt in some way. There is no clear indication of what is wrong: they open fine with the Linux image viewer and OpenCV, and TAO gives no clear error.

Thank you for all your help, our pipeline is working. However, it would be much appreciated if the documentation page were updated with correct info on the inference() and inference_trt() functions and their arguments. It has been a confusing solution to a simple problem.

Kind regards,
PA

Morganh · October 25, 2023, 8:24am

May I know what is different about these specific images? Are they of higher resolution, or different in some other way?

Got it. We will improve the documentation. Thanks for catching this.

user101417 · October 25, 2023, 9:11am

We are still looking into it.
All images in the set come from a single video, so they are similar in every way. We used FFMPEG and OpenCV to resplit the video, and we still have the same problem. We used OpenCV, ImageMagick and the Linux image viewer to check the images for corruption or other differences, and found none.

We are also thinking about the corner case where there may be predicted masks that overflow from the image’s borders.

Morganh · October 25, 2023, 9:20am

Please set a larger max_num_instances and retry.
Default is max_num_instances: 200.
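
If the spec file needs to be changed for several experiments, a small helper can do the substitution. This is only a sketch: the function name is made up for illustration, the data_config block in the example string is an assumed placement, and the regex simply rewrites every `max_num_instances: N` entry it finds in the spec text.

```python
import re


def set_max_num_instances(spec_text, value):
    """Return spec_text with every `max_num_instances: N` set to `value`."""
    pattern = re.compile(r"(max_num_instances\s*:\s*)\d+")
    new_text, count = pattern.subn(lambda m: m.group(1) + str(value), spec_text)
    if count == 0:
        raise ValueError("max_num_instances not found in spec")
    return new_text


# Illustrative spec fragment; the enclosing block name is an assumption.
example = """data_config {
    max_num_instances: 200
}
"""
print(set_max_num_instances(example, 400))
```

Reading the real spec with open(...).read(), passing it through the helper, and writing the result to a new file keeps the original spec untouched.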

user101417 · November 2, 2023, 10:12am

Thank you,

We will be able to check this out hopefully within the next couple of weeks.

user101417 · November 15, 2023, 8:17am

Hello,

We tried to rerun with a different dataset (same pipeline source), set max_num_instances: 400 and ran into the same exact problem: some images run through the inference process, others break it.
We are looking into it and have seen that there is at least another post with the same problem, but no solution. Inference with TAO is important to us, so we will keep exploring to find any solution, but would also appreciate any help from here.

Thank you

Morganh · November 15, 2023, 8:21am

Could you please upload the latest full log? Thanks.

user101417 · November 16, 2023, 8:54am

Hi, attached you will find the logfile.
Inference attempt on a single image; with multiple images it will run until it breaks at some point.
inference_trt.log (27.8 KB)
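
One way to narrow down which images trigger the crash is to run the inference command once per image and record the ones that exit with an error. The sketch below assumes nothing about TAO itself: find_failing_images is a hypothetical helper, and the caller supplies make_command, which returns the argument list to run for a directory containing a single image. The demo at the bottom uses a stand-in command, not a real inference call.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def find_failing_images(image_dir, make_command, extensions=(".jpg", ".jpeg", ".png")):
    """Run a command once per image and return the names that make it fail.

    make_command(single_image_dir) must return the argument list to execute.
    """
    failing = []
    for image in sorted(Path(image_dir).iterdir()):
        if image.suffix.lower() not in extensions:
            continue  # skip non-image files
        with tempfile.TemporaryDirectory() as tmp:
            shutil.copy(image, tmp)  # directory holding only this image
            result = subprocess.run(make_command(tmp), capture_output=True)
            if result.returncode != 0:
                failing.append(image.name)
    return failing


if __name__ == "__main__":
    # Stand-in command for demonstration only: it fails when the directory
    # contains a file whose name starts with "bad". Replace it with the
    # actual inference invocation, pointing its input directory at `d`.
    check = (
        "import os, sys; "
        "sys.exit(1 if any(n.startswith('bad') for n in os.listdir(sys.argv[1])) else 0)"
    )
    with tempfile.TemporaryDirectory() as demo:
        for name in ("ok_1.png", "bad_2.png"):
            Path(demo, name).write_bytes(b"")
        print(find_failing_images(demo, lambda d: [sys.executable, "-c", check, d]))
```

Running each image in its own process also means one crash does not hide the results for the remaining images.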

Morganh · November 17, 2023, 7:40am

Could you run the commands below and share the result?

$ nvidia-smi

$ docker run --runtime=nvidia -it -v local_folder:/workspace --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy /bin/bash
then,

mask_rcnn inference_trt -i /workspace/tao-experiments/data/infer_samples -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_retrain_resnet50.txt -m /workspace/tao-experiments/mask_rcnn/export/model.epoch-483.engine -o /workspace/tao-experiments/mask_rcnn/inference -l /workspace/tao-experiments/mask_rcnn/inference -t 0.5 --include_mask

user101417 · November 17, 2023, 3:33pm

(env_tao) minibeast@miniBeast:/mnt/NVME_DATA/env_Projects/viewer_ws/viewer_tao_segmentation$ nvidia-smi
Fri Nov 17 17:29:37 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:01:00.0  On |                  N/A |
|  0%   39C    P8              15W / 300W |    279MiB / 11264MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1248      G   /usr/lib/xorg/Xorg                           35MiB |
|    0   N/A  N/A      2239      G   /usr/lib/xorg/Xorg                          112MiB |
|    0   N/A  N/A      2435      G   /opt/teamviewer/tv_bin/TeamViewer             2MiB |
|    0   N/A  N/A      2459      G   /usr/bin/gnome-shell                         79MiB |
|    0   N/A  N/A     32719      G   ...sion,SpareRendererForSitePerProcess       37MiB |
(env_tao) minibeast@miniBeast:/mnt/NVME_DATA/env_Projects/viewer_ws/viewer_tao_segmentation$ docker run --runtime=nvidia -it -v /mnt/NVME_DATA/Training_Sessions/TAO_experiments:/workspace --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy /bin/bash

=======================
=== TAO Toolkit Deploy ===
=======================

NVIDIA Release 5.0.0-Deploy (build 52693241)
TAO Toolkit Version 5.0.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement

root@51688c9b47d0:/opt/nvidia# mask_rcnn inference_trt -i /workspace/tao_lleida_canopy/data/infer_samples -e /workspace/tao_lleida_canopy/mask_rcnn/specs/maskrcnn_retrain_resnet50.txt -m /workspace/tao_lleida_canopy/mask_rcnn/export/model.epoch-483.engine -o /workspace/tao_lleida_canopy/mask_rcnn/inference -l /workspace/tao_lleida_canopy/mask_rcnn/inference -t 0.5 --include_mask
2023-11-17 15:32:14,625 [INFO] matplotlib.font_manager: generated new fontManager
Loading uff directly from the package source code
usage: mask_rcnn [-h] [--gpu_index GPU_INDEX] [--log_file LOG_FILE] {evaluate,gen_trt_engine,inference} ...
mask_rcnn: error: invalid choice: 'inference_trt' (choose from 'evaluate', 'gen_trt_engine', 'inference')
root@51688c9b47d0:/opt/nvidia#
