TAO MaskRCNN inference output problem

Running inference in place of inference_trt threw the following error:
mask_rcnn inference: error: invalid choice: '/workspace/tao_lleida_canopy/mask_rcnn/inference' (choose from 'evaluate', 'gen_trt_engine', 'inference')

Please update the command as shown below, and refer to Mask RCNN with TAO Deploy - NVIDIA Docs.

$ docker run --runtime=nvidia -it -v local_folder:/workspace --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy /bin/bash
Then, inside the container, run:

mask_rcnn inference  -i /workspace/tao-experiments/data/infer_samples -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_retrain_resnet50.txt -m /workspace/tao-experiments/mask_rcnn/export/model.epoch-483.engine -r  /workspace/tao-experiments/mask_rcnn/inference 
root@51688c9b47d0:/opt/nvidia# mask_rcnn inference  -i /workspace/tao_lleida_canopy/data/infer_samples -e /workspace/tao_lleida_canopy/mask_rcnn/specs/maskrcnn_retrain_resnet50.txt -m /workspace/tao_lleida_canopy/mask_rcnn/export/model.epoch-483.engine -r  /workspace/tao_lleida_canopy/mask_rcnn/inference
Loading uff directly from the package source code
usage: mask_rcnn inference [-h] [--gpu_index GPU_INDEX] [--log_file LOG_FILE] [-i IMAGE_DIR] -e EXPERIMENT_SPEC -m MODEL_PATH -r RESULTS_DIR -c CLASS_MAP [-t THRESHOLD]
                           {evaluate,gen_trt_engine,inference} ...
mask_rcnn inference: error: the following arguments are required: -c/--class_map

OK, we will update the user guide, which does not align with the notebook: https://github.com/NVIDIA/tao_tutorials/blob/95aca39c79cb9068593a6a9c3dcc7a509f4ad786/notebooks/tao_launcher_starter_kit/mask_rcnn/maskrcnn.ipynb.

Following the notebook, please run:

mask_rcnn inference  -i /workspace/tao-experiments/data/infer_samples -e /workspace/tao-experiments/mask_rcnn/specs/maskrcnn_retrain_resnet50.txt -m /workspace/tao-experiments/mask_rcnn/export/model.epoch-483.engine -r /workspace/tao-experiments/mask_rcnn/inference -c labels.txt -t 0.5

The labels.txt file is the class list. An example can be found at https://github.com/NVIDIA/tao_tutorials/blob/95aca39c79cb9068593a6a9c3dcc7a509f4ad786/notebooks/tao_launcher_starter_kit/mask_rcnn/specs/coco_labels.txt.
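For context, the class-map file is just plain text with one class name per line, in the same format as the linked coco_labels.txt. A hypothetical single-class example (the class name here is only illustrative):

$ cat labels.txt
canopy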

Hi Morgan,

We ran the fixed command as you noted, and got the same error:

root@51688c9b47d0:/opt/nvidia# mask_rcnn inference  -i /workspace/tao_lleida_canopy/data/infer_samples -e /workspace/tao_lleida_canopy/mask_rcnn/specs/maskrcnn_retrain_resnet50.txt -m /workspace/tao_lleida_canopy/mask_rcnn/export/model.epoch-483.engine -r  /workspace/tao_lleida_canopy/mask_rcnn/inference -c /workspace/tao_lleida_canopy/mask_rcnn/experiment_dir_retrain/labels.txt
Loading uff directly from the package source code
2023-11-19 13:57:13,446 [TAO Toolkit] [INFO] root 174: Starting mask_rcnn inference.
Producing predictions:   0%|                                                                                                                                                         | 0/1 [00:00<?, ?it/s][11/19/2023-13:57:16] [TRT] [E] 1: [reformat.cpp::executeCutensor::388] Error Code 1: CuTensor (Internal cuTensor permutate execute failed)
[11/19/2023-13:57:16] [TRT] [E] 1: [checkMacros.cpp::catchCudaError::202] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
Producing predictions:   0%|                                                                                                                                                         | 0/1 [00:00<?, ?it/s]
2023-11-19 13:57:16,136 [TAO Toolkit] [INFO] root 174: cuMemcpyDtoHAsync failed: an illegal memory access was encountered
Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_deploy/cv/mask_rcnn/scripts/inference.py>", line 3, in <module>
  File "<frozen cv.mask_rcnn.scripts.inference>", line 228, in <module>
  File "<frozen cv.common.decorators>", line 63, in _func
  File "<frozen cv.common.decorators>", line 48, in _func
  File "<frozen cv.mask_rcnn.scripts.inference>", line 117, in main
  File "<frozen cv.mask_rcnn.inferencer>", line 136, in infer
  File "<frozen inferencer.utils>", line 81, in do_inference
pycuda._driver.LogicError: cuMemcpyDtoHAsync failed: an illegal memory access was encountered
[11/19/2023-13:57:16] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::42] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[11/19/2023-13:57:16] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::42] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[11/19/2023-13:57:16] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaStream::47] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[11/19/2023-13:57:16] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[11/19/2023-13:57:16] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[11/19/2023-13:57:16] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
...

Question
Is the Docker image you asked us to test with (nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy) different from the one used in the TAO notebook (nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5)?

We also tried another approach: we cloned the GitHub repo (on another machine) and ran inference.py natively in a Python environment instead of inside the TAO Docker.
We successfully ran inference on a set of images that are known to break the Docker implementation, which leads us to believe there might be a problem with the Dockerfile and its package versions. Does it seem possible to you that there is a conflict between the TRT/CUDA/Docker/other requirements?

The nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy image is used to generate a TensorRT engine and to run inference/evaluation against that engine. In short, it is for deployment against a TensorRT engine. Its source code is at GitHub - NVIDIA/tao_deploy: Package for deploying deep learning models from TAO Toolkit, and its Dockerfile is at https://github.com/NVIDIA/tao_deploy/blob/main/docker/Dockerfile. The TRT version is 8.5.3.
The nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5 image is used to train/evaluate/prune/export/etc. Its source code is at GitHub - NVIDIA/tao_tensorflow1_backend: TAO Toolkit deep learning networks with TensorFlow 1.x backend, and its Dockerfile is at https://github.com/NVIDIA/tao_tensorflow1_backend/blob/main/docker/Dockerfile. The TRT version is also 8.5.3.

They are both from the TAO 5.0 release.
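If you want to double-check the TRT version on your side, a quick sanity check inside either container could look like this (just a sketch; it assumes the TensorRT Python bindings that ship in these images):

$ docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy \
    python3 -c "import tensorrt; print(tensorrt.__version__)"    # expect 8.5.3.x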

Could you try to run another experiment?

$ docker run --runtime=nvidia -it -v local_folder:/workspace --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy /bin/bash

Then, inside the Docker container, please use the “gen_trt_engine” command to generate a new TensorRT engine and run inference again.
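As a sketch, with placeholder paths (the flags are the same ones used in the TAO Deploy documentation and notebook):

mask_rcnn gen_trt_engine -m <retrained .uff model> \
                         --batch_size 1 \
                         --data_type fp16 \
                         --engine_file <path for the new .engine file> \
                         --results_dir <results directory>
mask_rcnn inference -i <input image directory> \
                    -e <experiment spec file> \
                    -m <path for the new .engine file> \
                    -r <inference results directory> \
                    -c <class map file>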

Please share the full log and the full command (including how you trigger it via “docker run”, etc.) with us.

(base) user@workstation:/mnt/Projects/ws/tao_segmentation$ docker run --runtime=nvidia -it -v /mnt/Sessions/TAO_experiments:/workspace --rm nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy /bin/bash

=======================
=== TAO Toolkit Deploy ===
=======================

NVIDIA Release 5.0.0-Deploy (build 52693241)
TAO Toolkit Version 5.0.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement

root@33aca6942ed7:/opt/nvidia# mask_rcnn gen_trt_engine -m /workspace/tao_lleida_canopy/mask_rcnn/experiment_dir_retrain/model.epoch-483.uff --batch_size 1 --data_type fp16 --engine_file /workspace/tao_lleida_canopy/mask_rcnn/export_deploy/model.epoch-483.engine --results_dir /workspace/tao_lleida_canopy/mask_rcnn/export_deploy
2023-11-21 08:51:01,566 [INFO] matplotlib.font_manager: generated new fontManager
Loading uff directly from the package source code
Loading uff directly from the package source code
2023-11-21 08:51:02,777 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.common.logging.status_logging 198: Log file already exists at /workspace/tao_lleida_canopy/mask_rcnn/export_deploy/status.json
2023-11-21 08:51:02,778 [TAO Toolkit] [INFO] root 174: Starting mask_rcnn gen_trt_engine.
[11/21/2023-08:51:02] [TRT] [I] [MemUsageChange] Init CUDA: CPU +3, GPU +0, now: CPU 43, GPU 624 (MiB)
[11/21/2023-08:51:04] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +342, GPU +76, now: CPU 439, GPU 700 (MiB)
2023-11-21 08:51:04,462 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.mask_rcnn.engine_builder 100: Parsing UFF model
[11/21/2023-08:51:04] [TRT] [W] The implicit batch dimension mode has been deprecated. Please create the network with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag whenever possible.
2023-11-21 08:51:05,353 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 143: TensorRT engine build configurations:
2023-11-21 08:51:05,353 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 156:  
2023-11-21 08:51:05,353 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 158:   BuilderFlag.FP16
2023-11-21 08:51:05,353 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 172:   BuilderFlag.TF32
2023-11-21 08:51:05,354 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 188:  
2023-11-21 08:51:05,354 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 190:   Note: max representabile value is 2,147,483,648 bytes or 2GB.
2023-11-21 08:51:05,354 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 192:   MemoryPoolType.WORKSPACE = 2147483648 bytes
2023-11-21 08:51:05,354 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 194:   MemoryPoolType.DLA_MANAGED_SRAM = 0 bytes
2023-11-21 08:51:05,354 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 196:   MemoryPoolType.DLA_LOCAL_DRAM = 1073741824 bytes
2023-11-21 08:51:05,354 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 198:   MemoryPoolType.DLA_GLOBAL_DRAM = 536870912 bytes
2023-11-21 08:51:05,354 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 200:  
2023-11-21 08:51:05,354 [TAO Toolkit] [INFO] nvidia_tao_deploy.engine.builder 208:   Tactic Sources = 31
[11/21/2023-08:51:05] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +10, now: CPU 704, GPU 710 (MiB)
[11/21/2023-08:51:05] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 705, GPU 720 (MiB)
[11/21/2023-08:51:05] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[11/21/2023-08:51:41] [TRT] [I] Some tactics do not have sufficient workspace memory to run. Increasing workspace size will enable more tactics, please check verbose output for requested sizes.
[11/21/2023-08:53:18] [TRT] [I] Total Activation Memory: 2936632320
[11/21/2023-08:53:18] [TRT] [I] Detected 1 inputs and 2 output network tensors.
[11/21/2023-08:53:18] [TRT] [I] Total Host Persistent Memory: 244000
[11/21/2023-08:53:18] [TRT] [I] Total Device Persistent Memory: 1697280
[11/21/2023-08:53:18] [TRT] [I] Total Scratch Memory: 53721600
[11/21/2023-08:53:18] [TRT] [I] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 125 MiB, GPU 1912 MiB
[11/21/2023-08:53:18] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 145 steps to complete.
[11/21/2023-08:53:18] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 19.2535ms to assign 20 blocks to 145 nodes requiring 157352960 bytes.
[11/21/2023-08:53:18] [TRT] [I] Total Activation Memory: 157352960
[11/21/2023-08:53:18] [TRT] [I] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1303, GPU 796 (MiB)
[11/21/2023-08:53:18] [TRT] [I] [MemUsageChange] Init cuDNN: CPU +1, GPU +10, now: CPU 1304, GPU 806 (MiB)
[11/21/2023-08:53:18] [TRT] [W] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[11/21/2023-08:53:18] [TRT] [W] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[11/21/2023-08:53:18] [TRT] [W] Check verbose logs for the list of affected weights.
[11/21/2023-08:53:18] [TRT] [W] - 100 weights are affected by this issue: Detected subnormal FP16 values.
[11/21/2023-08:53:18] [TRT] [W] - 23 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[11/21/2023-08:53:18] [TRT] [I] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +57, GPU +54, now: CPU 57, GPU 54 (MiB)
2023-11-21 08:53:18,507 [TAO Toolkit] [INFO] root 70: Export finished successfully.
2023-11-21 08:53:18,509 [TAO Toolkit] [INFO] root 174: Gen_trt_engine finished successfully.
2023-11-21 08:53:18,820 [INFO] nvidia_tao_deploy.cv.common.entrypoint.entrypoint_proto: Sending telemetry data.
2023-11-21 08:53:23,513 [INFO] nvidia_tao_deploy.cv.common.entrypoint.entrypoint_proto: Execution status: PASS
root@33aca6942ed7:/opt/nvidia# mask_rcnn inference  -i /workspace/tao_lleida_canopy/data/infer_samples -e /workspace/tao_lleida_canopy/mask_rcnn/specs/maskrcnn_retrain_resnet50.txt -m /workspace/tao_lleida_canopy/mask_rcnn/export_deploy/model.epoch-483.engine -r  /workspace/tao_lleida_canopy/mask_rcnn/inference -c /workspace/tao_lleida_canopy/mask_rcnn/experiment_dir_retrain/labels.txt
Loading uff directly from the package source code
2023-11-21 08:54:05,898 [TAO Toolkit] [INFO] nvidia_tao_deploy.cv.common.logging.status_logging 198: Log file already exists at /workspace/tao_lleida_canopy/mask_rcnn/inference/status.json
2023-11-21 08:54:05,898 [TAO Toolkit] [INFO] root 174: Starting mask_rcnn inference.
Producing predictions:   0%|                                                                                                                                                         | 0/1 [00:00<?, ?it/s][11/21/2023-08:54:06] [TRT] [E] 1: [reformat.cpp::executeCutensor::388] Error Code 1: CuTensor (Internal cuTensor permutate execute failed)
[11/21/2023-08:54:06] [TRT] [E] 1: [checkMacros.cpp::catchCudaError::202] Error Code 1: Cuda Runtime (an illegal memory access was encountered)
Producing predictions:   0%|                                                                                                                                                         | 0/1 [00:00<?, ?it/s]
2023-11-21 08:54:06,150 [TAO Toolkit] [INFO] root 174: cuMemcpyDtoHAsync failed: an illegal memory access was encountered
Traceback (most recent call last):
  File "</usr/local/lib/python3.8/dist-packages/nvidia_tao_deploy/cv/mask_rcnn/scripts/inference.py>", line 3, in <module>
  File "<frozen cv.mask_rcnn.scripts.inference>", line 228, in <module>
  File "<frozen cv.common.decorators>", line 63, in _func
  File "<frozen cv.common.decorators>", line 48, in _func
  File "<frozen cv.mask_rcnn.scripts.inference>", line 117, in main
  File "<frozen cv.mask_rcnn.inferencer>", line 136, in infer
  File "<frozen inferencer.utils>", line 81, in do_inference
pycuda._driver.LogicError: cuMemcpyDtoHAsync failed: an illegal memory access was encountered
[11/21/2023-08:54:06] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::42] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[11/21/2023-08:54:06] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::42] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[11/21/2023-08:54:06] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaStream::47] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[11/21/2023-08:54:06] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[11/21/2023-08:54:06] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[11/21/2023-08:54:06] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[11/21/2023-08:54:06] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
[11/21/2023-08:54:06] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (CUDA-capable device(s) is/are busy or unavailable)
...
[11/21/2023-08:54:07] [TRT] [E] /workspace/trt_oss_src/TensorRT/plugin/common/plugin.h (134) - Cuda Error in ~CudaBind: 46 (CUDA-capable device(s) is/are busy or unavailable)
terminate called after throwing an instance of 'nvinfer1::plugin::CudaError'
  what():  std::exception
2023-11-21 08:54:07,207 [INFO] nvidia_tao_deploy.cv.common.entrypoint.entrypoint_proto: Sending telemetry data.
2023-11-21 08:54:11,802 [WARNING] nvidia_tao_deploy.cv.common.entrypoint.entrypoint_proto: Execution status: FAIL

From the log, the TensorRT engine is generated successfully, but the above failure occurs during inference.
Did you ever build any TensorRT plugin and replace the default one?

Hello Morgan,

We have not built any plugins for TAO/TRT/CUDA/etc.
However, one thing we noticed is that the CUDA versions differ between the Docker implementation and the local system where we ran TAO natively:

  • Local system: CUDA 12.1
  • TAO Docker: CUDA 12.0

Also, the host system running the TAO Docker has CUDA 12.2 with NVIDIA driver 535.129.03.
Could the problem be due to a version mismatch?
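For reference, here is roughly how we checked the versions (treat it as a sketch; nvcc may or may not be present in a given image):

# On the host: reports the driver version and the CUDA version the driver supports
$ nvidia-smi
# Inside the container: CUDA toolkit version, if nvcc is installed in the image
$ nvcc --version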

Thanks for your time,
P

Is it possible to share the .uff model and the spec file (maskrcnn_retrain_resnet50.txt) with me? I can use them to try to reproduce the issue on my side. You can send the model via private message.

Hi, the model has been shared via DM.

Thanks. I will check further with your model.
BTW, please check on your side whether there is enough memory when you run inference. If it is not enough, please temporarily increase the swap memory on the Linux system. Refer to “Issue while converting maskrcnn model to trt from etlt on Laptops - #23 by alaapdhall79”.
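For reference, a typical way to check memory and temporarily add swap on Linux is sketched below (the 16G size is only an example; adjust it to your system):

$ free -h                                        # check current RAM and swap usage
$ sudo fallocate -l 16G /swapfile                # create a temporary 16 GB swap file
$ sudo chmod 600 /swapfile
$ sudo mkswap /swapfile
$ sudo swapon /swapfile
# ... run the inference experiment, then remove the temporary swap:
$ sudo swapoff /swapfile && sudo rm /swapfile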

We have 32 GB of RAM on the machine running inference; only a few specific images seem to break the process, and we have not noticed any extraordinary RAM usage.
Inference is run on the same workstation (not a laptop) and in the same session that trained the model.

Hi,
I ran inside the tao-deploy Docker, generated a TensorRT engine, and ran inference successfully.

$ docker run --runtime=nvidia -it --rm -v /home/morganh:/home/morganh nvcr.io/nvidia/tao/tao-toolkit:5.0.0-deploy /bin/bash
root@211289ed8c86:/home/morganh/demo_3.0/forum_repro/maskrcnn/forum_269577# mask_rcnn gen_trt_engine -m model.epoch-483.uff --batch_size 1 --data_type fp16 --engine_file model.epoch-483.engine --results_dir result
root@0260a59dc8fa:/home/morganh/demo_3.0/forum_repro/maskrcnn/forum_269577# mask_rcnn inference  -i /home/morganh/demo_3.0/forum_repro/maskrcnn/input -e maskrcnn_retrain_resnet50.txt -m model.epoch-483.engine -r  result_inference -c label.txt
2023-11-27 16:36:17,611 [INFO] matplotlib.font_manager: generated new fontManager
Loading uff directly from the package source code
2023-11-27 16:36:18,954 [TAO Toolkit] [INFO] root 174: Starting mask_rcnn inference.
[11/27/2023-16:36:20] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
[11/27/2023-16:36:20] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
Producing predictions: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.60it/s]
2023-11-27 16:36:21,261 [TAO Toolkit] [INFO] root 151: Finished inference.
2023-11-27 16:36:21,271 [TAO Toolkit] [INFO] root 174: Inference finished successfully.
2023-11-27 16:36:21,820 [INFO] nvidia_tao_deploy.cv.common.entrypoint.entrypoint_proto: Sending telemetry data.
2023-11-27 16:36:27,309 [INFO] nvidia_tao_deploy.cv.common.entrypoint.entrypoint_proto: Execution status: PASS
root@0260a59dc8fa:/home/morganh/demo_3.0/forum_repro/maskrcnn/forum_269577# ll -rlt result_inference/
total 20
drwxr-xr-x 4 root root 4096 Nov 27 16:36 ../
drwxr-xr-x 4 root root 4096 Nov 27 16:36 ./
drwxr-xr-x 2 root root 4096 Nov 27 16:36 images_annotated/
drwxr-xr-x 2 root root 4096 Nov 27 16:36 labels/
-rw-r--r-- 1 root root  261 Nov 27 16:36 status.json
root@0260a59dc8fa:/home/morganh/demo_3.0/forum_repro/maskrcnn/forum_269577# ll -rlt result_inference/labels/
total 12
drwxr-xr-x 4 root root 4096 Nov 27 16:36 ../
-rw-r--r-- 1 root root 3164 Nov 27 16:36 test_img.json
drwxr-xr-x 2 root root 4096 Nov 27 16:36 ./
root@0260a59dc8fa:/home/morganh/demo_3.0/forum_repro/maskrcnn/forum_269577# ll -rlt result_inference/images_annotated/
total 44
drwxr-xr-x 4 root root  4096 Nov 27 16:36 ../
drwxr-xr-x 2 root root  4096 Nov 27 16:36 ./
-rw-r--r-- 1 root root 35890 Nov 27 16:36 test_img.jpg
root@0260a59dc8fa:/home/morganh/demo_3.0/forum_repro/maskrcnn/forum_269577# ll -rlt result_inference/images_annotated/test_img.jpg
-rw-r--r-- 1 root root 35890 Nov 27 16:36 result_inference/images_annotated/test_img.jpg
root@0260a59dc8fa:/home/morganh/demo_3.0/forum_repro/maskrcnn/forum_269577

I recall you mentioned that “some images run through the inference process, others break it.”

Does this mean you can run inference well on some images but not on certain specific images? If so, could you share several of the images that are not inferenced well?

Hi Morgan,

Weirdly, over the last couple of days the problem seems to have sorted itself out. We have not changed anything (as far as I can remember), and while retraining the same model, inference on the same data now runs smoothly.

I would suggest keeping this topic open for a few more days, in case the problem returns.

Also, I’d like to add that the documentation on the inference task is still slightly misleading:

From the docs:

tao model mask_rcnn inference [-h] -i <input directory>
                             -o <output annotated image directory>
                             -e <experiment spec file>
                             -m <model file>
                             -k <key>
                             [-l <label file>]
                             [-t <bbox confidence threshold>]
                             [--include_mask]
                             [--gpu_index <gpu_index>]
                             [--log_file <log_file_path>]

What actually ran:

tao model mask_rcnn inference -i <input directory>
                                  -r <results directory>
                                  -e <experiment spec file>
                                  -m <model file>
                                  -c <label_file>
                                  -t <bbox confidence threshold>
                                  --include_mask

The -l and -o flags throw errors in our case.

Thanks a lot for the info.

OK, we will improve the document.