Issue while converting MaskRCNN model from .etlt to TensorRT engine on laptops

• Hardware: RTX 3070 Ti / RTX 3090 / RTX 3080 Ti / A6000 (tested on these four GPUs)
• Network Type: Mask_rcnn
• TLT Version: nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3

• How to reproduce the issue?

I am facing an issue with the conversion of the mask_rcnn .etlt model to the .engine file. I trained a model on a 3090 PC, then exported it and converted it to an engine file on that same machine without any problem.

After that, I transferred the exported model (.etlt) to my laptop and tried to convert it to an engine there. However, the conversion fails, and there are no logs that help me diagnose the issue.

I ran the same command with the same weights on two PCs (3090/A6000) and both converted the model successfully. When I ran the same command with the same weights on two laptops (3070 Ti/3080 Ti), the conversion failed on both.

I am attaching the command I use to convert and the logs below:

Command:

!tao converter -k nvidia_tlt  \
                   -d 3,832,1344 \
                   -o generate_detections,mask_fcn_logits/BiasAdd \
                   -e /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_retrain/export/trt.fp16.engine \
                   -t fp16 \
                   -i nchw \
                   -m 1 \
                   /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_retrain/model.step-32400.etlt

Logs on the PCs:

[INFO] [MemUsageChange] Init CUDA: CPU +536, GPU +0, now: CPU 542, GPU 1031 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 848 MiB, GPU 1031 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +750, GPU +318, now: CPU 1669, GPU 1349 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +618, GPU +268, now: CPU 2287, GPU 1617 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[INFO] Detected 1 inputs and 2 output network tensors.
[INFO] Total Host Persistent Memory: 248000
[INFO] Total Device Persistent Memory: 84687872
[INFO] Total Scratch Memory: 53721600
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 162 MiB, GPU 32 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3309, GPU 2113 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3309, GPU 2121 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 3309, GPU 2109 (MiB)
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 3308, GPU 2093 (MiB)
[INFO] [MemUsageSnapshot] Builder end: CPU 3237 MiB, GPU 2093 MiB
2022-05-26 19:21:20,637 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

Logs on the laptops:

[INFO] [MemUsageChange] Init CUDA: CPU +536, GPU +0, now: CPU 542, GPU 417 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 848 MiB, GPU 417 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +749, GPU +318, now: CPU 1669, GPU 735 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +618, GPU +268, now: CPU 2287, GPU 1003 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
2022-05-26 19:24:19,300 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

As shown, the logs don't really reveal anything. I tried this on different laptops and all attempts led to a similar result.

Things I have tried:

  • Made sure the -k key is correct and that the path to the .etlt model is correct and mapped properly.
  • Tried with the -s parameter (same result).
  • Adjusted the -w (max workspace size) parameter, but it made no difference.
  • Tried -v for a verbose log, but got an error that -v is not a recognized argument.

I need to convert this model on the laptops as well for inference; please help.

PS: I intentionally messed up the path to the .etlt model so I could be sure what a real error log looks like. If I mess up the .etlt path I get this:

Unsupported number of graph 0
[ERROR] Failed to parse the model, please check the encoding key to make sure it's correct
[ERROR] 4: [network.cpp::validate::2411] Error Code 4: Internal Error (Network must have at least one output)
[ERROR] Unable to create engine
2022-05-26 20:38:03,639 [INFO] tlt.components.docker_handler.docker_handler: Stopping container

But this does not happen with the correct .etlt path, so that is not the issue either. This feels like a weird problem.

For your laptops, could you download tao-converter to convert the .etlt model?
https://docs.nvidia.com/tao/tao-toolkit/text/tensorrt.html#installing-the-tao-converter
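
A rough outline of the standalone setup (the archive name and TensorRT paths below are assumptions for an x86 machine with TensorRT installed via apt; adjust them to the actual download and install locations):

# unpack the downloaded archive and make the binary executable
unzip tao-converter.zip -d tao-converter && cd tao-converter
chmod +x tao-converter
# point the converter at the local TensorRT libraries and headers
export TRT_LIB_PATH=/usr/lib/x86_64-linux-gnu
export TRT_INC_PATH=/usr/include/x86_64-linux-gnu
# sanity check: should print the usage text
./tao-converter -h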

I tried that. It took me a lot of time to get it working, but now it gives the following error.

[ERROR] UffParser: Validator error: pyramid_crop_and_resize_mask: Unsupported operation _MultilevelCropAndResize_TRT

I understand this is because libnvinfer_plugin does not contain this plugin op.

Things I have done:

  • I installed the pip wheel version of TensorRT (pip install nvidia-tensorrt==8.0.1.6).
    Then I built TensorRT OSS and replaced libnvinfer_plugin.so.8.0.1 inside the tensorrt package of my Python virtual environment.
    After that, I exported my TRT_LIB_PATH variables.

  • Then I got a libcrypto "no such file or directory" error.
    So I put this in my .bashrc, as libcrypto was inside this folder: export LD_LIBRARY_PATH=/usr/local/cuda-11.6/nsight-systems-2021.5.2/host-linux-x64/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

  • Then I got an onnxparser error, so I ran this command:
    sudo apt-get install libnvinfer8 libnvonnxparsers8 libnvparsers8 libnvinfer-plugin8

This resolved the onnxparser and libnvinfer parser errors, but I am still left with the "pyramid_crop_and_resize_mask: Unsupported operation _MultilevelCropAndResize_TRT" error.

As mentioned, I tried replacing libnvinfer_plugin.so.8.0.1, but it did not help. Building TensorRT 8.0.1.6 from source seems to be a pain.
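
A quick check to confirm which libnvinfer_plugin copy the converter actually loads at run time (the binary location and OSS build path below are placeholders for my setup):

# which plugin .so the dynamic linker resolves for the converter binary
ldd ./tao-converter | grep nvinfer_plugin
# all plugin copies the linker cache knows about
ldconfig -p | grep libnvinfer_plugin
# compare the installed copy against the freshly built OSS one
md5sum /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.8.0.1 ~/TensorRT/build/out/libnvinfer_plugin.so.8.0.1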

Other relevant info:

OS: Ubuntu 22.04
CUDA: 11.6.2
GPU_ARCH: 86 (3070 Ti)

Please follow the MaskRCNN — TAO Toolkit 3.22.05 documentation or deepstream_tao_apps/TRT-OSS/x86 at master · NVIDIA-AI-IOT/deepstream_tao_apps · GitHub to build a new libnvinfer_plugin.so, then replace the old one with it.

I have done exactly that. It even successfully generated the libnvinfer_plugin file, and yet I faced the same issue.
I replaced the library and ran sudo ldconfig.
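
For reference, the replacement itself was roughly the following (the 8.0.1 version suffix and the OSS build output path are assumptions based on my setup):

# back up the stock plugin shipped with the TensorRT packages
sudo cp /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.8.0.1 \
        /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.8.0.1.bak
# drop in the freshly built OSS plugin
sudo cp ~/TensorRT/build/out/libnvinfer_plugin.so.8.0.1 \
        /usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so.8.0.1
# refresh the dynamic linker cache
sudo ldconfig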

All the steps I followed are mentioned here

Actually, the user guide does not mention some of the steps you mentioned, for example "pip install nvidia-tensorrt==8.0.1.6". Could you double-check?

Alright, that step was only there because I had installed the pip version of TensorRT. However, I have now purged everything to start from scratch.

After the fresh installation, I ran the following command as mentioned in the docs:

/usr/local/bin/cmake .. -DGPU_ARCHS=86 -DTRT_LIB_DIR=/usr/lib/x86_64-linux-gnu/ -DCMAKE_C_COMPILER=/usr/bin/gcc -DTRT_BIN_DIR=`pwd`/out

and encountered this error:

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
CUDNN_LIB
    linked by target "nvinfer_plugin" in directory /home/alaap/TensorRT/plugin
    linked by target "sample_algorithm_selector" in directory /home/alaap/TensorRT/samples/sampleAlgorithmSelector
    linked by target "sample_char_rnn" in directory /home/alaap/TensorRT/samples/sampleCharRNN
    linked by target "sample_dynamic_reshape" in directory /home/alaap/TensorRT/samples/sampleDynamicReshape
    linked by target "sample_fasterRCNN" in directory /home/alaap/TensorRT/samples/sampleFasterRCNN
    linked by target "sample_googlenet" in directory /home/alaap/TensorRT/samples/sampleGoogleNet
    linked by target "sample_int8" in directory /home/alaap/TensorRT/samples/sampleINT8
    linked by target "sample_int8_api" in directory /home/alaap/TensorRT/samples/sampleINT8API
    linked by target "sample_mlp" in directory /home/alaap/TensorRT/samples/sampleMLP
    linked by target "sample_mnist" in directory /home/alaap/TensorRT/samples/sampleMNIST
    linked by target "sample_mnist_api" in directory /home/alaap/TensorRT/samples/sampleMNISTAPI
    linked by target "sample_nmt" in directory /home/alaap/TensorRT/samples/sampleNMT
    linked by target "sample_onnx_mnist" in directory /home/alaap/TensorRT/samples/sampleOnnxMNIST
    linked by target "sample_reformat_free_io" in directory /home/alaap/TensorRT/samples/sampleReformatFreeIO
    linked by target "sample_ssd" in directory /home/alaap/TensorRT/samples/sampleSSD
    linked by target "sample_uff_fasterRCNN" in directory /home/alaap/TensorRT/samples/sampleUffFasterRCNN
    linked by target "sample_uff_maskRCNN" in directory /home/alaap/TensorRT/samples/sampleUffMaskRCNN
    linked by target "sample_uff_mnist" in directory /home/alaap/TensorRT/samples/sampleUffMNIST
    linked by target "sample_uff_plugin_v2_ext" in directory /home/alaap/TensorRT/samples/sampleUffPluginV2Ext
    linked by target "sample_uff_ssd" in directory /home/alaap/TensorRT/samples/sampleUffSSD
    linked by target "sample_onnx_mnist_coord_conv_ac" in directory /home/alaap/TensorRT/samples/sampleOnnxMnistCoordConvAC
    linked by target "trtexec" in directory /home/alaap/TensorRT/samples/trtexec
TENSORRT_LIBRARY_INFER
    linked by target "nvonnxparser_static" in directory /home/alaap/TensorRT/parsers/onnx
    linked by target "nvonnxparser" in directory /home/alaap/TensorRT/parsers/onnx
TENSORRT_LIBRARY_INFER_PLUGIN
    linked by target "nvonnxparser_static" in directory /home/alaap/TensorRT/parsers/onnx
    linked by target "nvonnxparser" in directory /home/alaap/TensorRT/parsers/onnx

-- Configuring incomplete, errors occurred!
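
One possible way around the NOTFOUND variables, sketched under the assumption that cuDNN and TensorRT were installed via apt into /usr/lib/x86_64-linux-gnu (adjust the paths to the actual install locations), is to pass them to cmake explicitly:

/usr/local/bin/cmake .. -DGPU_ARCHS=86 \
    -DTRT_LIB_DIR=/usr/lib/x86_64-linux-gnu/ \
    -DTRT_BIN_DIR=`pwd`/out \
    -DCMAKE_C_COMPILER=/usr/bin/gcc \
    -DCUDNN_LIB=/usr/lib/x86_64-linux-gnu/libcudnn.so \
    -DTENSORRT_LIBRARY_INFER=/usr/lib/x86_64-linux-gnu/libnvinfer.so \
    -DTENSORRT_LIBRARY_INFER_PLUGIN=/usr/lib/x86_64-linux-gnu/libnvinfer_plugin.so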

Follow-up on this: since I was not able to install TensorRT properly on my local machine to actually run tao-converter, I used a Docker image instead.
The image had Ubuntu 20.04, CUDA 11.4, and TensorRT 8.0.1.

I successfully built TRT OSS nvinfer_plugin and then ran the following command:

./tao-converter -k nvidia_tlt -d 3,832,1344 -o generate_detections,mask_fcn_logits/BiasAdd -e export/trt_newpep.fp16.engine -m 1 -t fp16 -i nchw model.step-32400.etlt

The output is still the same.

[INFO] [MemUsageChange] Init CUDA: CPU +534, GPU +0, now: CPU 540, GPU 417 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 848 MiB, GPU 417 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +749, GPU +318, now: CPU 1668, GPU 735 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +191, GPU +324, now: CPU 1859, GPU 1059 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.

It stops after this, much like the tao docker. No output file is generated.

EDIT:
I managed to run everything on my local machine as well.

I ran the same command and here is the output:

[INFO] [MemUsageChange] Init CUDA: CPU +533, GPU +0, now: CPU 540, GPU 417 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 848 MiB, GPU 417 MiB
[WARNING] TensorRT was linked against cuBLAS/cuBLAS LT 11.5.1 but loaded cuBLAS/cuBLAS LT 110.9.2
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +871, GPU +378, now: CPU 1791, GPU 795 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +127, GPU +60, now: CPU 1918, GPU 855 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead

But weirdly enough, my terminal crashes after 2-3 minutes and no output is generated. The same thing happened in Docker as well: the container crashed and no output was generated. The logs show nothing either.
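
One generic Linux check when a process dies silently like this (not TAO-specific) is whether the kernel's OOM killer terminated it, and how much free host RAM is left while the conversion runs:

# look for out-of-memory kills around the time of the crash
sudo dmesg -T | grep -i -E "out of memory|oom-killer|killed process"
# in a second terminal, watch host RAM usage while the conversion runs
watch -n 1 free -h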

Revisiting the description from the very beginning: everything runs well on the two PCs (3090/A6000) but fails on the two laptops (3070 Ti/3080 Ti). Are the laptops running WSL?
Is there enough GPU memory?

Yes, it works fine on both PCs but fails on the laptops without any logs. When I run ./tao-converter, it crashes the terminal itself, with only the logs mentioned above.

Here is a detailed description of all the machines I tried on.

PCs:

  1. RTX 3090 (24 GB VRAM) running Ubuntu 20.04 and CUDA 11.6
  2. A6000 (48 GB VRAM) running Ubuntu 20.04 and CUDA 11.6

Laptops:

  1. RTX 3070 Ti (8 GB VRAM) running Ubuntu 22.04 and CUDA 11.6
  2. RTX 3080 Ti (16 GB VRAM) running Ubuntu 20.04 and CUDA 11.6

The TAO version is the same across all platforms.

No, they run clean Ubuntu (no WSL).

8 GB and 16 GB of GPU memory.

Hopefully, that's enough.

OK, the laptops' GPU memory is much smaller.

You can try another experiment to check whether the TRT engine can be generated.
Open a terminal and log in to the tao docker:
$ tao mask_rcnn run /bin/bash

Then, inside the docker, generate the TRT engine:
# converter -k nvidia_tlt -d xxx …

It is smaller, but I think it is enough. I have converted models in the past on a 2070 PC with 8 GB of GPU memory. Moreover, this conversion only uses about 2.5 GB of VRAM.

I tried that and ran:

converter -k nvidia_tlt -d 3,832,1344 -o generate_detections,mask_fcn_logits/BiasAdd -e /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_retrain/export/trt_newpep.fp16.engine -t fp16 -i nchw -m 1 /workspace/tao-experiments/mask_rcnn/experiments/experiment_dir_retrain/model.step-32400.etlt

Here are the logs. The output is similar: the docker crashes and exits, no engine is generated, and the logs say nothing.

[INFO] [MemUsageChange] Init CUDA: CPU +536, GPU +0, now: CPU 542, GPU 417 (MiB)
[INFO] [MemUsageSnapshot] Builder begin: CPU 848 MiB, GPU 417 MiB
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +749, GPU +318, now: CPU 1669, GPU 735 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +619, GPU +268, now: CPU 2288, GPU 1003 (MiB)
[WARNING] Detected invalid timing cache, setup a local cache instead
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
2022-05-31 22:30:53,402 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

If possible, please try with more GPU memory.

I tried with a laptop that has 16 GB of GPU memory, and laptops don't come with more GPU memory than that. The conversion isn't even using 2 GB of it. Moreover, I have converted this model on a 2070 PC that had 8 GB of GPU memory and it was successful there as well. Thus, GPU memory does not seem to be the constraint.

Also, I tried a different MaskRCNN .etlt weight file, just to be sure the .etlt itself isn't broken.

Could you please check whether there are any differences between the laptops and the PCs?

  • NVIDIA driver
  • CUDA/TensorRT/cuDNN versions
  • etc.

Everything is consistent across the machines.

The NVIDIA driver on all machines is 510.47.
The CUDA version is 11.6 across all machines (cuda_11.6.r11.6).
The TensorRT version is also consistent at 8.0.1.6 across all machines.

The TAO version is also consistent, as mentioned.

I have tried both the tao docker and the ./tao-converter binary built as advised.

Could you please try another official MaskRCNN .etlt model on your laptops?
Please download the model from PeopleSegNet | NVIDIA NGC.
Please note that the resolution is 960 x 576. The NGC key is nvidia_tlt.

Please check the laptop's CPU memory (RAM). For mask_rcnn, the fp16 conversion's RAM usage peaks at around 80 GB.
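
To verify that peak on a given machine, one option (assuming GNU time is installed as /usr/bin/time; the conversion command is the same one used earlier) is:

# GNU time prints "Maximum resident set size" (peak host RAM, in kilobytes) when the process exits
/usr/bin/time -v ./tao-converter -k nvidia_tlt -d 3,832,1344 \
    -o generate_detections,mask_fcn_logits/BiasAdd \
    -e export/trt.fp16.engine -t fp16 -i nchw -m 1 \
    model.step-32400.etlt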

Same thing: this one also does not convert, and there are no logs.

It is 16 GB and 32 GB. I don't think laptops have 80 GB of RAM anywhere; they max out at 32 GB or 64 GB at best. Is it possible to convert an .etlt model to an engine file on a laptop for deployment? This seems like a pretty standard use case, since engine files are meant to be deployed on Jetson devices or, in some cases, laptops.

Please let me know if there is any possible way to convert an .etlt to an engine on a laptop.

So, the laptops cannot meet the CPU memory requirement for the mask_rcnn engine conversion.

One more experiment: please add "-s" to the command line. We found that with "-s", in fp16 mode, the RAM usage peaks at around 40 GB.

We also found that with "-s", in int8 mode, the RAM usage peaks at around 4 GB.

So, there are two workarounds here:

  1. For the laptops, use "-s" and int8 mode (a hedged example command is sketched after this list).
  2. For the laptops, if the CUDA/cuDNN/TensorRT versions are all the same as on the PCs, you can directly copy the .engine file generated on the PCs.
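
A sketch of workaround 1 applied to the original command (hedged: "-s" enables TensorRT strict type constraints, and int8 mode additionally needs the calibration cache produced during export; the cal.bin name and the relative paths below are placeholders):

./tao-converter -k nvidia_tlt \
               -d 3,832,1344 \
               -o generate_detections,mask_fcn_logits/BiasAdd \
               -e export/trt.int8.engine \
               -t int8 \
               -c export/cal.bin \
               -s \
               -i nchw \
               -m 1 \
               model.step-32400.etlt

If no calibration cache was generated during export, workaround 2 is the simpler option.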