TensorRT: small model with high RAM consumption during inference

Hello,

We are facing a problem with the conversion of one of our ONNX models.

The models are converted from ONNX to TensorRT by parsing the ONNX file and serializing the result as a TRT engine; during this process we also convert the weights from FP32 to FP16. The converted engine is about 4 MB on disk but consumes about 1.1 GB of system RAM (CPU+GPU) when it is loaded. This is very strange, because the same model running on a Jetson TX2 device only requires about 70 MB, and the ONNX model and conversion script are identical on both devices.
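For reference, the conversion is essentially the standard parse-and-build flow; a minimal Python sketch of what our script does (file names and logger level are placeholders, and this is a simplification of the real script):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def convert(onnx_path: str, engine_path: str) -> None:
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX file into the TensorRT network definition.
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parsing failed")

    # Build an FP16 engine and serialize it to disk.
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)
    plan = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(plan)

convert("model.onnx", "model.trt")
```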

The environment we are using for Jetson Orin Nano is:

  • L4T: 35.5.0
  • JetPack: 5.1.3
  • CUDA: 11.4.19
  • TensorRT: 8.5.2

So far, we have done the following tests:

  1. Try using a newer JetPack

We have tested a newer JetPack using the nvcr.io/nvidia/l4t-ml:r36.2.0-py3 Docker image on the Orin Nano. With the same ONNX->TRT conversion, the loaded engine's RAM footprint is about 15 MB, even better than on the TX2 device and almost two orders of magnitude less than on the native JetPack. Unfortunately we cannot update to this JetPack version at this time; we must continue using 5.1.3.

  2. Test different sets of tactics

Since model loading prints a message like “INFO: [MemUsageChange] Init cuDNN: CPU +0, GPU +18, now: CPU 1073, GPU 2673 (MiB)”, we suspected cuDNN and/or cuBLAS initialization. We have tried different sets of tactic sources, and even deselected all of them, without success.
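This is roughly how we restricted the tactic sources with the Python API (the exact mask varied between tests):

```python
import tensorrt as trt

builder = trt.Builder(trt.Logger(trt.Logger.INFO))
config = builder.create_builder_config()

# Keep only the cuBLAS/cuBLASLt tactics, dropping cuDNN:
config.set_tactic_sources(
    (1 << int(trt.TacticSource.CUBLAS)) | (1 << int(trt.TacticSource.CUBLAS_LT))
)

# ...or deselect every tactic source, which is what we tried as a last resort:
config.set_tactic_sources(0)
```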

  3. Add a custom allocator

We have created a custom allocator with the default implementation, but logging every memory allocation request, to try to find a clue about this high memory consumption. However, only one allocation request of about 4 MB is caught this way, very far from the 1.1 GB consumed.
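A rough sketch of the idea (the IGpuAllocator method signatures shown here follow the 8.x Python docs and may differ slightly between TensorRT versions; the cuda-python bindings are used here for the raw allocations):

```python
import tensorrt as trt
from cuda import cudart  # cuda-python bindings, used for the raw allocations

class LoggingAllocator(trt.IGpuAllocator):
    """Default-like allocator that logs every request it receives."""

    def allocate(self, size, alignment, flags):
        err, ptr = cudart.cudaMalloc(size)
        print(f"[allocator] allocate {size} bytes -> 0x{ptr:x} (err={err})")
        return ptr

    def deallocate(self, memory):
        print(f"[allocator] free 0x{memory:x}")
        cudart.cudaFree(memory)
        return True

allocator = LoggingAllocator()
runtime = trt.Runtime(trt.Logger(trt.Logger.VERBOSE))
runtime.gpu_allocator = allocator  # allocations made during deserialization go through us

with open("model.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
```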

  4. Limit the workspace

We have also tried to limit the workspace memory using setMaxWorkspaceSize, without success.
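The Python counterpart of what we tried (the 256 MiB limit below is just an example value):

```python
import tensorrt as trt

builder = trt.Builder(trt.Logger(trt.Logger.INFO))
config = builder.create_builder_config()

# Equivalent of the C++ setMaxWorkspaceSize call; on TensorRT 8.4+ the
# memory-pool API is the non-deprecated way to cap the builder workspace:
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 256 * 1024 * 1024)  # 256 MiB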

  5. Try conversion from ONNX using the trtexec tool.

After some time investigating the issue, we found what we think is the root cause. The ONNX model has InstanceNormalization layers. We believe that the conversion of this layer is not correctly supported in our current TensorRT version (8.5.2), as we found this information in the release notes of version 8.6.1:

INormalization layer has been added to support InstanceNormalization, GroupNormalization, and LayerNormalization operations in ONNX.

To test this, we decided to create a very simple ONNX model containing just an input of dimension (1, 3, 256, 256), one single InstanceNormalization intermediate layer, and the output.
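Roughly, this toy model can be generated with the onnx Python helpers as follows (the names, epsilon, and opset below are illustrative):

```python
import onnx
from onnx import TensorProto, helper

C = 3  # number of channels of the (1, 3, 256, 256) input

node = helper.make_node(
    "InstanceNormalization",
    inputs=["input", "scale", "bias"],
    outputs=["output"],
    epsilon=1e-5,
)

graph = helper.make_graph(
    [node],
    "instancenorm_only",
    inputs=[helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, C, 256, 256])],
    outputs=[helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, C, 256, 256])],
    initializer=[
        helper.make_tensor("scale", TensorProto.FLOAT, [C], [1.0] * C),
        helper.make_tensor("bias", TensorProto.FLOAT, [C], [0.0] * C),
    ],
)

model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)
onnx.save(model, "instancenorm_only.onnx")
```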

When converting this model into a TRT engine and serializing it to a file, we see that it weighs just 2.4 KB on disk. However, RAM usage grows dramatically to 940 MB when loading it for inference, so it looks like this layer is not well supported by our TensorRT version on the Jetson Orin Nano. During the conversion we see this debug information related to the layer:

VERBOSE: Registered plugin creator - ::InstanceNormalization_TRT version 1
VERBOSE: Registered plugin creator - ::InstanceNormalization_TRT version 2

Here are the onnx and tensorrt engine files of this simplified model, in case you want to inspect them:
models.zip (1.4 KB)

Interestingly, we have been able to convert and use our model on a Jetson TX2 with the older TensorRT 8.0.1 without the high RAM usage observed on the Orin, despite seeing these messages during the conversion:

VERBOSE: ModelImporter.cpp:125: InstanceNormalization_135 [InstanceNormalization] inputs: [417 -> (-1, 16, -1)], [418 -> (16)], [419 -> (16)],
ERROR: builtin_op_importers.cpp:1595 In function importInstanceNormalization:
[8] Assertion failed: !isDynamic(tensorPtr->getDimensions()) && "InstanceNormalization does not support dynamic inputs!"

Do you know why we may be facing this issue? Is there any workaround we can try?

Thank you.


Hi,

The memory is used for loading the CUDA-related binaries, which can take more than 600 MB.
To fix this, we introduced a lazy loading feature in CUDA 11.8 that avoids loading unnecessary binaries at initialization.
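For reference, lazy loading is controlled through the CUDA_MODULE_LOADING environment variable and must be set before the process initializes CUDA, for example:

```python
import os

# Lazy module loading is opt-in on CUDA 11.8 and must be set before the process
# creates its CUDA context (i.e. before TensorRT/CUDA is initialized):
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

import tensorrt as trt  # initialize TensorRT only after the variable is set
```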

Is CUDA 11.8 an option for you?
You can upgrade from JetPack 5.1.3 without reflashing.

Thanks.

Hello, thank you for your reply.

We believe that the memory increase is not due to loading the CUDA-related binaries; we loaded another engine first to ensure that the binaries were loaded prior to the measurement.
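A sketch of the kind of measurement we are doing (engine file names are placeholders; process RSS is only a proxy for the system-wide CPU+GPU figure we quoted):

```python
import psutil
import tensorrt as trt

proc = psutil.Process()
runtime = trt.Runtime(trt.Logger(trt.Logger.INFO))

# Warm-up: deserialize a different engine first so the CUDA/TensorRT libraries
# are already resident before we measure.
with open("warmup.trt", "rb") as f:
    _ = runtime.deserialize_cuda_engine(f.read())

rss_before = proc.memory_info().rss
with open("instancenorm_only.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()
rss_after = proc.memory_info().rss

print(f"RSS increase: {(rss_after - rss_before) / 2**20:.1f} MiB")
```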

Could the problem be the InstanceNormalization plugin? Is there any workaround to avoid using this plugin?

I have managed to install CUDA 11.8 with the link you provided. However, when installing TensorRT (via apt-get install python3-libnvinfer), I see that apt also installs CUDA 11.4 back onto my system instead of using the CUDA 11.8 installation.

Hi,

Unfortunately, only CUDA is upgradable on JetPack 5.

We just checked the InstanceNormalization implementation, and it looks like cuDNN is required.
This might be the cause of the large memory usage.

Have you tried to remove cuDNN dependencies when converting a TensorRT engine?

Thanks.

Hello,

We have tried converting the model with the CUDNN tactic source disabled, following the Python API docs.

Unfortunately, this didn't improve memory usage at runtime. Is this enough to remove the cuDNN dependency at runtime, or is there anything else we can try to rule it out as the source of the problem?

Hi,

Sorry for the late update.

Could you share the log from converting the model to TensorRT with verbose logging enabled?
It looks like the plugin requires cuDNN, so we would expect to see some error or fallback during the conversion.

Thanks.

Hello,

Thank you for your reply. Here are the conversion logs, with the TRT logger level set to verbose:

With CUDNN tactic enabled:
conversion.log (13.6 MB)

With CUDNN tactic disabled:
conversion2.log (13.6 MB)

Hi,

This is what we saw in your log (the two logs are largely similar):

[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_0 + Relu_1 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_2 + Relu_3 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_4 + Relu_5 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: MaxPool_6 Host Persistent: 1280 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_7 + Relu_8 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_9 + Add_10 + Relu_11 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_12 + Relu_13 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_14 + Add_15 + Relu_16 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_17 + Relu_18 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_19 + Add_20 + Relu_21 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_22 + Relu_23 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: AveragePool_25 Host Persistent: 1280 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_24 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_26 + Add_27 + Relu_28 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_29 + Relu_30 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_31 + Add_32 + Relu_33 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_34 + Relu_35 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_36 + Add_37 + Relu_38 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_39 + Relu_40 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_41 + Add_42 + Relu_43 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_44 + Relu_45 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_46 + Add_47 + Relu_48 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_49 + Relu_50 Host Persistent: 2624 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: AveragePool_52 Host Persistent: 1280 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_51 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_53 + Add_54 + Relu_55 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_56 + Relu_57 Host Persistent: 2624 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_58 + Add_59 + Relu_60 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_61 + Relu_62 Host Persistent: 2624 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_63 + Add_64 + Relu_65 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_66 + Relu_67 Host Persistent: 2624 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: AveragePool_69 Host Persistent: 1280 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_68 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_70 + Add_71 + Relu_72 Host Persistent: 2624 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_73 + Relu_74 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_75 + Add_76 + Relu_77 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_80 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_79 + Add_100 Host Persistent: 2624 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_78 + Add_120 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_121 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_124 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_130 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_122 + Add_125 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_126 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_128 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: InstanceNormalization_135 Host Persistent: 112 Device Persistent: 0 Scratch Memory: 128
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_123 + Add_127 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_172 Host Persistent: 2624 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_129 Host Persistent: 3136 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: InstanceNormalization_177 Host Persistent: 112 Device Persistent: 0 Scratch Memory: 128
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_214 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_141 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: InstanceNormalization_219 Host Persistent: 112 Device Persistent: 0 Scratch Memory: 128
[10/15/2024-19:00:02] [TRT] [V] Layer: InstanceNormalization_146 Host Persistent: 112 Device Persistent: 0 Scratch Memory: 128
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_183 Host Persistent: 2624 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: InstanceNormalization_188 Host Persistent: 112 Device Persistent: 0 Scratch Memory: 128
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_225 Host Persistent: 2192 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_153 + bbox_head.scales.0.scale + (Unnamed Layer* 166) [Shuffle] + Mul_154 || Conv_155 || Conv_152 Host Persistent: 512 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: InstanceNormalization_230 Host Persistent: 112 Device Persistent: 0 Scratch Memory: 128
[10/15/2024-19:00:02] [TRT] [V] Layer: PWN(Sigmoid_163) Host Persistent: 244 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_195 + bbox_head.scales.1.scale + (Unnamed Layer* 190) [Shuffle] + Mul_196 || Conv_197 || Conv_194 Host Persistent: 3264 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: Conv_237 + bbox_head.scales.2.scale + (Unnamed Layer* 204) [Shuffle] + Mul_238 || Conv_239 || Conv_236 Host Persistent: 2192 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: PWN(Sigmoid_205) Host Persistent: 244 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Layer: PWN(Sigmoid_247) Host Persistent: 244 Device Persistent: 0 Scratch Memory: 0
[10/15/2024-19:00:02] [TRT] [V] Skipped printing memory information for 39 layers with 0 memory size i.e. Host Persistent + Device Persistent + Scratch Memory == 0.
[10/15/2024-19:00:02] [TRT] [I] Total Host Persistent Memory: 161088
[10/15/2024-19:00:02] [TRT] [I] Total Device Persistent Memory: 0
[10/15/2024-19:00:02] [TRT] [I] Total Scratch Memory: 128

InstanceNormalization does increase memory by 128 MB. Is this what you want to save?
This is cache memory and is calculated based on the hardware info.

Thanks.

Hello, thank you for your reply.

In our case, we see an increase of about 900 MB when loading the engine for inference, even for the tiny model with just one InstanceNormalization layer described above. Is there any way we can obtain logs during the engine load? Maybe that would help us find more information about the issue.
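For completeness, this is roughly how we load the engine (the file name is a placeholder); we can rerun it with the VERBOSE logger shown here and also report engine.device_memory_size if that helps:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)  # verbose messages during deserialization
runtime = trt.Runtime(logger)

with open("model.trt", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Memory the engine itself requests for execution (activations/scratch),
# as opposed to whatever the underlying libraries allocate on load:
print("engine.device_memory_size =", engine.device_memory_size, "bytes")

context = engine.create_execution_context()
```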

Hi,

900 MB sounds like the memory used for loading the CUDA binaries, especially since the issue cannot be reproduced on r36, which already has the lazy loading feature.

What kind of log do you want to get?
If you want to save memory, we still recommend upgrading to newer software.

Thanks.
