Hello,
We are facing a problem with the conversion of one of our ONNX models.
Our models are converted from ONNX to TRT by parsing the ONNX file and serializing it as a TRT engine. In this process we also convert FP32 to FP16. The converted model is about 4 MB on disk but consumes about 1.1 GB of system RAM (CPU+GPU) when it is loaded. This is very strange, because the same model running on a Jetson TX2 device only requires about 70 MB there. The ONNX model and the conversion script are the same for both devices.
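In case it helps, this is a simplified Python sketch of what the conversion does (our actual script differs in details; file names are placeholders and most error handling is omitted):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def convert(onnx_path: str, engine_path: str) -> None:
    """Parse an ONNX file and serialize it as a TensorRT engine with FP16 enabled."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parsing failed")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # FP32 -> FP16 conversion

    engine_bytes = builder.build_serialized_network(network, config)
    if engine_bytes is None:
        raise RuntimeError("Engine build failed")
    with open(engine_path, "wb") as f:
        f.write(engine_bytes)

convert("model.onnx", "model.trt")  # placeholder file names
```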
The environment we are using on the Jetson Orin Nano is:
- L4T: 35.5.0
- JetPack: 5.1.3
- CUDA: 11.4.19
- TensorRT: 8.5.2
So far, we have done the following tests:
- Tried using a newer JetPack
We have tested a newer JetPack using the nvcr.io/nvidia/l4t-ml:r36.2.0-py3 Docker container on the Orin Nano. When applying the same ONNX->TRT conversion and loading the resulting engine, its RAM footprint is about 15 MB, even better than on the TX2 device and almost two orders of magnitude less than on the native JetPack. Unfortunately, we cannot update to this JetPack version at this time; we must continue using 5.1.3.
- Tested different sets of tactic sources
As model loading shows a message like “INFO: [MemUsageChange] Init cuDNN: CPU +0, GPU +18, now: CPU 1073, GPU 2673 (MiB)”, we suspect cuDNN and/or cuBLAS initialization. We have tried different sets of tactic sources, and even deselected all of them, without success (see the sketch below).
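The tactic experiments were along these lines (Python sketch, applied to the builder config from the conversion sketch above; the exact source combinations we tried varied):

```python
import tensorrt as trt

def restrict_tactics(config: trt.IBuilderConfig) -> None:
    # Disable all optional tactic sources (cuBLAS, cuBLASLt, cuDNN, ...).
    config.set_tactic_sources(0)
    # Or enable only a chosen subset, e.g. cuBLAS only:
    # config.set_tactic_sources(1 << int(trt.TacticSource.CUBLAS))
```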
- Added a custom allocator
We have created a custom allocator with the default behavior but logging every memory allocation request, trying to find a clue about this high memory consumption. However, only one allocation request of about 4 MB is caught this way, very far from the 1.1 GB consumed.
- Limited the workspace size
We have also tried to limit the workspace memory using setMaxWorkspaceSize, without success (see the sketch below).
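The workspace limit was applied roughly like this (Python sketch on the same builder config; max_workspace_size is the Python counterpart of setMaxWorkspaceSize, and the 256 MiB value is just an example):

```python
import tensorrt as trt

def limit_workspace(config: trt.IBuilderConfig, size_bytes: int = 256 << 20) -> None:
    # What we currently use; deprecated in recent TensorRT releases.
    config.max_workspace_size = size_bytes
    # Newer equivalent (TensorRT 8.4+), kept here for reference:
    # config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, size_bytes)
```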
- Tried the conversion from ONNX using the trtexec tool.
After some time investigating the issue, we found what we think is the root cause. The ONNX model has InstanceNormalization layers. We believe that the conversion of this layer is not correctly supported in our current TensorRT version (8.5.2), as we found this information in the release notes of version 8.6.1:
INormalization layer has been added to support InstanceNormalization, GroupNormalization, and LayerNormalization operations in ONNX.
To test this, we created a very simple ONNX model containing just an input of dimension (1, 3, 256, 256), a single InstanceNormalization layer in between, and the output.
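Such a repro model can be generated with a few lines of Python using onnx.helper (sketch; the opset version, epsilon, and tensor names below are illustrative and may not match the attached file exactly):

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

C = 3  # number of channels

inp = helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, C, 256, 256])
out = helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, C, 256, 256])

# Per-channel scale and bias required by InstanceNormalization.
scale = numpy_helper.from_array(np.ones(C, dtype=np.float32), name="scale")
bias = numpy_helper.from_array(np.zeros(C, dtype=np.float32), name="bias")

node = helper.make_node(
    "InstanceNormalization", ["input", "scale", "bias"], ["output"], epsilon=1e-5)

graph = helper.make_graph(
    [node], "instancenorm_repro", [inp], [out], initializer=[scale, bias])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])

onnx.checker.check_model(model)
onnx.save(model, "instancenorm_repro.onnx")  # placeholder file name
```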
When converting this model into a TRT engine and serializing it to a file, we see that it is just 2.4 KB on disk. However, RAM usage grows dramatically to 940 MB when loading it for inference, so it looks like this layer is not well supported by our TensorRT version on the Jetson Orin Nano. During the conversion, we see this debug information related to this layer:
VERBOSE: Registered plugin creator - ::InstanceNormalization_TRT version 1
VERBOSE: Registered plugin creator - ::InstanceNormalization_TRT version 2
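For clarity, “loading it for inference” here means just deserializing the engine and creating an execution context, along these lines (Python sketch; the file name is a placeholder):

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def load_engine(engine_path: str) -> trt.ICudaEngine:
    """Deserialize a TensorRT engine; this is the step where the ~940 MB shows up."""
    runtime = trt.Runtime(TRT_LOGGER)
    with open(engine_path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine("instancenorm_repro.trt")  # placeholder file name
context = engine.create_execution_context()
```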
Here are the ONNX and TensorRT engine files of this simplified model, in case you want to inspect them:
models.zip (1.4 KB)
Interestingly, we have been able to convert and use our model on a Jetson TX2 with the older TensorRT 8.0.1 without the high RAM usage observed on the Orin, despite seeing these messages during the conversion:
VERBOSE: ModelImporter.cpp:125: InstanceNormalization_135 [InstanceNormalization] inputs: [417 -> (-1, 16, -1)], [418 -> (16)], [419 -> (16)],
ERROR: builtin_op_importers.cpp:1595 In function importInstanceNormalization:
[8] Assertion failed: !isDynamic(tensorPtr->getDimensions()) && "InstanceNormalization does not support dynamic inputs!"
Do you know why we may be facing this issue? Is there any workaround we can try?
Thank you.