High RAM consumption with CUDA and TensorRT on Jetson Xavier NX

Hello.

We are having issues with high memory consumption on the Jetson Xavier NX, especially when using TensorRT via ONNX Runtime.

Our NN models are in FP32 by default, so we tried converting them to FP16, which makes the model file smaller. However, during inference the memory consumption is the same as with FP32.

I enabled FP16 inference using ORT_TENSORRT_FP16_ENABLE=1 as suggested by the ONNX Runtime documentation (TensorRT - onnxruntime), but it didn’t help.
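For reference, here is a minimal sketch of how the FP16 flag can be passed to the TensorRT execution provider from Python; the provider-options dictionary is an alternative to the environment variable, assumes a fairly recent ONNX Runtime build with the TensorRT EP, and the model path is a placeholder:

import onnxruntime as ort

# Minimal sketch: enable FP16 through TensorRT provider options instead of
# (or in addition to) the ORT_TENSORRT_* environment variables.
providers = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
    "CUDAExecutionProvider",  # fallback for nodes TensorRT cannot handle
]
session = ort.InferenceSession("model.onnx", providers=providers)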

Does Jetson Xavier NX support both FP16 and FP32, especially for CUDA and TensorRT?
Is there any other way to reduce the memory consumption when using CUDA and TensorRT?

We also found that, regardless of the NN model, a process using ONNX Runtime with the CUDA execution provider uses at least 1.5 GB of RAM, and with the TensorRT execution provider at least 2 GB of RAM. Is this expected?

Is there perhaps a lightweight version of the libraries that can be used on a device with limited RAM like the Jetson?

Thank you in advance
Marek

Hi,

Could you run your model with trtexec to test the memory required for TensorRT?

/usr/src/tensorrt/bin/trtexec --onnx=[your/model]                  #fp32
/usr/src/tensorrt/bin/trtexec --onnx=[your/model] --fp16           #fp16

Inference with cuDNN (TensorRT) requires at least 600 MB of memory just for loading the library.
If you run it with ONNX Runtime, you need even more memory to load all the required libraries.

Is using the pure TensorRT API an option for you?
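If it is, building an engine directly from the ONNX model looks roughly like this (a minimal sketch assuming the TensorRT 8.x Python API; older releases use builder.build_engine() plus engine.serialize() instead, and the file names are placeholders):

import tensorrt as trt

# Minimal sketch: parse the ONNX model and build a serialized TensorRT engine
# once, so inference later only needs to load the engine file.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("failed to parse model.onnx")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # equivalent to trtexec --fp16

serialized_engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized_engine)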

Thanks.

Hi,

Thank you for the prompt response.

I tested the same NN model as suggested and measured the memory consumption (maximum resident set size, in kbytes) using “/usr/bin/time -v”.

trtexec with FP32 model: 1987695 KB
trtexec with FP16 model: 2010336 KB
ONNX RT → TensorRT with FP32 model: 2172236 KB

Could you please send me a sample ONNX model so I can test both FP32 and FP16 in trtexec and ONNX Runtime?

We would like to avoid using trtexec directly.

Regards
Marek

Using trtexec reduces the memory consumption by 150-200 MB compared to ONNX Runtime with TensorRT, so this does not seem to be the issue.

For some NN models, the RAM consumption is much lower when the model is loaded from an existing engine file. Is it expected that model optimization (engine file creation) requires much more memory than inference alone?

We have observed that FP16 inference is slower compared to FP32. Is that possible? Our expectation was that the performance would be at least the same, maybe better, but not worse.

And the last issue: the TensorRT optimization of the FP16 model takes extremely long compared to FP32. How is that possible?

My understanding is that the Jetson Xavier NX supports FP16 inference.

Thank you.
Best Regards

Hi,

Any feedback regarding this topic?

Thanks

Hi,

First, have you maximized the device performance before benchmarking?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Since TensorRT loads the corresponding libraries only when required, it’s possible that the memory consumption with and without conversion will differ.

In general, FP16 should be faster than FP32, but it might take longer to convert.
Are you measuring only the inference time, or the whole execution time (including conversion and benchmarking)?
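For reference, a minimal sketch that times only the inference calls, after a warm-up run has already triggered the TensorRT engine build (the model path and input shape are placeholders):

import time
import numpy as np
import onnxruntime as ort

# Minimal sketch: measure only session.run(), excluding session creation and
# the TensorRT engine build that happens during the warm-up call.
session = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)

session.run(None, {input_name: dummy})  # warm-up (includes engine build)

start = time.perf_counter()
for _ in range(100):
    session.run(None, {input_name: dummy})
avg_ms = (time.perf_counter() - start) / 100 * 1000
print("average inference time: {:.2f} ms".format(avg_ms))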

Xavier NX (sm=7.2) supports FP32, FP16, and INT8 inference.
You can find the support matrix below:

Thanks

Hi @AastaLLL,

Yes, we have maximized the performance of the device.

We tested two approaches:

  1. The first approach keeps the original FP32 NN model but enables FP16 in TensorRT.
    MEMORY: We did not observe any significant change in memory consumption with caching disabled. Only with TRT engine file caching enabled were we able to save up to 700 MB of memory.
    INFERENCE PERF: For some NN models the inference performance improved by a factor of two.
    NN MODEL LOAD: However, loading/compiling the NN model is 4-5 times slower.

  2. The second approach converts the ONNX model to FP16 using a script provided by ONNX (Microsoft) and tests both FP32 and FP16 inference (see the sketch after this list).
    MEMORY: There is no difference in memory usage with the FP16 model.
    INFERENCE PERF: The inference performance decreased significantly with the FP16 model.
    NN MODEL LOAD: Loading/compiling the FP16 ONNX model is much slower compared to the original FP32 model.
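For reference, a minimal sketch of the kind of FP32 → FP16 conversion used in approach 2, assuming the onnxconverter-common float16 utility (file names are placeholders):

import onnx
from onnxconverter_common import float16

# Minimal sketch: convert the weights/initializers of an FP32 ONNX model to
# FP16 with the onnxconverter-common utility.
model = onnx.load("model_fp32.onnx")
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")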

Do you have any advice for us?

Thank you
Regards

Hi,

Usually, we recommend users take approach 1, since the conversion (model → engine) is a one-time job.
You can deserialize the engine file at runtime to save memory and keep the performance.
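Deserializing the engine file looks roughly like this (a minimal sketch with the TensorRT Python API; the engine file name is a placeholder):

import tensorrt as trt

# Minimal sketch: load a previously serialized engine instead of rebuilding it,
# which skips the expensive optimization step at startup.
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()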

For the ONNX Runtime integrated version, have you checked this issue with the ONNX Runtime team?
If not, would you mind doing so? They may have more information about the ONNX-based implementation.

Thanks.

Hi @AastaLLL,

Could you please explain the deserialization process?

Hi,

It should be the same as the “caching” you mentioned above.
You can find our proposed workflow in detail here:

[workflow diagram]
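Within ONNX Runtime, the same serialize-once/deserialize-later behavior corresponds to the TensorRT engine cache options (a minimal sketch; the option names assume a recent ONNX Runtime TensorRT execution provider, and the paths are placeholders):

import onnxruntime as ort

# Minimal sketch: enable the TensorRT engine cache so the engine is built and
# serialized once, then deserialized from disk on subsequent runs.
providers = [
    ("TensorrtExecutionProvider", {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": "./trt_cache",
    }),
    "CUDAExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)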

Thanks.