High RAM consumption with CUDA and TensorRT on Jetson Xavier NX

Hello.

We are having issues with high memory consumption on the Jetson Xavier NX, especially when using TensorRT via ONNX Runtime.

Our NN models are in FP32 by default, so we tried converting them to FP16, which makes the model file smaller. However, during inference the memory consumption is the same as with FP32.

I enabled FP16 inference using ORT_TENSORRT_FP16_ENABLE=1 as suggested by the ONNX Runtime documentation (TensorRT - onnxruntime), but it didn’t help.
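For reference, here is a minimal sketch of how the FP16 flag can be passed to the TensorRT execution provider from Python; the provider-options dictionary is an alternative to the environment variable, assumes a fairly recent ONNX Runtime build with the TensorRT EP, and the model path is a placeholder:

import onnxruntime as ort

# Minimal sketch: enable FP16 through TensorRT provider options instead of
# (or in addition to) the ORT_TENSORRT_* environment variables.
providers = [
    ("TensorrtExecutionProvider", {"trt_fp16_enable": True}),
    "CUDAExecutionProvider",  # fallback for nodes TensorRT cannot handle
]
session = ort.InferenceSession("model.onnx", providers=providers)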

Does Jetson Xavier NX support both FP16 and FP32, especially for CUDA and TensorRT?
Is there any other way to reduce the memory consumption when using CUDA and TensorRT?

We also found that, regardless of the NN model, a process using ONNX Runtime with the CUDA execution provider uses at least 1.5 GB of RAM, and with the TensorRT execution provider at least 2 GB of RAM. Is this expected?

Is there perhaps a lightweight version of the libraries that can be used on a device with limited RAM like the Jetson?

Thank you in advance
Marek

Hi,

Could you run your model with trtexec to test the memory required for TensorRT?

/usr/src/tensorrt/bin/trtexec --onnx=[your/model]                  #fp32
/usr/src/tensorrt/bin/trtexec --onnx=[your/model] --fp16           #fp16

Inference with cuDNN (TensorRT) requires at least 600 MB of memory just for loading the library.
If you run it with ONNX Runtime, you need even more memory to load all the required libraries.

Is using the pure TensorRT API an option for you?
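If it is, building an engine directly from the ONNX model looks roughly like this (a minimal sketch assuming the TensorRT 8.x Python API; older releases use builder.build_engine() plus engine.serialize() instead, and the file names are placeholders):

import tensorrt as trt

# Minimal sketch: parse the ONNX model and build a serialized TensorRT engine
# once, so inference later only needs to load the engine file.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("failed to parse model.onnx")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # equivalent to trtexec --fp16

serialized_engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:
    f.write(serialized_engine)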

Thanks.

Hi,

Thank you for the prompt response.

I tested the same NN model as suggested and measured the memory consumption (maximum resident set size, in kbytes) using “/usr/bin/time -v”.

trtexec with FP32 model: 1987695 KB
trtexec with FP16 model: 2010336 KB
ONNX RT → TensorRT with FP32 model: 2172236 KB

Could you please send me a sample ONNX model so I can test both FP32 and FP16 in trtexec and ONNX Runtime?

We would like to avoid using trtexec directly.

Regards
Marek

Using trtexec reduces the memory consumption by 150-200 MB compared to ONNX Runtime with TensorRT, so this does not seem to be the issue.

For some NN models, the RAM consumption is much lower when the model is loaded from an existing engine file. Is it expected that model optimization (engine file creation) requires much more memory than inference alone?

We have observed that FP16 inference is slower compared to FP32. Is that possible? Our expectation was that the performance would be at least the same, maybe better, but not worse.

And the last issue: the TensorRT optimization of the FP16 model takes extremely long compared to FP32. How is that possible?

My understanding is that the Jetson Xavier NX supports FP16 inference.

Thank you.
Best Regards

Hi,

Any feedback regarding this topic?

Thanks

Hi,

First, have you maximized the device performance before benchmarking?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Since TensorRT loads the corresponding libraries only when required, it’s possible that the memory consumption with and without conversion will differ.

In general, FP16 should be faster than FP32, but it might take longer to convert.
Are you measuring only the inference time, or the whole execution time (including conversion and benchmarking)?
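For reference, a minimal sketch that times only the inference calls, after a warm-up run has already triggered the TensorRT engine build (the model path and input shape are placeholders):

import time
import numpy as np
import onnxruntime as ort

# Minimal sketch: measure only session.run(), excluding session creation and
# the TensorRT engine build that happens during the warm-up call.
session = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"])
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)

session.run(None, {input_name: dummy})  # warm-up (includes engine build)

start = time.perf_counter()
for _ in range(100):
    session.run(None, {input_name: dummy})
avg_ms = (time.perf_counter() - start) / 100 * 1000
print("average inference time: {:.2f} ms".format(avg_ms))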

Xavier NX (sm=7.2) supports FP32, FP16, and INT8 inference.
You can find the support matrix below:

Thanks

Hi @AastaLLL,

Yes, we have maximized the performance of the device.

We tested two approaches:

  1. The first approach keeps the original FP32 NN model but enables FP16 in TensorRT.
    MEMORY: We did not observe any significant change in memory consumption with caching disabled. Only with TRT engine file caching enabled were we able to save up to 700 MB of memory.
    INFERENCE PERF: For some NN models the inference performance improved by a factor of two.
    NN MODEL LOAD: However, loading/compiling the NN model is 4-5 times slower.

  2. The second approach converts the ONNX model to FP16 using a script provided by ONNX (Microsoft) and tests both FP32 and FP16 inference (see the sketch after this list).
    MEMORY: There is no difference in memory usage with the FP16 model.
    INFERENCE PERF: The inference performance decreased significantly with the FP16 model.
    NN MODEL LOAD: Loading/compiling the FP16 ONNX model is much slower compared to the original FP32 model.
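For reference, a minimal sketch of the kind of FP32 → FP16 conversion used in approach 2, assuming the onnxconverter-common float16 utility (file names are placeholders):

import onnx
from onnxconverter_common import float16

# Minimal sketch: convert the weights/initializers of an FP32 ONNX model to
# FP16 with the onnxconverter-common utility.
model = onnx.load("model_fp32.onnx")
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")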

Do you have any advice for us?

Thank you
Regards

Hi,

Usually, we recommend users take approach 1, since the conversion (model → engine) is a one-time job.
You can deserialize the engine file at runtime to save memory and keep the performance.
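Deserializing the engine file looks roughly like this (a minimal sketch with the TensorRT Python API; the engine file name is a placeholder):

import tensorrt as trt

# Minimal sketch: load a previously serialized engine instead of rebuilding it,
# which skips the expensive optimization step at startup.
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()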

For the ONNX Runtime integrated version, have you checked this issue with the ONNX Runtime team?
If not, would you mind doing so? They may have more information about the ONNX-based implementation.

Thanks.

Hi @AastaLLL,

Could you please explain the deserialization process?

Hi,

It should be the same as the “caching” you mentioned above.
You can find our proposed workflow in detail here:

[workflow diagram]
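Within ONNX Runtime, the same serialize-once/deserialize-later behavior corresponds to the TensorRT engine cache options (a minimal sketch; the option names assume a recent ONNX Runtime TensorRT execution provider, and the paths are placeholders):

import onnxruntime as ort

# Minimal sketch: enable the TensorRT engine cache so the engine is built and
# serialized once, then deserialized from disk on subsequent runs.
providers = [
    ("TensorrtExecutionProvider", {
        "trt_engine_cache_enable": True,
        "trt_engine_cache_path": "./trt_cache",
    }),
    "CUDAExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)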

Thanks.