I would like to understand how to properly manage VRAM with
trt_model = torch.export.load(file).module()
Say I have two models and limited memory, such that only one of them can fit in VRAM at a time. I also only need to run one of the two modules at any given moment.
What is the best way to keep swapping these two modules into VRAM, without having to load from file every time? Is this even possible?
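One common pattern for this is to keep both modules resident in host RAM after the one-time `torch.export.load()`, and move only the active module's weights onto the device, evicting the other first. The sketch below is a hypothetical, framework-agnostic version of that pattern (class names are my own); with plain PyTorch modules, `to_device`/`to_host` would map to `module.cuda()` / `module.cpu()`, though whether a torch_tensorrt engine survives that round-trip cleanly depends on the runtime and is worth verifying.

```python
class HostResidentModule:
    """Hypothetical stand-in for a loaded module: tracks whether its
    weights currently live on the host or on the device."""
    def __init__(self, name):
        self.name = name
        self.location = "host"

    def to_device(self):
        self.location = "device"

    def to_host(self):
        self.location = "host"


class SwapManager:
    """Keep every module in host RAM; upload only the active one,
    evicting the previous module first so both never occupy VRAM
    at the same time."""
    def __init__(self, modules):
        self.modules = {m.name: m for m in modules}
        self.active = None

    def activate(self, name):
        if self.active == name:
            return self.modules[name]
        if self.active is not None:
            self.modules[self.active].to_host()  # free VRAM first
        mod = self.modules[name]
        mod.to_device()                          # then upload
        self.active = name
        return mod
```

This avoids reloading from disk, but the host-to-device copy still happens on every swap, which matters if the engines are large.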
Environment
TensorRT Version: 10.7
GPU Type: RTX 4090, RTX 4060
Nvidia Driver Version: 560.35.05
CUDA Version: 12.4
CUDNN Version: 9.*
Operating System + Version: Ubuntu 22.04
Python Version (if applicable): 3.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 12.4
Baremetal or Container (if container which image + tag):
Hi @rohan.a ,
Please check and let me know if this helps.
Optimizing GPU Memory Usage:
Minimize memory fluctuations by optimizing GPU memory usage during model initialization and inference. This can help ensure smoother transitions as you switch between models.
Utilizing GPU Memory Efficiently:
Employ memory reuse and memory sharing techniques to manage your limited VRAM effectively. Carefully allocate and deallocate memory to avoid unnecessary overhead.
Model Quantization:
Explore quantization, for example adopting INT8 precision, to reduce the models' memory footprint while maintaining an acceptable level of accuracy; this helps make the best use of limited VRAM.
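The memory arithmetic behind this tip can be sketched with a quick back-of-the-envelope estimate (the 100M-parameter model size here is a hypothetical example):

```python
def weight_footprint_bytes(n_params, bytes_per_param):
    # Weight-only lower bound; a real engine also needs activation
    # buffers, workspace, and format padding on top of this.
    return n_params * bytes_per_param

n = 100_000_000                         # hypothetical 100M-parameter model
fp32 = weight_footprint_bytes(n, 4)     # FP32: 4 bytes per weight
int8 = weight_footprint_bytes(n, 1)     # INT8: 1 byte per weight
print(fp32 // 10**6, int8 // 10**6)     # ~400 MB vs ~100 MB of weights
```

A roughly 4x reduction in weight storage also shrinks the host-to-device copy proportionally, which matters if swap latency is dominated by the transfer.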
GPUDirect Technology:
Leverage NVIDIA GPUDirect technology for direct transfers between GPUs, minimizing host-memory involvement, which helps optimize VRAM usage when switching models.
Implementing Memory Compression:
Consider memory compression techniques to lower the memory footprint of the models while in VRAM. This can be beneficial for maximizing VRAM utilization.
Dynamic Memory Management:
Use dynamic memory management strategies for the efficient allocation and deallocation of memory based on the models’ specific requirements, helping to optimize VRAM usage.
Monitoring and Profiling:
Continuously monitor and profile VRAM usage to identify inefficiencies or bottlenecks. By understanding usage patterns, you can fine-tune VRAM management strategies for better performance.
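A minimal polling profiler for this kind of monitoring can be sketched as follows. The `sample_fn` callable is a placeholder of my own: in practice you would wire it to `torch.cuda.memory_allocated()` or to a subprocess call such as `nvidia-smi --query-gpu=memory.used --format=csv`.

```python
import time

class VramMonitor:
    """Record labeled memory samples over time so you can see which
    phase (load, inference, unload) drives peak usage."""
    def __init__(self, sample_fn):
        self.sample_fn = sample_fn   # returns current usage in bytes
        self.samples = []

    def record(self, label=""):
        self.samples.append((time.perf_counter(), label, self.sample_fn()))

    def peak(self):
        return max((v for _, _, v in self.samples), default=0)
```

Recording a sample before and after each model swap makes it easy to confirm that the evicted model's memory is actually being released.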
Are these possible with the torch_tensorrt library? I can manage memory with the pure TensorRT Python API; I just have to allocate and deallocate between runs.
On further profiling, I have found that it's the CPU → GPU transfer of the engine that takes the bulk of the time. So, unfortunately, this means I would have to make the models much smaller and keep both in VRAM to optimize for time.
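To confirm this kind of conclusion, a small timing harness around the upload step is useful. The helper below is a hypothetical sketch: with PyTorch, `upload_fn` could be something like `lambda: (module.cuda(), torch.cuda.synchronize())`, since CUDA copies are asynchronous and must be synchronized before stopping the timer; pinning host memory (e.g. `tensor.pin_memory()`) typically speeds up exactly this host-to-device transfer.

```python
import time

def time_upload(upload_fn, repeats=5):
    """Return the best wall-clock time (seconds) over `repeats` runs
    of a host-to-device upload callable."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        upload_fn()
        best = min(best, time.perf_counter() - t0)
    return best
```

Taking the best of several runs reduces noise from one-off driver or allocator overhead, so the number reflects the steady-state transfer cost you would pay on every swap.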