I would like to understand how to properly manage VRAM with
trt_model = torch.export.load(file).module()
Say I have two models and limited memory, such that only one of them can fit in VRAM at a time. I also only need to run one of the two modules at any given moment.
What is the best way to keep swapping these two modules into VRAM, without having to load from file every time? Is this even possible?
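One common pattern for this is to keep both modules resident in host RAM after the one-time `torch.export.load()`, and move only the active module's weights onto the device, evicting the other first. The sketch below is a hypothetical, framework-agnostic version of that pattern (class names are my own); with plain PyTorch modules, `to_device`/`to_host` would map to `module.cuda()` / `module.cpu()`, though whether a torch_tensorrt engine survives that round-trip cleanly depends on the runtime and is worth verifying.

```python
class HostResidentModule:
    """Hypothetical stand-in for a loaded module: tracks whether its
    weights currently live on the host or on the device."""
    def __init__(self, name):
        self.name = name
        self.location = "host"

    def to_device(self):
        self.location = "device"

    def to_host(self):
        self.location = "host"


class SwapManager:
    """Keep every module in host RAM; upload only the active one,
    evicting the previous module first so both never occupy VRAM
    at the same time."""
    def __init__(self, modules):
        self.modules = {m.name: m for m in modules}
        self.active = None

    def activate(self, name):
        if self.active == name:
            return self.modules[name]
        if self.active is not None:
            self.modules[self.active].to_host()  # free VRAM first
        mod = self.modules[name]
        mod.to_device()                          # then upload
        self.active = name
        return mod
```

This avoids reloading from disk, but the host-to-device copy still happens on every swap, which matters if the engines are large.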
Environment
TensorRT Version: 10.7
GPU Type: RTX 4090, RTX 4060
Nvidia Driver Version: 560.35.05
CUDA Version: 12.4
CUDNN Version: 9.*
Operating System + Version: Ubuntu 22.04
Python Version (if applicable): 3.10
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 12.4
Baremetal or Container (if container which image + tag):
Hi @rohan.a ,
Please check and let me know if this helps.
Optimizing GPU Memory Usage:
Minimize memory fluctuations by optimizing GPU memory usage during model initialization and inference. This can help ensure smoother transitions as you switch between models.
Utilizing GPU Memory Efficiently:
Employ memory reuse and memory sharing techniques to manage your limited VRAM effectively. Carefully allocate and deallocate memory to avoid unnecessary overhead.
Model Quantization:
Explore quantization, for example adopting INT8 precision, to reduce the models' memory footprint while maintaining an acceptable level of accuracy; this helps make the best use of limited VRAM.
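The memory arithmetic behind this tip can be sketched with a quick back-of-the-envelope estimate (the 100M-parameter model size here is a hypothetical example):

```python
def weight_footprint_bytes(n_params, bytes_per_param):
    # Weight-only lower bound; a real engine also needs activation
    # buffers, workspace, and format padding on top of this.
    return n_params * bytes_per_param

n = 100_000_000                         # hypothetical 100M-parameter model
fp32 = weight_footprint_bytes(n, 4)     # FP32: 4 bytes per weight
int8 = weight_footprint_bytes(n, 1)     # INT8: 1 byte per weight
print(fp32 // 10**6, int8 // 10**6)     # ~400 MB vs ~100 MB of weights
```

A roughly 4x reduction in weight storage also shrinks the host-to-device copy proportionally, which matters if swap latency is dominated by the transfer.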
GPUDirect Technology:
Leverage NVIDIA GPUDirect technology for direct transfers between GPUs, minimizing host-memory involvement, which helps optimize VRAM usage when switching models.
Implementing Memory Compression:
Consider memory compression techniques to lower the memory footprint of the models while in VRAM. This can be beneficial for maximizing VRAM utilization.
Dynamic Memory Management:
Use dynamic memory management strategies for the efficient allocation and deallocation of memory based on the models’ specific requirements, helping to optimize VRAM usage.
Monitoring and Profiling:
Continuously monitor and profile VRAM usage to identify inefficiencies or bottlenecks. By understanding usage patterns, you can fine-tune VRAM management strategies for better performance.
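A minimal polling profiler for this kind of monitoring can be sketched as follows. The `sample_fn` callable is a placeholder of my own: in practice you would wire it to `torch.cuda.memory_allocated()` or to a subprocess call such as `nvidia-smi --query-gpu=memory.used --format=csv`.

```python
import time

class VramMonitor:
    """Record labeled memory samples over time so you can see which
    phase (load, inference, unload) drives peak usage."""
    def __init__(self, sample_fn):
        self.sample_fn = sample_fn   # returns current usage in bytes
        self.samples = []

    def record(self, label=""):
        self.samples.append((time.perf_counter(), label, self.sample_fn()))

    def peak(self):
        return max((v for _, _, v in self.samples), default=0)
```

Recording a sample before and after each model swap makes it easy to confirm that the evicted model's memory is actually being released.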
Are these possible with the torch_tensorrt library? I can manage memory with the pure TensorRT Python API; I just have to allocate and deallocate between runs.
On further profiling, I have found that it's the CPU → GPU transfer of the engine that takes the bulk of the time. So, unfortunately, this means I would have to make the models much smaller and keep both in VRAM to optimize for time.
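To confirm this kind of conclusion, a small timing harness around the upload step is useful. The helper below is a hypothetical sketch: with PyTorch, `upload_fn` could be something like `lambda: (module.cuda(), torch.cuda.synchronize())`, since CUDA copies are asynchronous and must be synchronized before stopping the timer; pinning host memory (e.g. `tensor.pin_memory()`) typically speeds up exactly this host-to-device transfer.

```python
import time

def time_upload(upload_fn, repeats=5):
    """Return the best wall-clock time (seconds) over `repeats` runs
    of a host-to-device upload callable."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        upload_fn()
        best = min(best, time.perf_counter() - t0)
    return best
```

Taking the best of several runs reduces noise from one-off driver or allocator overhead, so the number reflects the steady-state transfer cost you would pay on every swap.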