We’re using the TRT 5.0 Python API to run our model and find that CPU RAM consumption is about 2.6G. Of this:
1.1G is consumed when creating the TRT runtime itself,
1.5G is additionally used after the call to deserialize_cuda_engine.
This size does not seem to vary much with the model’s input size or with FP16 vs FP32. We’ve also checked with a C++-only inference engine and get lower but still very high memory usage of approx. 1.9G. We checked different models with GPU memory usage between 0.8G and 1.4G.
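For anyone who wants to reproduce the per-step numbers above, here is a minimal sketch of how the RAM increase around each call could be measured. It assumes a Linux host (it reads /proc/self/statm and falls back to peak RSS elsewhere); `rss_bytes`, `measure` and `engine_bytes` are names made up for this example, not part of TensorRT.

```python
import os
import sys


def rss_bytes():
    """Return this process's resident set size in bytes.

    Reads /proc/self/statm on Linux; elsewhere falls back to *peak* RSS.
    """
    try:
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])  # 2nd field: resident pages
        return resident_pages * os.sysconf("SC_PAGE_SIZE")
    except OSError:
        import resource  # Unix-only stdlib module

        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss is reported in bytes on macOS and in KiB on Linux
        return peak if sys.platform == "darwin" else peak * 1024


def measure(label, step):
    """Run step() and report how much the RSS changed across the call."""
    before = rss_bytes()
    result = step()
    delta = rss_bytes() - before
    print(f"{label}: {delta / 2**20:+.1f} MiB")
    return result, delta


# Possible usage with TensorRT (not executed here; engine_bytes is a
# hypothetical variable holding the serialized engine read from disk):
#
#   import tensorrt as trt
#   logger = trt.Logger(trt.Logger.WARNING)
#   runtime, _ = measure("create runtime", lambda: trt.Runtime(logger))
#   engine, _ = measure("deserialize engine",
#                       lambda: runtime.deserialize_cuda_engine(engine_bytes))
```

Wrapping each step in a callable keeps the measurement separate from the TensorRT calls, so the same helper can be reused for the runtime creation and the deserialization step.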
There is a setting called max_workspace_size, which can affect the amount of consumed GPU memory, but in our case modifying this value did not produce significant differences.
My questions are:
Are these large values expected, or is the expected memory usage significantly lower?
How can we reduce the RAM usage? We aim for less than 0.5G of RAM.
Thanks,
Ran
Details:
output produced by a profiling tool showing the memory increase per line:
So there is no way to reduce this constant memory usage?
TRT requires 0.8G - 1.1G of RAM when loading no matter what? Are there any plans to improve this in future versions?
What about the 1.5G of RAM used when loading the model, is there a way to reduce it? The model itself is loaded onto the GPU, so why is there a need to hold so much CPU memory?
We are also facing a similar issue when loading a TensorRT engine. We are working on an application where multiple networks have to be loaded into RAM on a Jetson Nano. TensorRT takes more memory than expected even though each of my networks is only 50MB in size.
Please let us know if inference-essential CUDA libraries are available for the Nano.
TensorRT now uses cuBLASLt internally instead of cuBLAS. This decreases the overall runtime memory footprint. Users can revert to the old behavior by using the new setTacticSources API in IBuilderConfig.
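In the Python API the corresponding call is `IBuilderConfig.set_tactic_sources`, which takes a bitmask built from `trt.TacticSource` members. Below is a rough sketch of how that mask could be assembled, assuming a TensorRT version that exposes this API; the helper names `tactic_source_mask` and `enable_cublas_and_cublaslt` are made up for illustration, and which sources are enabled by default depends on the TensorRT version.

```python
def tactic_source_mask(sources):
    """Combine TacticSource members into the bitmask set_tactic_sources expects.

    Each source contributes the bit 1 << int(source).
    """
    mask = 0
    for source in sources:
        mask |= 1 << int(source)
    return mask


def enable_cublas_and_cublaslt(config, trt):
    """Enable both the cuBLAS and cuBLASLt tactic sources on a builder config.

    config: a trt.IBuilderConfig (e.g. from builder.create_builder_config())
    trt:    the imported tensorrt module, passed in so this sketch itself
            does not need TensorRT installed to be imported
    """
    mask = tactic_source_mask(
        [trt.TacticSource.CUBLAS, trt.TacticSource.CUBLAS_LT]
    )
    config.set_tactic_sources(mask)
    return mask


# Possible usage (not executed here):
#
#   import tensorrt as trt
#   builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
#   config = builder.create_builder_config()
#   enable_cublas_and_cublaslt(config, trt)
```

Passing a list with only one member to `tactic_source_mask` would restrict the builder to that single source, which may be worth trying when comparing memory footprints.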
I haven’t been able to test this on Jetson boards.