TensorRT-8: memory usage with dynamic input shapes

Description

During integration of dynamic shape support for a detection algorithm, I’ve encountered an interesting behavior of TensorRT. It seems that device memory consumption depends on the maximum input size across all optimization profiles and, in particular, is not limited by the currently selected profile or the current input resolution. I’ve tested this by adding 3 profiles, selecting the 2nd profile, and tweaking the max size of the 3rd profile (the actual input shape and the 1st/2nd profiles are kept unchanged); a sketch of the setup is included below.

So, it looks like device memory is allocated according to the worst possible case (i.e. the upper bound on the input shape across all profiles). Is this understanding correct? If so, is it possible to work around the limitation (for example, creating the execution context without memory and plugging in a sufficiently sized workspace buffer as needed)?

P.S. The motivation is to allow working with a wide range of input resolutions, but to pay (in terms of memory) for large resolutions only when such a resolution is actually encountered.
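
For concreteness, here is a minimal sketch of the profile setup I used (C++ API; "input" and the exact dimensions are placeholders for my detector, not the real values):

#include <NvInfer.h>

// Three profiles with growing resolution ranges.
void addProfiles(nvinfer1::IBuilder& builder, nvinfer1::IBuilderConfig& config)
{
    using nvinfer1::Dims4;
    using nvinfer1::OptProfileSelector;

    const Dims4 mins[] = {{1, 3, 128, 128}, {1, 3, 256, 256}, {1, 3, 512, 512}};
    const Dims4 opts[] = {{1, 3, 256, 256}, {1, 3, 512, 512}, {1, 3, 1024, 1024}};
    // Only the last kMAX below is tweaked between runs; the selected profile
    // (the 2nd one) and the actual input shape stay the same.
    const Dims4 maxs[] = {{1, 3, 512, 512}, {1, 3, 1024, 1024}, {1, 3, 2048, 2048}};

    for (int i = 0; i < 3; ++i)
    {
        nvinfer1::IOptimizationProfile* profile = builder.createOptimizationProfile();
        profile->setDimensions("input", OptProfileSelector::kMIN, mins[i]);
        profile->setDimensions("input", OptProfileSelector::kOPT, opts[i]);
        profile->setDimensions("input", OptProfileSelector::kMAX, maxs[i]);
        config.addOptimizationProfile(profile);
    }
    // After building the engine, ICudaEngine::getDeviceMemorySize() grows with
    // the largest kMAX across all profiles, regardless of which profile is used.
}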

Environment

TensorRT Version: v8.2.2.1
GPU Type: RTX 2070
Nvidia Driver Version: 470.63.01
CUDA Version: v11.4.3
CUDNN Version: v8.2.4.15

Any news on the topic?

Hi,

Sorry for the delay in addressing this issue. We will get back to you in 1 or 2 days.
Thank you.

Hi,

The following may help you.
https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_cuda_engine.html#ab431cff77cee511d37747920f6c2276f

Thank you.

Hi, @spolisetty

Thank you for replying.

I’ve looked into the suggested API. As I understand it, I would need to use the IExecutionContext::setDeviceMemory() [1] method to provide my own device memory buffer. The documentation of this method states that the buffer needs to be at least ICudaEngine::getDeviceMemorySize() [2] bytes. But getDeviceMemorySize() does not take any parameter that would let me select a particular optimization profile, and the API for choosing an optimization profile lives in IExecutionContext [3], not ICudaEngine.

In short, it seems that I can use my own device buffer, but its size does not depend on the optimization profile (and follows the same worst-case logic across all profiles). So I’m effectively in the same situation as without custom device buffers. Or did I miss something? A sketch of the usage I have in mind is below the links.

[1]https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_execution_context.html#a12f8214ba871e63ec9c4dac970bb9c39
[2]https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_cuda_engine.html#a5dbb256ba0555c4e58eac5e4b876c7ee
[3]https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_execution_context.html#a74c361a3d93e70a3164988df7d60a4cc
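
For reference, a minimal sketch of that usage (TensorRT 8.2 C++ API; error handling is omitted and the profile index and plain cudaMalloc are just placeholders):

#include <NvInfer.h>
#include <cuda_runtime_api.h>

void runWithExternalScratch(nvinfer1::ICudaEngine& engine, cudaStream_t stream)
{
    // A context that does not allocate its own scratch memory.
    nvinfer1::IExecutionContext* context = engine.createExecutionContextWithoutDeviceMemory();

    // The only available size query; it reflects the worst case across *all*
    // optimization profiles, with no way to ask about a single profile.
    const size_t scratchSize = engine.getDeviceMemorySize();

    void* scratch = nullptr;
    cudaMalloc(&scratch, scratchSize);
    context->setDeviceMemory(scratch);

    // The profile is chosen on the context, after the size is already fixed.
    context->setOptimizationProfileAsync(1, stream); // e.g. the 2nd profile
    // ... setBindingDimensions(), enqueueV2(), etc.

    cudaStreamSynchronize(stream);
    cudaFree(scratch);
    delete context;
}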

Hi,

Good question, yes, you’re right. Always allocating for the worst case is a simplifying assumption in the executor: we assume that the application needs to budget memory for its worst expected case. But this assumption isn’t always valid.

Thank you.

Hi, @spolisetty

Thank you for the information. Can we expect any improvements on the subject in the foreseeable future?

Yes, this may be improved in future releases.

Thank you.

Okay, thank you!

Hi! Are there any updates on this topic? TensorRT significantly reduces memory consumption for my model, but allocating memory for the worst case negates all of those improvements. :(
I would like to be able to control the required memory independently.


Bumping this, as this topic is of high concern to me. In my use case, I am using Triton to serve ~40 computer vision models (mainly CNNs). The VRAM usage is much higher after converting the models to TensorRT engines than it was for the original TensorFlow models, to the point that some models can no longer be initialized, whereas previously all of them could. This is quite a bummer, as TensorRT provided a decent speedup (~5x) for most of the models in my experiments.