TensorRT-8: memory usage with dynamic input shapes

Description

During integration of dynamic shape support for a detection algorithm, I’ve encountered an interesting behavior of TensorRT. It seems that device memory consumption depends on the maximum input size across all optimization profiles; in particular, it is not limited to the currently selected profile or the current input resolution. I’ve tested this by adding 3 optimization profiles, selecting the 2nd profile and tweaking the max shape of the 3rd profile (the actual input shape and the 1st/2nd profiles were kept unchanged).

So it looks like device memory is allocated according to the worst possible case (i.e., the upper bound on the input shape across all profiles). Is this understanding correct? If so, is it possible to work around this limitation (for example, by creating the engine without device memory and plugging in a sufficiently large workspace buffer as needed)?

P.S. The motivation is to work with a wide range of input resolutions, but pay (in terms of memory) for large resolutions only when such a large resolution is actually encountered.
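
For reference, a minimal sketch of the experiment described above (the input tensor name "input" and the concrete shapes are placeholders, not the exact values from my network):

```cpp
#include <NvInfer.h>

void addProfiles(nvinfer1::IBuilder& builder, nvinfer1::IBuilderConfig& config)
{
    using nvinfer1::Dims4;
    using nvinfer1::OptProfileSelector;

    // Three profiles; only the MAX shape of the 3rd one is varied between runs.
    const Dims4 maxShapes[3] = {
        Dims4{1, 3,  512,  512},
        Dims4{1, 3, 1024, 1024},
        Dims4{1, 3, 2048, 2048}   // <- tweaking this changes device memory usage
    };

    for (const auto& maxShape : maxShapes)
    {
        auto* profile = builder.createOptimizationProfile();
        profile->setDimensions("input", OptProfileSelector::kMIN, Dims4{1, 3, 256, 256});
        profile->setDimensions("input", OptProfileSelector::kOPT, Dims4{1, 3, 512, 512});
        profile->setDimensions("input", OptProfileSelector::kMAX, maxShape);
        config.addOptimizationProfile(profile);
    }
}

// At runtime (after deserializing the engine):
//   context->setOptimizationProfileAsync(1, stream);        // 2nd profile
//   context->setBindingDimensions(0, Dims4{1, 3, 512, 512}); // actual input shape
// The observed device memory (and engine->getDeviceMemorySize()) still grows with
// the MAX shape of the 3rd profile, even though that profile is never selected.
```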

Environment

TensorRT Version: v8.2.2.1
GPU Type: RTX 2070
Nvidia Driver Version: 470.63.01
CUDA Version: v11.4.3
CUDNN Version: v8.2.4.15

Any news on the topic?

Hi,

Sorry for the delay in addressing this issue. We will get back to you in 1 or 2 days.
Thank you.

Hi,

The following may help you:
https://docs.nvidia.com/deeplearning/tensorrt/api/c_api/classnvinfer1_1_1_i_cuda_engine.html#ab431cff77cee511d37747920f6c2276f

Thank you.

Hi, @spolisetty

Thank you for replying.

I’ve looked into the suggested API. As I understand it, I’ll need to use the IExecutionContext::setDeviceMemory() [1] method to provide my own device memory buffer. The documentation of this method states that I need at least ICudaEngine::getDeviceMemorySize() [2] bytes. But getDeviceMemorySize() does not take any parameter that would let me select a particular optimization profile, and the API for choosing an optimization profile lives in the IExecutionContext class [3], not in ICudaEngine.

In short, it seems that I can use my own device buffer, but its size does not depend on the optimization profile (it follows the same worst-case-across-all-profiles logic). So I’m effectively in the same situation as without custom device buffers. Or did I miss something?

[1]TensorRT: nvinfer1::IExecutionContext Class Reference
[2]TensorRT: nvinfer1::ICudaEngine Class Reference
[3]TensorRT: nvinfer1::IExecutionContext Class Reference
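
For completeness, this is roughly the pattern I’m describing (a sketch; error handling is omitted, and the profile index is just an example):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

nvinfer1::IExecutionContext* makeContext(nvinfer1::ICudaEngine* engine, cudaStream_t stream)
{
    // The only size the API exposes: the worst case across ALL profiles.
    const size_t workspaceSize = engine->getDeviceMemorySize();

    void* workspace = nullptr;
    cudaMalloc(&workspace, workspaceSize);

    // Create a context without its own device memory, then plug in my buffer.
    auto* context = engine->createExecutionContextWithoutDeviceMemory();
    context->setDeviceMemory(workspace);

    // Profile selection happens on the context, after the size is already fixed.
    context->setOptimizationProfileAsync(1, stream);  // e.g. the 2nd profile
    return context;
}
```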

Hi,

Good question, and yes, you’re right. Always allocating for the worst case is a simplifying assumption in the executor: we assume the application needs to budget memory for its worst expected case. But this assumption isn’t always valid.

Thank you.

Hi, @spolisetty

Thank you for the information. Can we expect any improvements on the subject in the foreseeable future?

Yes, this may be improved in future releases.

Thank you.

Okay, thank you!