Can I limit computational resource consumption at the TensorRT engine building stage?

Description

Can I limit the computational resource consumption of each engine (SM count, grid sizes, block sizes, memory limits, etc.) at the TensorRT engine building stage? My motivation is to run multiple TensorRT engines in parallel on multiple streams, so that the kernels in each stream can be scheduled to execute concurrently.
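For clarity, here is a minimal sketch of the multi-stream setup I have in mind, using the Python API. The engine paths are placeholders and the device bindings are assumed to be allocated elsewhere:

```python
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates a CUDA context on the default device

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

def load_engine(path):
    with open(path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())

# "a.engine" and "b.engine" are placeholder paths to prebuilt engines.
engines = [load_engine(p) for p in ("a.engine", "b.engine")]
contexts = [e.create_execution_context() for e in engines]
streams = [cuda.Stream() for _ in engines]

# bindings_a / bindings_b: lists of device pointers prepared elsewhere.
# for ctx, stream, bindings in zip(contexts, streams, (bindings_a, bindings_b)):
#     ctx.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
# for stream in streams:
#     stream.synchronize()
```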

Environment

TensorRT Version: 8.0
GPU Type: RTX2080Ti, Jetson AGX Orin
CUDA Version: 11.x
Operating System + Version: Ubuntu 18, Ubuntu 20

Hi,

This looks like a Jetson issue. Please refer to the samples below in case they are useful.

For any further assistance, we will move this post to the Jetson-related forum.

Thanks!

Hi,
This is not only a Jetson issue; it is a general TensorRT question.
I want to limit the resource consumption of a TensorRT engine when building it, on both server and edge GPUs.
Could you please help? Thank you.

Hi,

TensorRT selects grid and block sizes internally to maximize GPU throughput for inference, so these cannot be controlled directly. Memory usage can be restricted by setting the builder's workspace size limit. Please refer to the TensorRT documentation for more details.
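As an illustration, a minimal sketch with the Python API (the "model.onnx" path and the 1 GiB value are only placeholders):

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
# TensorRT 8.0: cap the scratch memory the builder may use for tactic selection.
config.max_workspace_size = 1 << 30  # 1 GiB
# On newer releases (8.4+) the equivalent call is:
# config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)

serialized_engine = builder.build_serialized_network(network, config)
```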

TensorRT also lets you specify one or more optimization profiles, each defining a minimum/optimum/maximum range for dynamic input shapes, including the batch dimension. This can be used to bound the batch size, which in turn affects GPU memory usage and computational load.
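Continuing the sketch above, an optimization profile that bounds the batch dimension might look like this; the input name "input" and the (N, 3, 224, 224) shape are assumptions about the model:

```python
# Assumes the network input "input" has a dynamic batch dimension, i.e. (-1, 3, 224, 224).
profile = builder.create_optimization_profile()
profile.set_shape("input",
                  (1, 3, 224, 224),   # min
                  (4, 3, 224, 224),   # opt: kernels are tuned for this shape
                  (8, 3, 224, 224))   # max: the engine will not accept larger batches
config.add_optimization_profile(profile)
```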

You can also control which GPU is used for inference by setting the CUDA device before creating the execution context.
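For example, one way to do this in Python (device index 1 is just an example):

```python
import os

# Option 1: restrict device visibility before CUDA initializes
# (must run before any CUDA library is loaded or used).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

# Option 2: with pycuda, create the CUDA context explicitly on a chosen device,
# then deserialize the engine and create the execution context inside it.
# import pycuda.driver as cuda
# cuda.init()
# ctx = cuda.Device(1).make_context()
# ...  # runtime.deserialize_cuda_engine(...), engine.create_execution_context(), inference
# ctx.pop()
```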

For deploying multiple engines, we recommend the Triton Inference Server, which manages GPU resources internally and gives the best performance.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Thank you.