what is `Kernel Auto-Tuning` and `Multi-Stream Execution`?

Hello, the page https://developer.nvidia.com/tensorrt lists several optimization and performance features of TensorRT. I do not understand Kernel Auto-Tuning and Multi-Stream Execution:

  1. What does “Selects best data layers and algorithms based on target GPU platform” mean?

  2. “Scalable design to process multiple input streams in parallel”: is this stream a cudaStream?

Can you give a detailed explanation, or a URL with detailed material? I cannot find this information in https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide

Hello,

Regarding question 1, “best layers” means TensorRT performs several transformations and optimizations on the neural network graph. First, layers with unused outputs are eliminated to avoid unnecessary computation. Next, where possible, convolution, bias, and ReLU layers are fused into a single layer. Layer fusion improves the efficiency of running TensorRT-optimized networks on the GPU. Another transformation is horizontal layer fusion, which improves performance by combining layers that take the same source tensor and apply the same operations with similar parameters into a single, larger layer.
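The vertical fusion described above (conv + bias + ReLU collapsed into one kernel) changes scheduling, not the math. A minimal NumPy sketch of my own (not TensorRT code; the layer is modeled as a matrix multiply for brevity) showing that the fused computation matches the three separate passes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)   # batch of activations
W = rng.standard_normal((8, 3)).astype(np.float32)   # "conv" weights as a matmul
b = rng.standard_normal(3).astype(np.float32)        # bias

# Unfused: three separate passes over memory
# (three kernel launches on a GPU)
y = x @ W              # convolution-like layer
y = y + b              # bias layer
y = np.maximum(y, 0)   # ReLU layer

# Fused: one pass produces the same result (one kernel launch)
y_fused = np.maximum(x @ W + b, 0)

assert np.allclose(y, y_fused)
```

On a GPU the fused version reads and writes global memory once instead of three times, which is where the speedup comes from.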

The TRT builder generates an engine tuned for the target GPU: during the build it times candidate kernel implementations and tensor layouts for each layer on the actual hardware and selects the fastest. This is what “kernel auto-tuning” refers to.

If the application and GPU allow, TRT will additionally optimize the network to run in lower precision (FP16 or INT8), further increasing performance and reducing memory requirements.
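As a rough illustration of why lower precision is often acceptable (a NumPy sketch, not TensorRT's actual calibration procedure): casting weights and activations to FP16 halves the memory footprint while typically perturbing the result only slightly:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((32, 64)).astype(np.float32)  # activations
W = rng.standard_normal((64, 16)).astype(np.float32)  # weights

y32 = x @ W                                            # full-precision result
y16 = (x.astype(np.float16) @ W.astype(np.float16)).astype(np.float32)

# Relative error of the FP16 computation vs. the FP32 reference
rel_err = np.abs(y32 - y16).max() / np.abs(y32).max()
print(f"max relative error in FP16: {rel_err:.5f}")
```

TensorRT's INT8 mode goes further and uses calibration data to choose per-tensor scaling factors, which this sketch does not model.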

Regarding question 2: yes, CUDA streams.

regards
NVES

@NVES thanks for your reply.

question 1: when I create an engine, if I do not set the builder.max_batch_size parameter, will the TRT builder choose the optimal batch size for the target GPU platform, as you said? I just ran an experiment: after creating an engine without setting builder.max_batch_size, I found the engine cannot infer two or more images, only one.

question 2: So can I create multiple engines with different input sizes and run them on multiple cudastreams? Can you give an example of how to use multiple cudastreams to create and run multiple engines with different input sizes? Previously, I successfully created multiple engines with different input sizes and used one single cudastream to infer many images of different sizes, but that is a single-stream operation: my images have to be matched one by one to the corresponding engine according to input size, so it is not fast. How can I improve this?

Looking forward to your advice…

Hello,

Batch size indicates how many images are processed at once. If you want your engine to infer more than one image at a time, you need to set max_batch_size when building the engine.
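For question 2, a common pattern is to give each engine its own execution context and its own CUDA stream, then launch inference with `execute_async` so the launches can overlap on the GPU. A sketch using the TensorRT Python API and PyCUDA (GPU-only, not runnable here; `engine_small`/`engine_large` and the device buffers are hypothetical names standing in for engines and allocations you create yourself):

```python
# Sketch only: assumes a GPU with TensorRT and PyCUDA installed, and that
# engine_small / engine_large are already-built ICudaEngine objects for the
# two input sizes (hypothetical names from the question above).
import pycuda.autoinit          # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

def make_worker(engine):
    """One execution context plus one dedicated CUDA stream per engine."""
    context = engine.create_execution_context()
    stream = cuda.Stream()
    return context, stream

ctx_s, stream_s = make_worker(engine_small)
ctx_l, stream_l = make_worker(engine_large)

# d_in_* / d_out_* would be device buffers allocated with cuda.mem_alloc()
# to match each engine's bindings (allocation omitted for brevity).
# execute_async enqueues work without blocking the CPU, so the two engines
# can run concurrently, each on its own stream:
ctx_s.execute_async(batch_size=8, bindings=[int(d_in_s), int(d_out_s)],
                    stream_handle=stream_s.handle)
ctx_l.execute_async(batch_size=8, bindings=[int(d_in_l), int(d_out_l)],
                    stream_handle=stream_l.handle)

# Synchronize only when the results are actually needed:
stream_s.synchronize()
stream_l.synchronize()
```

Note that each engine must be built with `builder.max_batch_size` at least as large as the `batch_size` passed to `execute_async`, and whether the two streams truly overlap depends on how much of the GPU each engine occupies.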

@NVES can you explain what “Selects best data layers and algorithms based on target GPU platform” means? I still do not understand it.