what is `Kernel Auto-Tuning` and `Multi-Stream Execution`?

Hello, the page https://developer.nvidia.com/tensorrt lists several optimization and performance features of TensorRT. I do not understand Kernel Auto-Tuning and Multi-Stream Execution:

  1. What does “Selects best data layers and algorithms based on target GPU platform” mean?

  2. “Scalable design to process multiple input streams in parallel”: is this stream a cudaStream?

Can you give a detailed explanation, or a URL with detailed material? I cannot find this information in https://docs.nvidia.com/deeplearning/sdk/tensorrt-developer-guide

Hello,

Regarding question 1, “best layers” means TensorRT performs several transformations and optimizations on the neural network graph. First, layers with unused outputs are eliminated to avoid unnecessary computation. Next, where possible, convolution, bias, and ReLU layers are fused into a single layer. Layer fusion improves the efficiency of running TensorRT-optimized networks on the GPU. Another transformation is horizontal layer fusion, which improves performance by combining layers that take the same source tensor and apply the same operations with similar parameters into a single, larger layer.
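The vertical fusion described above (conv + bias + ReLU collapsed into one kernel) changes scheduling, not the math. A minimal NumPy sketch of my own (not TensorRT code; the layer is modeled as a matrix multiply for brevity) showing that the fused computation matches the three separate passes:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)   # batch of activations
W = rng.standard_normal((8, 3)).astype(np.float32)   # "conv" weights as a matmul
b = rng.standard_normal(3).astype(np.float32)        # bias

# Unfused: three separate passes over memory
# (three kernel launches on a GPU)
y = x @ W              # convolution-like layer
y = y + b              # bias layer
y = np.maximum(y, 0)   # ReLU layer

# Fused: one pass produces the same result (one kernel launch)
y_fused = np.maximum(x @ W + b, 0)

assert np.allclose(y, y_fused)
```

On a GPU the fused version reads and writes global memory once instead of three times, which is where the speedup comes from.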

The TRT builder generates an engine tuned for the target GPU: during the build it times candidate kernel implementations and tensor layouts for each layer on the actual hardware and selects the fastest. This is what “kernel auto-tuning” refers to.

If the application and GPU allow, TRT will additionally optimize the network to run in lower precision (FP16 or INT8), further increasing performance and reducing memory requirements.
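As a rough illustration of why lower precision is often acceptable (a NumPy sketch, not TensorRT's actual calibration procedure): casting weights and activations to FP16 halves the memory footprint while typically perturbing the result only slightly:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((32, 64)).astype(np.float32)  # activations
W = rng.standard_normal((64, 16)).astype(np.float32)  # weights

y32 = x @ W                                            # full-precision result
y16 = (x.astype(np.float16) @ W.astype(np.float16)).astype(np.float32)

# Relative error of the FP16 computation vs. the FP32 reference
rel_err = np.abs(y32 - y16).max() / np.abs(y32).max()
print(f"max relative error in FP16: {rel_err:.5f}")
```

TensorRT's INT8 mode goes further and uses calibration data to choose per-tensor scaling factors, which this sketch does not model.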

Regarding question 2: yes, CUDA streams.

regards
NVES

@NVES thanks for your reply.

question 1: when I create an engine, if I do not set the builder.max_batch_size parameter, will the TRT builder choose the optimal batch size for the target GPU platform, as you said? I just ran an experiment: after creating an engine without setting builder.max_batch_size, I found the engine cannot infer two or more images, only one.

question 2: So can I create multiple engines with different input sizes and run them on multiple cudastreams? Can you give an example of how to use multiple cudastreams to create and run multiple engines with different input sizes? Previously, I successfully created multiple engines with different input sizes and used one single cudastream to infer many images of different sizes, but that is a single-stream operation: my images have to be matched one by one to the corresponding engine according to input size, so it is not fast. How can I improve this?

Looking forward to your advice…

Hello,

Batch size indicates how many images are processed at once. If you want your engine to infer more than one image at a time, you need to set max_batch_size when building the engine.
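For question 2, a common pattern is to give each engine its own execution context and its own CUDA stream, then launch inference with `execute_async` so the launches can overlap on the GPU. A sketch using the TensorRT Python API and PyCUDA (GPU-only, not runnable here; `engine_small`/`engine_large` and the device buffers are hypothetical names standing in for engines and allocations you create yourself):

```python
# Sketch only: assumes a GPU with TensorRT and PyCUDA installed, and that
# engine_small / engine_large are already-built ICudaEngine objects for the
# two input sizes (hypothetical names from the question above).
import pycuda.autoinit          # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

def make_worker(engine):
    """One execution context plus one dedicated CUDA stream per engine."""
    context = engine.create_execution_context()
    stream = cuda.Stream()
    return context, stream

ctx_s, stream_s = make_worker(engine_small)
ctx_l, stream_l = make_worker(engine_large)

# d_in_* / d_out_* would be device buffers allocated with cuda.mem_alloc()
# to match each engine's bindings (allocation omitted for brevity).
# execute_async enqueues work without blocking the CPU, so the two engines
# can run concurrently, each on its own stream:
ctx_s.execute_async(batch_size=8, bindings=[int(d_in_s), int(d_out_s)],
                    stream_handle=stream_s.handle)
ctx_l.execute_async(batch_size=8, bindings=[int(d_in_l), int(d_out_l)],
                    stream_handle=stream_l.handle)

# Synchronize only when the results are actually needed:
stream_s.synchronize()
stream_l.synchronize()
```

Note that each engine must be built with `builder.max_batch_size` at least as large as the `batch_size` passed to `execute_async`, and whether the two streams truly overlap depends on how much of the GPU each engine occupies.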

@NVES can you explain what “Selects best data layers and algorithms based on target GPU platform” means? I still do not understand it.