Performance Comparison: Multiple CUDA Streams with Multiple TensorRT Models vs. Combining Multiple TensorRT Models

I am working on optimizing inference performance using TensorRT and CUDA streams. I have multiple TensorRT models that need to be executed in my application. I would like to understand the performance implications of two different approaches:

1. Using multiple CUDA streams with separate TensorRT models for parallel execution.
2. Combining multiple TensorRT models into a single model.
I understand that combining the models into a single engine can improve resource utilization by reducing memory overhead and eliminating intermediate data transfers between models, which is helpful when hardware resources are limited. However, I also know that using multiple CUDA streams enables parallel execution, which can improve overall throughput and reduce latency.
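
For reference, below is a rough sketch of how I imagine the multi-stream variant (approach 1) being set up. It uses the older bindings-based TensorRT Python API (`execute_async_v2`) and pycuda for stream and buffer management; the engine file names are placeholders, static input shapes are assumed, and the host-to-device input copies are omitted for brevity:

```python
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a default CUDA context)

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)

def load_engine(path):
    # Deserialize a prebuilt TensorRT engine from disk.
    with open(path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())

def make_context_and_buffers(engine):
    # One execution context plus one device buffer per binding.
    # Each engine gets private buffers so the two streams never share memory.
    context = engine.create_execution_context()
    device_bufs = []  # keep DeviceAllocation objects alive
    bindings = []     # raw device pointers handed to TensorRT
    for i in range(engine.num_bindings):
        nbytes = trt.volume(engine.get_binding_shape(i)) * \
            np.dtype(trt.nptype(engine.get_binding_dtype(i))).itemsize
        buf = cuda.mem_alloc(nbytes)
        device_bufs.append(buf)
        bindings.append(int(buf))
    return context, bindings, device_bufs

# Placeholder engine file names for the two independent models.
engine_a = load_engine("model_a.engine")
engine_b = load_engine("model_b.engine")
ctx_a, bindings_a, bufs_a = make_context_and_buffers(engine_a)
ctx_b, bindings_b, bufs_b = make_context_and_buffers(engine_b)

# One stream per model so the two inferences can overlap on the GPU
# (actual overlap depends on free SMs and memory bandwidth).
stream_a, stream_b = cuda.Stream(), cuda.Stream()

# Enqueue both inferences back to back without synchronizing in between.
# (Filling the input buffers, e.g. with cuda.memcpy_htod_async, is omitted.)
ctx_a.execute_async_v2(bindings=bindings_a, stream_handle=stream_a.handle)
ctx_b.execute_async_v2(bindings=bindings_b, stream_handle=stream_b.handle)

# Wait for both models to finish before reading outputs back.
stream_a.synchronize()
stream_b.synchronize()
```

On newer TensorRT versions (8.5 and later) the equivalent setup would use `set_tensor_address` plus `execute_async_v3` instead of the bindings list, but the structure is the same: one execution context, one stream, and separate device buffers per model.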

Considering factors such as model independence, available hardware resources, the structure of the inference pipeline, model complexity, and memory bandwidth, which approach is generally faster and more efficient for inference?

Any insights, benchmarks, or best practices related to these approaches would be greatly appreciated. Thank you!