Performance Comparison: Multiple CUDA Streams with Multiple TensorRT Models vs. Combining Multiple TensorRT Models

I am working on optimizing inference performance using TensorRT and CUDA streams. I have multiple TensorRT models that need to be executed in my application. I would like to understand the performance implications of two different approaches:

1. Using multiple CUDA streams with separate TensorRT models for parallel execution.
2. Combining multiple TensorRT models into a single model.
I understand that combining the models into a single engine can improve resource utilization by reducing memory overhead and eliminating intermediate data transfers between models, which is helpful when hardware resources are limited. However, I also know that using multiple CUDA streams enables parallel execution, which can improve overall throughput and reduce latency.
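
For reference, below is a rough sketch of how I imagine the multi-stream variant (approach 1) being set up. It uses the older bindings-based TensorRT Python API (`execute_async_v2`) and pycuda for stream and buffer management; the engine file names are placeholders, static input shapes are assumed, and the host-to-device input copies are omitted for brevity:

```python
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a default CUDA context)

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)

def load_engine(path):
    # Deserialize a prebuilt TensorRT engine from disk.
    with open(path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())

def make_context_and_buffers(engine):
    # One execution context plus one device buffer per binding.
    # Each engine gets private buffers so the two streams never share memory.
    context = engine.create_execution_context()
    device_bufs = []  # keep DeviceAllocation objects alive
    bindings = []     # raw device pointers handed to TensorRT
    for i in range(engine.num_bindings):
        nbytes = trt.volume(engine.get_binding_shape(i)) * \
            np.dtype(trt.nptype(engine.get_binding_dtype(i))).itemsize
        buf = cuda.mem_alloc(nbytes)
        device_bufs.append(buf)
        bindings.append(int(buf))
    return context, bindings, device_bufs

# Placeholder engine file names for the two independent models.
engine_a = load_engine("model_a.engine")
engine_b = load_engine("model_b.engine")
ctx_a, bindings_a, bufs_a = make_context_and_buffers(engine_a)
ctx_b, bindings_b, bufs_b = make_context_and_buffers(engine_b)

# One stream per model so the two inferences can overlap on the GPU
# (actual overlap depends on free SMs and memory bandwidth).
stream_a, stream_b = cuda.Stream(), cuda.Stream()

# Enqueue both inferences back to back without synchronizing in between.
# (Filling the input buffers, e.g. with cuda.memcpy_htod_async, is omitted.)
ctx_a.execute_async_v2(bindings=bindings_a, stream_handle=stream_a.handle)
ctx_b.execute_async_v2(bindings=bindings_b, stream_handle=stream_b.handle)

# Wait for both models to finish before reading outputs back.
stream_a.synchronize()
stream_b.synchronize()
```

On newer TensorRT versions (8.5 and later) the equivalent setup would use `set_tensor_address` plus `execute_async_v3` instead of the bindings list, but the structure is the same: one execution context, one stream, and separate device buffers per model.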

Considering factors such as model independence, available hardware resources, the structure of the inference pipeline, model complexity, and memory bandwidth, which approach is generally faster and more efficient for inference?

Any insights, benchmarks, or best practices related to these approaches would be greatly appreciated. Thank you!