I have an application running a classifier and a detector with TensorRT on a T4. Performance is good when I run a single stream, but I can't reproduce the same performance when I use multiple streams.
I'm developing in C++.
When I use 1 stream, I get the following:
GPU memory: 1335 MB
Average time taken for detection: 4.64 ms
CPU Usage: 19.8%
GPU utilization is around 70% while the detector is running, but can spike above 90% when the classifier is called.
When I use 2 streams, I get the following:
GPU memory: 2670 MB
Average time taken for detection: 38 ms
CPU Usage: 54% (about 27% per stream)
GPU utilization is at 100%
With 1 stream, TensorRT inference (for both detection and classification) is fast and I can process up to 40 frames per second. With 2 streams, however, there is a drastic slowdown.