I have an application running a classifier and a detector with TensorRT on a T4. Performance is good when I run a single stream, but I can't reproduce the same performance when I use multiple streams.
I'm developing in C++.
When I use 1 stream, I get the following:
GPU memory: 1335 MB
Average time taken for detection: 4.64 ms
CPU Usage: 19.8%
GPU utilization is around 70% while the detector is running, but can spike above 90% when the classifier is called.
When I use 2 streams, I get the following:
GPU memory: 2670 MB
Average time taken for detection: 38 ms
CPU Usage: 54% (about 27% per stream)
GPU utilization is at 100%
With 1 stream, TensorRT inference (for both detection and classification) is fast and I can process up to 40 frames per second. With 2 streams, however, there is a drastic slowdown.