Multiple threads running inference are causing a slowdown


We get cv::Mat frames using this OpenCV gstreamer pipeline:
filesrc location="./video.mp4" ! qtdemux ! h264parse ! queue ! nvv4l2decoder ! queue ! nvvideoconvert ! video/x-raw,format=BGRx ! videorate max-rate=30 ! videoscale ! video/x-raw,format=BGRx,width=1920,height=1080 ! queue ! videoconvert ! video/x-raw,format=BGR ! appsink

We construct the model with Yolov7::Yolov7, pass the cv::Mat frames to Yolov7::preProcess, and then run Yolov7::infer and Yolov7::PostProcess.

The inference works and everything runs fine at this point (around 31 seconds to process a 31-second video).
When we then spin up another thread that does the same thing in parallel, the combined process takes around 6 seconds longer than with a single thread.
For every additional thread after that, there is an additional 20-25 second increase in processing time.
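
For anyone who wants to reproduce the scaling numbers above, here is a minimal sketch of how the wall-clock time for N parallel workers could be measured. The name timeParallelRuns and the work callback are hypothetical and not part of the Yolov7 library; the callback stands in for one thread's full decode, preProcess, infer and PostProcess loop.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not part of the Yolov7 library).
// Runs `work(threadIndex)` on `numThreads` threads in parallel and returns the
// wall-clock time in seconds until every thread has finished.
double timeParallelRuns(int numThreads, const std::function<void(int)>& work) {
    assert(numThreads > 0);

    const auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    threads.reserve(static_cast<std::size_t>(numThreads));
    for (int i = 0; i < numThreads; ++i) {
        // Each thread gets its own copy of the callback and its own index.
        threads.emplace_back(work, i);
    }
    for (auto& t : threads) {
        t.join();
    }

    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Calling this with 1, 2 and 3 threads, each processing the same video, and comparing the results with the 31-second video length shows how far each configuration is from real time.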

After further examination, the culprit lies somewhere within the method enqueueV2 called in Yolov7.cpp.
I traced its origin via NvInfer.h and NvInferRuntime.h to NvInferImpl.h.
There, the class VExecutionContext (declared as class VExecutionContext : public VRoot) has the method
virtual bool enqueueV2(void* const* bindings, cudaStream_t stream, cudaEvent_t* inputConsumed) noexcept = 0;
From there I can't find any further information or a definition that shows how it works and why it would be slowing down the overall process.
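
To narrow down which stage the time goes to, a per-stage timer along these lines could be used. This is only a sketch: StageTimes and ScopedStageTimer are hypothetical helpers, not part of TensorRT or the Yolov7 library, and the stage names are just labels.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical helper: accumulates wall-clock seconds per named stage.
// A single instance can be shared by all worker threads.
class StageTimes {
public:
    void add(const std::string& stage, double seconds) {
        assert(seconds >= 0.0);
        std::lock_guard<std::mutex> lock(mutex_);
        totals_[stage] += seconds;
    }

    // Returns the accumulated time for `stage`, or 0.0 if it was never timed.
    double total(const std::string& stage) const {
        std::lock_guard<std::mutex> lock(mutex_);
        const auto it = totals_.find(stage);
        return it == totals_.end() ? 0.0 : it->second;
    }

private:
    mutable std::mutex mutex_;
    std::map<std::string, double> totals_;
};

// Hypothetical RAII timer: measures from construction to destruction and
// adds the elapsed time to the given stage.
class ScopedStageTimer {
public:
    ScopedStageTimer(StageTimes& times, std::string stage)
        : times_(times),
          stage_(std::move(stage)),
          start_(std::chrono::steady_clock::now()) {}

    ~ScopedStageTimer() {
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        times_.add(stage_, elapsed.count());
    }

    ScopedStageTimer(const ScopedStageTimer&) = delete;
    ScopedStageTimer& operator=(const ScopedStageTimer&) = delete;

private:
    StageTimes& times_;
    std::string stage_;
    std::chrono::steady_clock::time_point start_;
};
```

Each of the preProcess, infer and PostProcess calls would be wrapped in its own scope containing a ScopedStageTimer. Since enqueueV2 only queues work on the given CUDA stream, the scope around the inference call would need to extend to the point where the stream is synchronized in order to capture the GPU time as well.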

Any idea of why this is happening?


TensorRT Version:
TensorRT 8.4.1
GPU Type:
Jetson Orin AGX
Nvidia Driver Version:
CUDA Version:
Cuda SDK 11.4.14
CUDNN Version:
cuDNN 8.4.1
Operating System + Version:
Ubuntu 20.04.6 LTS - JetPack 5.0.2-b231

Relevant Files

We are using the library called Yolov7 made by an Nvidia employee.


The below links might be useful for you.

For multi-threading/streaming, we suggest using DeepStream or Triton.

For more details, we recommend raising the query in the DeepStream forum, or in the issues section of the Triton Inference Server GitHub repository.