Description
We get cv::Mat frames using this OpenCV GStreamer pipeline:
filesrc location="./video.mp4" ! qtdemux ! h264parse ! queue ! nvv4l2decoder ! queue ! nvvideoconvert ! video/x-raw,format=BGRx ! videorate max-rate=30 ! videoscale ! video/x-raw,format=BGRx,width=1920,height=1080 ! queue ! videoconvert ! video/x-raw,format=BGR ! appsink
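For context, the frames are pulled from this pipeline through OpenCV's GStreamer backend, roughly as follows (a minimal sketch; the pipeline argument is the string quoted above):

#include <opencv2/opencv.hpp>
#include <stdexcept>
#include <string>

// Open the GStreamer pipeline above via OpenCV's GStreamer backend and hand back the capture.
cv::VideoCapture openPipeline(const std::string& pipeline)
{
    cv::VideoCapture cap(pipeline, cv::CAP_GSTREAMER);
    if (!cap.isOpened())
        throw std::runtime_error("failed to open GStreamer pipeline");
    return cap;
}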
We specify the model using Yolov7::Yolov7, after which we pass the cv::Mat frames to Yolov7::preProcess, then run Yolov7::infer and Yolov7::PostProcess.
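In outline, a single worker looks roughly like this. The Yolov7 constructor and method signatures below are written from memory of the sample and may differ slightly; treat it as an illustration of the call order rather than a copy of our code (openPipeline is the helper sketched above):

#include <string>
#include <vector>
#include <opencv2/opencv.hpp>
#include "Yolov7.h"  // header name assumed; the tensorrt_yolov7 sample class

void processVideo(const std::string& pipeline, const std::string& enginePath)
{
    Yolov7 yolov7(enginePath);                   // assumed: loads/deserializes the TensorRT engine
    cv::VideoCapture cap = openPipeline(pipeline);

    cv::Mat frame;
    while (cap.read(frame)) {
        std::vector<cv::Mat> batch{frame};
        yolov7.preProcess(batch);                // assumed: letterbox/normalize and copy to device
        yolov7.infer();                          // this is where enqueueV2 is ultimately called
        auto detections = yolov7.PostProcess();  // assumed: copy back, decode boxes, NMS
        (void)detections;                        // ...consume detections...
    }
}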
The inference works and everything runs fine at this point (around 31 seconds to process a 31-second video).
When we then spin up a second thread that does the same thing in parallel, the combined run takes around 6 seconds longer than with a single thread.
Every additional thread after that adds another 20-25 seconds of processing time.
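The extra threads run the same flow completely independently, roughly like this; nothing is shared between them except the GPU itself. processVideo is the sketch above, and makePipeline is a hypothetical helper that drops the file name into the pipeline string quoted at the top:

#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: substitutes the video file into the GStreamer pipeline shown above.
static std::string makePipeline(const std::string& file)
{
    return "filesrc location=\"" + file + "\" ! qtdemux ! h264parse ! queue ! "
           "nvv4l2decoder ! queue ! nvvideoconvert ! video/x-raw,format=BGRx ! "
           "videorate max-rate=30 ! videoscale ! "
           "video/x-raw,format=BGRx,width=1920,height=1080 ! queue ! "
           "videoconvert ! video/x-raw,format=BGR ! appsink";
}

int main()
{
    const std::string enginePath = "yolov7.engine";  // hypothetical engine path
    const std::vector<std::string> videos = {"./video.mp4", "./video2.mp4"};

    std::vector<std::thread> workers;
    for (const auto& video : videos)
        workers.emplace_back(processVideo, makePipeline(video), enginePath);  // processVideo from the sketch above
    for (auto& t : workers)
        t.join();
    return 0;
}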
After further examination, the culprit appears to lie somewhere within the method enqueueV2 called in Yolov7.cpp. I traced its origin via NvInfer.h and NvInferRuntime.h to NvInferImpl.h.
There the class VExecutionContext : public VRoot declares the method
virtual bool enqueueV2(void* const* bindings, cudaStream_t stream, cudaEvent_t* inputConsumed) noexcept = 0;
From there I can't find any further information or a definition of how it works, or why it would slow down the overall process.
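For completeness, the public IExecutionContext::enqueueV2 in NvInferRuntime.h looks (from memory, so the exact wrapper may differ) like a thin forwarder to that virtual, which is why the trail ends in the headers; the actual implementation is compiled into libnvinfer and is not visible there:

// NvInferRuntime.h (paraphrased): the public call just forwards into the
// opaque apiv::VExecutionContext implementation pointer.
bool enqueueV2(void* const* bindings, cudaStream_t stream, cudaEvent_t* inputConsumed) noexcept
{
    return mImpl->enqueueV2(bindings, stream, inputConsumed);
}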
Any idea why this is happening?
Environment
TensorRT Version: 8.4.1
GPU Type: Jetson Orin AGX
Nvidia Driver Version:
CUDA Version: 11.4.14 (CUDA SDK)
CUDNN Version: 8.4.1
Operating System + Version: Ubuntu 20.04.6 LTS - JetPack 5.0.2-b231
Relevant Files
We are using the Yolov7 library written by an NVIDIA employee.