We get cv::Mat frames using this OpenCV GStreamer pipeline:
filesrc location="./video.mp4" ! qtdemux ! h264parse ! queue ! nvv4l2decoder ! queue ! nvvideoconvert ! video/x-raw,format=BGRx ! videorate max-rate=30 ! videoscale ! video/x-raw,format=BGRx,width=1920,height=1080 ! queue ! videoconvert ! video/x-raw,format=BGR ! appsink
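For reference, a minimal version of the capture side looks like this (paths and error handling simplified):

```cpp
#include <opencv2/opencv.hpp>
#include <string>

int main() {
    // Same pipeline as above, passed to OpenCV's GStreamer backend.
    const std::string pipeline =
        "filesrc location=\"./video.mp4\" ! qtdemux ! h264parse ! queue ! "
        "nvv4l2decoder ! queue ! nvvideoconvert ! video/x-raw,format=BGRx ! "
        "videorate max-rate=30 ! videoscale ! "
        "video/x-raw,format=BGRx,width=1920,height=1080 ! queue ! "
        "videoconvert ! video/x-raw,format=BGR ! appsink";

    cv::VideoCapture cap(pipeline, cv::CAP_GSTREAMER);
    if (!cap.isOpened()) {
        return -1;  // pipeline failed to open
    }

    cv::Mat frame;
    while (cap.read(frame)) {
        // frame is a 1920x1080 BGR cv::Mat, ready for preprocessing
    }
    return 0;
}
```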
We specify the model via Yolov7::Yolov7, after which we pass the cv::Mat frames to Yolov7::preProcess and then run inference.
The inference works and everything runs fine at this point (around 31 seconds to process a 31-second video).
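A stripped-down version of the per-thread flow looks roughly like this (the header name, constructor argument, and the names of the inference/post-processing methods are approximate and may not match the library exactly):

```cpp
#include <opencv2/opencv.hpp>
#include <string>
#include <vector>
#include "Yolov7.h"  // header name assumed

void runPipeline(const std::string& gstPipeline, const std::string& enginePath) {
    Yolov7 yolov7(enginePath);  // constructor argument (engine path) assumed
    cv::VideoCapture cap(gstPipeline, cv::CAP_GSTREAMER);

    cv::Mat frame;
    while (cap.read(frame)) {
        std::vector<cv::Mat> batch{frame};   // passing frames as a batch; exact signature assumed
        yolov7.preProcess(batch);            // prepare the frame(s) for the network
        yolov7.infer();                      // runs inference; method name assumed
        auto detections = yolov7.PostProcess();  // method name assumed
    }
}
```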
When we then spin up another thread that does the same thing in parallel, the combined process takes around 6 seconds longer than with a single thread.
For every additional thread after that, there is an additional 20-25 second increase in processing time.
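The threads are launched roughly like this, each one with its own Yolov7 instance and its own capture pipeline (runPipeline is the per-thread function sketched above; names and paths simplified):

```cpp
#include <string>
#include <thread>
#include <vector>

// per-thread worker from the previous sketch
void runPipeline(const std::string& gstPipeline, const std::string& enginePath);

int main() {
    const std::string gstPipeline = "filesrc location=\"./video.mp4\" ! ... ! appsink";  // same pipeline as above
    const std::string enginePath  = "./yolov7.engine";  // engine path assumed

    const int numThreads = 2;  // 1 thread: ~31 s, 2 threads: ~37 s, each additional thread adds 20-25 s
    std::vector<std::thread> workers;
    for (int i = 0; i < numThreads; ++i) {
        // each thread builds its own Yolov7 instance and opens its own pipeline
        workers.emplace_back(runPipeline, gstPipeline, enginePath);
    }
    for (auto& t : workers) {
        t.join();
    }
    return 0;
}
```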
After further examination, the culprit lies somewhere within the method enqueueV2, which is called during inference. I traced its origin to the TensorRT headers, where the class

class VExecutionContext : public VRoot

has the method
virtual bool enqueueV2(void* const* bindings, cudaStream_t stream, cudaEvent_t* inputConsumed) noexcept = 0;
From there I can't find any further information or definition of how it works, or why it would be slowing down the overall process.
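For context, the public counterpart of that method is nvinfer1::IExecutionContext::enqueueV2, and as far as I can tell an inference call boils down to something like this (engine deserialization and buffer setup omitted):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>

// Minimal sketch of an asynchronous TensorRT inference call.
// 'context' is an nvinfer1::IExecutionContext created from the deserialized engine,
// and 'bindings' holds the device pointers for the input/output tensors.
bool runInference(nvinfer1::IExecutionContext* context,
                  void** bindings,
                  cudaStream_t stream) {
    // Enqueue the inference work on the given CUDA stream; returns without waiting.
    if (!context->enqueueV2(bindings, stream, nullptr)) {
        return false;
    }
    // Block until the enqueued work has finished.
    cudaStreamSynchronize(stream);
    return true;
}
```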
Any idea why this is happening?
Device: Jetson AGX Orin
Nvidia Driver Version:
CUDA SDK: 11.4.14
Operating System + Version: Ubuntu 20.04.6 LTS - JetPack 5.0.2-b231
We are using the library called Yolov7, written by an NVIDIA employee.