In a child thread I am performing dense optical flow on a 1080p image:

with time_it("INF: convert image for OF"):
    with streamLeft:
        curFrame = vpi.asimage(np_img, vpi.Format.BGR8) \
            .convert(vpi.Format.NV12_ER, backend=vpi.Backend.CUDA) \
            .convert(vpi.Format.NV12_ER_BL, backend=vpi.Backend.VIC)

if prevFrame is not None:
    with time_it("INF: optical flow"):
        # Calculate the motion vectors from previous to current frame
        with vpi.Backend.NVENC:
            with streamLeft:
                motion_vectors = vpi.optflow_dense(prevFrame, curFrame, quality=vpi.OptFlowQuality.LOW)
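For context, `time_it` is not part of VPI but a custom timing helper; a minimal sketch of what it presumably looks like, producing the "label: N.NNNms" lines quoted later in the thread:

```python
import time
from contextlib import contextmanager

@contextmanager
def time_it(label):
    # Measure wall-clock time around the enclosed block and print it
    # in the "<label>: <ms>ms" format seen in the timing logs.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        print(f"{label}: {elapsed_ms:.3f}ms")
```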
My issue is that this drops the output FPS of the main thread from ~30fps to about ~12fps.
I understand the nvv4l2h264enc codec uses the NVENC chip, as does the dense optical flow; can the chip not be used by both in parallel? I am already using max power settings. What are the options for CPU video encoding instead?
Thanks AastaLLL, I can provide a link to view the RTMP video if necessary.
Also, we are not married to the OpenCV/GStreamer components. As long as we can send 1080p at 30-40fps to an RTMP/RTMPS endpoint, that's all we need, so other solutions are very welcome, including CPU encoding as a last resort.
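On the CPU-encoding fallback: one option is to keep cv2.VideoWriter but swap the NVENC-backed nvv4l2h264enc for the software x264enc element in the GStreamer pipeline string. A sketch of building such a pipeline; the RTMP URL, bitrate and keyframe interval here are placeholder assumptions, not benchmarked values:

```python
def cpu_rtmp_pipeline(rtmp_url, fps=30, bitrate_kbps=4000):
    """Build a GStreamer pipeline string for cv2.VideoWriter that encodes
    H.264 on the CPU with x264enc instead of nvv4l2h264enc."""
    return (
        "appsrc ! videoconvert ! "
        # zerolatency/ultrafast trades compression efficiency for low per-frame cost
        f"x264enc tune=zerolatency speed-preset=ultrafast "
        f"bitrate={bitrate_kbps} key-int-max={fps * 2} ! "
        "h264parse ! flvmux streamable=true ! "
        f"rtmpsink location={rtmp_url}"
    )

pipeline = cpu_rtmp_pipeline("rtmp://example.invalid/live/stream")
```

The string would then be passed as `cv2.VideoWriter(pipeline, cv2.CAP_GSTREAMER, 0, 30.0, (1920, 1080))`; whether an ultrafast x264 encode can actually sustain 1080p30 on the Jetson's CPU cores is something to measure rather than assume.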
We suppose you should be able to reproduce similar behavior with a video input/output.
If so, could you modify the sample to use video? It will be easier for our internal team to check.
Hi AastaLLL, oops, that was the wrong file! Here is the correct one. The behaviour is a collapse of the output FPS from ~40fps with no optical flow down to ~18fps with optical flow. I have not had an opportunity to convert it to a video output, as I am not yet clear on how that is done.
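For reference, the video-file conversion AastaLLL asks for can reuse the same cv2/GStreamer plumbing: a filesrc-based pipeline for cv2.VideoCapture and a filesink-based one for cv2.VideoWriter, in place of the camera and rtmpsink. A sketch of the pipeline strings; the paths are placeholders and this assumes an H.264 MP4 input, untested on the actual sample:

```python
def file_capture_pipeline(path):
    # Decode an H.264 MP4 file to raw BGR frames for
    # cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
    return (
        f"filesrc location={path} ! qtdemux ! h264parse ! avdec_h264 ! "
        "videoconvert ! video/x-raw,format=BGR ! appsink"
    )

def file_writer_pipeline(path, fps=30):
    # Encode appsrc frames back into an MP4 file for
    # cv2.VideoWriter(pipeline, cv2.CAP_GSTREAMER, 0, fps, (w, h))
    return (
        "appsrc ! videoconvert ! "
        f"x264enc speed-preset=ultrafast key-int-max={fps * 2} ! "
        f"h264parse ! qtmux ! filesink location={path}"
    )
```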
Could you share some info about which elapsed time we should focus on?
We tested the script with optical flow on and off. The performance varies across frames:
Optical Flow OFF
VC: upload to GPU (2): 0.301ms
VC: perp processing & sync (2): 0.541ms
VC: output GPU to CPU (1): 1.399ms
VC: put image on queue (2): 0.172ms
INF: get object off queue: 24.199ms
INF: convert image for OF: 5.673ms
VC: draw on rectangles: 6.136ms
INF: get object off queue: 0.148ms
VC: output to mux: 33.566ms
------
VC: upload to GPU (2): 0.354ms
VC: perp processing & sync (2): 0.590ms
INF: convert image for OF: 37.251ms
VC: output GPU to CPU (1): 2.402ms
VC: put image on queue (2): 0.089ms
INF: get object off queue: 23.352ms
VC: draw on rectangles: 0.409ms
INF: convert image for OF: 2.546ms
INF: get object off queue: 0.059ms
INF: convert image for OF: 9.984ms
VC: output to mux: 12.066ms
Optical Flow ON
VC: upload to GPU (2): 0.828ms
VC: perp processing & sync (2): 5.917ms
VC: output GPU to CPU (1): 3.388ms
VC: draw on rectangles: 0.369ms
INF: optical flow: 63.797ms
INF: get object off queue: 0.056ms
INF: convert image for OF: 0.884ms
VC: output to mux: 46.051ms
------
VC: upload to GPU (2): 0.222ms
VC: perp processing & sync (2): 0.546ms
VC: output GPU to CPU (1): 1.207ms
VC: draw on rectangles: 0.359ms
VC: output to mux: 12.077ms
Yes, “output to mux” is the OpenCV video writer with GStreamer, which as far as I know uses the NVENC chip to encode the frames.
With OF off, that output stays below 10ms and we get an output FPS on Mux (our video endpoint service) of ~42. With OF on, we hit maximums of 42ms+ every few frames and an FPS on Mux of ~20.
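Those numbers line up with a simple frame-budget calculation, assuming each stage blocks the loop that feeds the writer: a stage taking t ms per frame caps throughput at 1000/t fps, so the 42ms+ write spikes alone pull the ceiling below 24fps, and the ~64ms optical-flow calls, if serialized with encoding, cap it near 16fps:

```python
def max_fps(per_frame_ms):
    # Upper bound on throughput if this stage runs serially once per frame
    return 1000.0 / per_frame_ms

print(round(max_fps(42.0), 1))  # writer spikes with OF on -> ~23.8fps ceiling
print(round(max_fps(63.8), 1))  # observed optical-flow time -> ~15.7fps ceiling
```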
It seems there are several CPU ↔ GPU buffer transfers in your pipeline.
Maybe you can try our jetson-utils, which can read the camera directly into a GPU buffer, to see if it helps.
I didn’t realise there were several CPU-GPU transfers; I thought they only happened when I loaded the test image and when I prepared it to send to our streaming platform.
I will try with a real camera and see if that makes a difference!