VPI dense optical flow performs poorly when run in parallel with streaming output


I am using CUDA-backend VPI functions in a main thread, which culminates in sending the image to a server using an OpenCV VideoWriter with GStreamer:

def gstreamer_out():
    # leaky=downstream throws away old buffers - default queue size is 5
    # sync=false might be useful
    # not tested with real cameras
    # MUX playback ID https://stream.mux.com/vL9SJU61FSv8sSQR01F6ajKI702WeK2pXRuLVtw25zquo.m3u8

    return (
        "appsrc ! "
        "videoconvert ! "
        "video/x-raw, framerate=(fraction)25/1, format=RGBA ! "
        "nvvidconv ! "
        "nvv4l2h264enc ! "
        "h264parse ! "
        "flvmux ! "
        "queue leaky=downstream ! "
        "rtmpsink location=rtmp://global-live.mux.com:5222/app/51bc0427-ad29-2909-4979-11ee335d2b53 sync=false"
    )

# VideoWriter arguments reconstructed for completeness: GStreamer backend, 25 fps, 1080p
out_stream = cv2.VideoWriter(
    gstreamer_out(), cv2.CAP_GSTREAMER, 0, 25.0, (1920, 1080), True
)

In a child thread I am performing dense optical flow on a 1080p image:

with time_it("INF: convert image for OF"):
    with streamLeft:
        curFrame = vpi.asimage(np_img, vpi.Format.BGR8) \
            .convert(vpi.Format.NV12_ER, backend=vpi.Backend.CUDA) \
            .convert(vpi.Format.NV12_ER_BL, backend=vpi.Backend.VIC)

if prevFrame is not None:
    with time_it("INF: optical flow"):
        # Calculate the motion vectors from previous to current frame
        with vpi.Backend.NVENC:
            with streamLeft:
                motion_vectors = vpi.optflow_dense(prevFrame, curFrame, quality=vpi.OptFlowQuality.LOW)
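The `time_it` helper used above is not shown in the thread; a minimal sketch of such a timing context manager (an assumed implementation, not the OP's actual code) could look like:

```python
import time
from contextlib import contextmanager

@contextmanager
def time_it(label):
    # Print the wall-clock time spent inside the block, in milliseconds
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        print(f"{label}: {elapsed_ms:.3f}ms")
```

This matches the `LABEL: 1.234ms` format of the logs quoted later in the thread.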

My issue is that this drops the output FPS of the main thread from ~30 fps to ~12 fps.

I understand that the nvv4l2h264enc encoder uses the NVENC chip, as does the optical flow. Can the chip not be used in parallel? I am using max power settings. What are the options for CPU video encoding instead?
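For reference, a software-encoding variant of the pipeline above would swap nvv4l2h264enc for x264enc. This is a sketch only: the x264enc settings and the rtmpsink location are assumptions, untested on this setup.

```python
def gstreamer_out_cpu():
    # CPU H.264 encoding with x264enc instead of the NVENC-based nvv4l2h264enc.
    # tune=zerolatency and speed-preset=ultrafast trade quality for speed;
    # bitrate is in kbit/s. The rtmpsink location is a placeholder.
    return (
        "appsrc ! "
        "videoconvert ! "
        "video/x-raw, format=I420, framerate=(fraction)25/1 ! "
        "x264enc tune=zerolatency speed-preset=ultrafast bitrate=4000 ! "
        "h264parse ! "
        "flvmux ! "
        "queue leaky=downstream ! "
        "rtmpsink location=rtmp://example.com/app/STREAM_KEY sync=false"
    )
```

On a Jetson-class CPU, sustaining 1080p30 with x264enc is not guaranteed even at ultrafast, which is why it is usually a last resort.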

many thanks


We want to check the NVENC behavior further.
Would you mind sharing the complete source with us?



VC_Detect_For_Nvidia.py (9.3 KB)


Thanks for the sample.
The instructions within the sample are detailed.
We will give it a check and share more information with you later.


Thanks AastaLLL, I can provide the link to view the RTMP video if necessary.

Also, we are not married to the OpenCV/GStreamer components. As long as we can send 30-40 fps 1080p to an RTMP/RTMPS endpoint, that's all we need, so other solutions are very welcome - including CPU encoding as a last resort.


We suppose you should be able to reproduce similar behavior with a video input/output.
If so, could you modify the sample to use video? It will be easier for our internal team to check.



We have checked the sample shared on Apr 27.
However, the sample doesn't call dense optical flow.

Could you double check the file?


Hi AastaLLL, oops, that was the wrong file! Here is the correct one. The behaviour is a collapse of the output FPS from ~40 fps with no optical flow down to ~18 fps with optical flow. I have not had an opportunity to convert it to a video output, as I am not yet clear on how that is done.

VC_OF_Detect_For_Nvidia.py (8.4 KB)


Could you share some info about which elapsed time we should focus on?
We tested the script with optical flow on and off. The performance varies across frames:

Optical Flow OFF

VC: upload to GPU (2): 0.301ms
VC: perp processing & sync (2): 0.541ms
VC: output GPU to CPU (1): 1.399ms
VC: put image on queue (2): 0.172ms
INF: get object off queue: 24.199ms
INF: convert image for OF: 5.673ms
VC: draw on rectangles: 6.136ms
INF: get object off queue: 0.148ms
VC: output to mux: 33.566ms
VC: upload to GPU (2): 0.354ms
VC: perp processing & sync (2): 0.590ms
INF: convert image for OF: 37.251ms
VC: output GPU to CPU (1): 2.402ms
VC: put image on queue (2): 0.089ms
INF: get object off queue: 23.352ms
VC: draw on rectangles: 0.409ms
INF: convert image for OF: 2.546ms
INF: get object off queue: 0.059ms
INF: convert image for OF: 9.984ms
VC: output to mux: 12.066ms

Optical Flow ON

VC: upload to GPU (2): 0.828ms
VC: perp processing & sync (2): 5.917ms
VC: output GPU to CPU (1): 3.388ms
VC: draw on rectangles: 0.369ms
INF: optical flow: 63.797ms
INF: get object off queue: 0.056ms
INF: convert image for OF: 0.884ms
VC: output to mux: 46.051ms
VC: upload to GPU (2): 0.222ms
VC: perp processing & sync (2): 0.546ms
VC: output GPU to CPU (1): 1.207ms
VC: draw on rectangles: 0.359ms
VC: output to mux: 12.077ms

Hi AastaLLL, and that's great you tried some tests.

Yes, "output to mux" is the OpenCV VideoWriter with GStreamer, which as far as I know uses the NVENC chip to encode the frames.

With OF off, that output stays below 10 ms and we get an output FPS on Mux (our video endpoint service) of ~42. With OF on, we hit maximums of 42 ms+ every few frames and an FPS on Mux of 20.
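As a rough sanity check of these numbers (the figures below are illustrative assumptions, not measurements from this pipeline): at ~42 fps the per-frame budget is roughly 24 ms, so an encode step that spikes to 42 ms+ every few frames is enough to pull the average rate down substantially.

```python
def average_fps(normal_ms, stall_ms, stall_every_n):
    # Average frame rate when every Nth frame takes stall_ms
    # and the rest take normal_ms.
    total_ms = (stall_every_n - 1) * normal_ms + stall_ms
    return stall_every_n / (total_ms / 1000.0)

# ~24 ms per frame normally (~42 fps), with a 62 ms frame every third frame:
print(round(average_fps(24.0, 62.0, 3)))  # -> 27
```

Periodic stalls of this magnitude would therefore land the average somewhere in the observed low-to-mid-20s fps range.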


We turned off optical flow by disabling the condition below:

if 0:#prevFrame is not None:

So the CUDA/VIC conversion for OF still remains.

In that case, we still see the occasional latency when writing the image to mux.
So the cause might not be NVENC contention but the extra load of the OF conversion.


Hi AastaLLL, this is well spotted - any suggestions on how to speed up the conversion, or is this already the optimal setting?


Sorry for the late update.

It seems there are several CPU ↔ GPU buffer transfers in your pipeline.
Maybe you can try our jetson-utils, which can read the camera directly into a GPU buffer, to see if it helps.


Hi AastaLLL

I didn’t realise there are several CPU-GPU transfers; I thought they only happened when I loaded in the test image and when I prepared it to send to our streaming platform.

I will try with a real camera and see if that makes a difference!

Many thanks

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.