GStreamer RTSP Performance Question

I was wondering whether anybody has run a performance analysis to measure the CPU/GPU usage of sending images over RTSP to a client on another machine. I ask because, during my tests, I noticed that the GStreamer RTSP workflow I set up takes about 50% of the available CPU on my Xavier. Is this normal?

Can the devs give me an idea of how much CPU the RTSP workflow using nvenc should be utilizing? Thanks.

It’s hard to say more without knowing your resolution, framerate, and encoding.
Please share your RTSP pipeline so we can give better advice.

Hey! I will grab the pipeline. I am actually pushing raw frames from my C++ application into GStreamer to send out over RTSP.

  1. Resolution: 640x480
  2. Framerate: 60fps
  3. Encoding: H264/H265

Our input frames into the gstreamer system are BGR images.

Here’s the raw code we’re using to set up the pipeline:

    src = gst_element_factory_make("appsrc", NULL);
    g_object_set(src, "max-buffers", 60, NULL);

    g_object_set(GST_OBJECT(src), "name", "source", NULL);
    g_object_set(GST_OBJECT(src), "is-live", true, NULL);
    g_object_set(GST_OBJECT(src), "block", true, NULL);


    gst_app_src_set_caps(GST_APP_SRC(src), caps);
    bgrconv = gst_element_factory_make("videoconvert", "videoconvert");
    conv = gst_element_factory_make("nvvidconv", "conv");

    encode = gst_element_factory_make("omxh265enc", NULL);
    g_object_set(GST_OBJECT(encode), "bitrate", 102400, NULL);

    // Interval between key frames: a lower value causes more stutter but
    // more correct frames; a higher value causes more prediction and more "smearing".
    g_object_set(GST_OBJECT(encode), "iframeinterval", 5, NULL);

    g_object_set(GST_OBJECT(encode), "preset-level", 1, NULL);
    g_object_set(GST_OBJECT(encode), "control-rate", 1, NULL);


    // Quantization range for P, I and B frames.
    // Use a string with the quantization ranges in
    // MinQpP,MaxQpP:MinQpI,MaxQpI:MinQpB,MaxQpB order to set the property.
    g_object_set(GST_OBJECT(encode), "qp-range", "20,35:20,35:20,35", NULL);
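To make the qp-range format concrete, here is a small illustrative parser for such strings (`parse_qp_range` is a hypothetical helper for this post, not part of the pipeline code):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parse a qp-range string of the form
// "MinQpP,MaxQpP:MinQpI,MaxQpI:MinQpB,MaxQpB" into (min, max) pairs.
// Hypothetical helper for illustration only.
static std::vector<std::pair<int, int>> parse_qp_range(const std::string &s) {
    std::vector<std::pair<int, int>> ranges;
    std::istringstream outer(s);
    std::string pair_str;
    while (std::getline(outer, pair_str, ':')) {   // split P:I:B groups
        std::istringstream inner(pair_str);
        std::string min_str, max_str;
        std::getline(inner, min_str, ',');         // split min,max
        std::getline(inner, max_str, ',');
        ranges.emplace_back(std::stoi(min_str), std::stoi(max_str));
    }
    return ranges;
}
```

So `"20,35:20,35:20,35"` yields the same (20, 35) range for P, I and B frames.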

    parse = gst_element_factory_make("h265parse", NULL);

    pay = gst_element_factory_make("rtph265pay", NULL);
    g_object_set(GST_OBJECT(pay), "config-interval", 1, NULL);

    // TODO: when applying frame IDs on udpsink directly, the callback
    // fails to be called, so we use identity as a workaround.
    identity = gst_element_factory_make("identity", NULL);

    sink = gst_element_factory_make("udpsink", NULL);
    g_object_set(G_OBJECT(sink), "sync", false, NULL);
    g_object_set(G_OBJECT(sink), "async", false, NULL);
    g_object_set(G_OBJECT(sink), "port", 1455, NULL);
    g_object_set(G_OBJECT(sink), "host", "0.0.0.0", NULL);


    GstElement *capsfilter0 = gst_element_factory_make("capsfilter", "capsfilter0");
    GstElement *capsfilter1 = gst_element_factory_make("capsfilter", "capsfilter1");
    GstElement *capsfilter2 = gst_element_factory_make("capsfilter", "capsfilter2");

    std::string capsfilter_caps0_string =
        "video/x-raw, format=(string)BGR, width=(int)640, "
        "height=(int)480, framerate=(fraction)60/1";
    GstCaps *capsfilter_caps0 = gst_caps_from_string(capsfilter_caps0_string.c_str());

    std::string capsfilter_caps1_string =
        "video/x-raw, format=(string)BGRx, width=(int)640, "
        "height=(int)480, framerate=(fraction)60/1";
    GstCaps *capsfilter_caps1 = gst_caps_from_string(capsfilter_caps1_string.c_str());

    std::string capsfilter_caps2_string =
        "video/x-raw(memory:NVMM), format=(string)I420, width=(int)640, "
        "height=(int)480, framerate=(fraction)60/1";
    GstCaps *capsfilter_caps2 = gst_caps_from_string(capsfilter_caps2_string.c_str());

    g_object_set(capsfilter0, "caps", capsfilter_caps0, NULL);
    g_object_set(capsfilter1, "caps", capsfilter_caps1, NULL);
    g_object_set(capsfilter2, "caps", capsfilter_caps2, NULL);
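The three caps strings differ only in the memory feature and pixel format, so a small helper could remove the repetition (`make_raw_caps_string` is a hypothetical sketch, not in the original code):

```cpp
#include <cassert>
#include <string>

// Build a raw-video caps string. "memory_feature" is e.g. "(memory:NVMM)"
// or "" for system memory. Hypothetical helper to reduce repetition.
static std::string make_raw_caps_string(const std::string &memory_feature,
                                        const std::string &format,
                                        int width, int height, int fps) {
    return "video/x-raw" + memory_feature +
           ", format=(string)" + format +
           ", width=(int)" + std::to_string(width) +
           ", height=(int)" + std::to_string(height) +
           ", framerate=(fraction)" + std::to_string(fps) + "/1";
}
```

For example, `make_raw_caps_string("", "BGR", 640, 480, 60)` produces the same string that is passed to `gst_caps_from_string` for capsfilter0 above, and `make_raw_caps_string("(memory:NVMM)", "I420", 640, 480, 60)` the one for capsfilter2.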

   gst_bin_add_many(GST_BIN(app.pipeline), src, capsfilter0, bgrconv, capsfilter1, conv, capsfilter2, encode, parse, pay, identity, sink, NULL);
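For reference, the element chain added above corresponds roughly to this launch string (a sketch only: appsrc properties are abbreviated, and the C++ code still needs to link the elements in this order after adding them to the bin):

```shell
gst-launch-1.0 appsrc ! 'video/x-raw,format=BGR,width=640,height=480,framerate=60/1' \
  ! videoconvert ! 'video/x-raw,format=BGRx,width=640,height=480,framerate=60/1' \
  ! nvvidconv ! 'video/x-raw(memory:NVMM),format=I420,width=640,height=480,framerate=60/1' \
  ! omxh265enc bitrate=102400 iframeinterval=5 preset-level=1 control-rate=1 qp-range="20,35:20,35:20,35" \
  ! h265parse ! rtph265pay config-interval=1 ! identity \
  ! udpsink sync=false async=false port=1455 host=0.0.0.0
```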

I just tested on Xavier NX running L4T R32.5.1.
I simulated your appsrc with videotestsrc. Because videotestsrc is CPU-expensive, I ran it at low resolution and rescaled with nvvidconv:

gst-launch-1.0 videotestsrc ! video/x-raw, width=320,height=240,framerate=60/1 ! nvvidconv ! 'video/x-raw(memory:NVMM)' ! nvvidconv ! video/x-raw,format=BGRx,width=640,height=480,framerate=60/1 ! videoconvert ! video/x-raw,format=BGR,width=640,height=480,framerate=60/1 ! fakesink

Running tegrastats, I see it takes about 75% of one core (running 15W 6cores mode without boosting clocks).

Adding your encoding pipeline:

gst-launch-1.0 videotestsrc ! video/x-raw, width=320,height=240,framerate=60/1 ! nvvidconv ! 'video/x-raw(memory:NVMM)' ! nvvidconv ! video/x-raw,format=BGRx,width=640,height=480,framerate=60/1 ! videoconvert ! video/x-raw,format=BGR,width=640,height=480,framerate=60/1 ! videoconvert ! video/x-raw,format=BGRx ! nvvidconv ! omxh265enc bitrate=102400 iframeinterval=5 preset-level=1 control-rate=1 qp-range="0,35:20,35:20,35" ! h265parse ! rtph265pay config-interval=1 ! udpsink sync=false async=false port=1455 host=0.0.0.0

it takes about 100% of one core. So the format conversion/encoding/packetization seems to take only 25% of a core, which looks correct for 640x480@60 fps.
Note that AGX Xavier has 8 CPU cores, so there should be room for your custom processing.

[EDIT: Retesting now, I see the baseline at 25% for simulating your source and 50% for your full case. I suppose the earlier difference was due to some Ubuntu software running in the background.]

Thanks to Honey Patouceul for the check and analysis.

Hi,
Since the hardware NVMM buffer does not support BGR, we have to convert the data to RGBA first and copy it into an NVMM buffer. This takes significant CPU usage.
Your pipeline effectively runs

appsrc ! video/x-raw,format=BGR ! videoconvert ! video/x-raw,format=RGBA ! nvvidconv ! video/x-raw(memory:NVMM) ! ...

You can try

appsrc ! video/x-raw,format=RGBA ! nvvidconv ! video/x-raw(memory:NVMM) ! ...

If you can put the data in RGBA directly, it should save some CPU usage.

Also, the omx plugins are deprecated. Please use the v4l2 plugins instead.
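A v4l2-based version of the encode stage could look like the following sketch (the property sets of omxh265enc and nvv4l2h265enc differ somewhat, so verify the exact names and values with `gst-inspect-1.0 nvv4l2h265enc` on your JetPack release):

```shell
gst-launch-1.0 videotestsrc ! nvvidconv \
  ! 'video/x-raw(memory:NVMM),format=I420,width=640,height=480,framerate=60/1' \
  ! nvv4l2h265enc bitrate=102400 iframeinterval=5 \
  ! h265parse ! rtph265pay config-interval=1 \
  ! udpsink sync=false async=false port=1455 host=0.0.0.0
```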