GPU Acceleration Support for OpenCV Gstreamer Pipeline


We run the following pipeline in OpenCV using a Raspberry Pi HQ camera.

gst_pipeline = "nvarguscamerasrc ! video/x-raw(memory:NVMM), width=4032, height=3040, format=(string)NV12, framerate=30/1 ! nvvidconv flip-method=0 ! video/x-raw, width=4032, height=3040, format=(string)BGRx ! videoconvert ! video/x-raw, format=(string)BGR ! appsink"

cv2.VideoCapture(gst_pipeline,  cv2.CAP_GSTREAMER)

The camera is supposed to deliver 12MP@30fps, but we only get 15fps at full camera resolution. We were told that 30fps is only achievable when capturing with the GStreamer pipeline itself.

I wonder if there is a GPU-accelerated way for OpenCV to achieve better performance.

Would you also recommend any changes to this pipeline to improve overall performance, such as reducing CPU load, RAM usage or latency while keeping high visual quality?

Thank you

An application using OpenCV videoio only receives frames as CPU-allocated cv::Mat. This is not very fast and will not keep up at high resolutions and frame rates. The same applies to processing cv::Mat frames on the CPU.
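
To put a rough number on this, the sketch below estimates the raw data rate at each stage of the pipeline in the question. It assumes uncompressed 4032x3040 frames at 30 fps, with 1.5 bytes per pixel for NV12, 4 for BGRx and 3 for BGR; the helper name `data_rate_bytes_per_s` is made up for illustration and is not part of OpenCV or GStreamer.

```python
# Rough data-rate estimate for uncompressed video (illustrative helper only).
WIDTH, HEIGHT, FPS = 4032, 3040, 30

# Bytes per pixel for the formats that appear in the pipeline.
BYTES_PER_PIXEL = {"NV12": 1.5, "BGRx": 4, "BGR": 3}


def data_rate_bytes_per_s(width, height, fps, fmt):
    """Return the raw byte rate of a stream in the given pixel format."""
    return int(width * height * BYTES_PER_PIXEL[fmt] * fps)


if __name__ == "__main__":
    for fmt in ("NV12", "BGRx", "BGR"):
        rate = data_rate_bytes_per_s(WIDTH, HEIGHT, FPS, fmt)
        print(f"{fmt}: {rate / 1e9:.2f} GB/s")
```

Under these assumptions the BGR frames handed to appsink alone amount to roughly 1.1 GB/s, in addition to the BGRx to BGR conversion that videoconvert performs on the CPU.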

In such a case you would indeed do the GPU processing on NVMM memory within GStreamer.

You can access NVMM buffers with the GStreamer plugin nvivafilter. It is intended to perform CUDA operations on NVMM-hosted frames, so you can use it with OpenCV CUDA. You would have to output RGBA frames from this plugin. You may have a look at this example.

Also note that you can access the frames directly from the GStreamer buffer if your application builds the GStreamer pipeline itself.
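
If the application does not need full-resolution frames on the CPU, one option is to let nvvidconv downscale before the copy out of NVMM memory. The helper below only builds the pipeline string. `build_csi_pipeline` and its parameters are hypothetical names; the elements and caps are those from the question, and `drop` and `max-buffers` are standard appsink properties. Whether this reaches 30 fps on a given board has not been tested here.

```python
def build_csi_pipeline(capture_w=4032, capture_h=3040, fps=30,
                       out_w=1920, out_h=1080, flip_method=0):
    """Build a GStreamer pipeline string for cv2.VideoCapture.

    The sensor is captured at capture_w x capture_h into NVMM memory and
    nvvidconv scales it to out_w x out_h before frames reach the CPU.
    """
    return (
        "nvarguscamerasrc ! "
        f"video/x-raw(memory:NVMM), width={capture_w}, height={capture_h}, "
        f"format=(string)NV12, framerate={fps}/1 ! "
        f"nvvidconv flip-method={flip_method} ! "
        f"video/x-raw, width={out_w}, height={out_h}, format=(string)BGRx ! "
        "videoconvert ! video/x-raw, format=(string)BGR ! "
        # Keep only the newest frame so a slow consumer does not add latency.
        "appsink drop=true max-buffers=1"
    )


if __name__ == "__main__":
    print(build_csi_pipeline())
    # On the Jetson, the string would be passed to OpenCV like this:
    #   import cv2
    #   cap = cv2.VideoCapture(build_csi_pipeline(), cv2.CAP_GSTREAMER)
```

Scaling inside nvvidconv keeps the resize off the CPU and shrinks the buffers that videoconvert and appsink have to handle.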

Additional note: The main bottleneck is OpenCV videoio. An alternative is to use @dusty_nv's jetson-utils library, which has a much more efficient implementation.
If you’ve built and installed jetson-inference, it should already be installed on your Jetson. Note that this assumes a recent version with support for the various video sources, so be sure you have a version pulled after the end of June 2020.

The following example reads frames from the CSI camera, wraps each received image in an OpenCV GpuMat, converts RGB to HSV on the GPU, extracts H to apply a binary threshold, then converts back to RGB and finally displays the transformed frame:

#include <iostream>
#include <vector>

#include <jetson-utils/videoSource.h>
#include <jetson-utils/videoOutput.h>

#include <opencv2/opencv.hpp>
#include "opencv2/cudaarithm.hpp"
#include "opencv2/cudaimgproc.hpp"

int main(int argc, char **argv) {

	// create input stream
	videoOptions opt;
	opt.width  = 3264;
	opt.height = 2464;
	opt.frameRate = 21;
	opt.zeroCopy = false; // GPU access only for better speed
	videoSource * input = videoSource::Create("csi://0", opt);
	if (!input) {
		std::cerr << "Error: Failed to create input stream" << std::endl;
		return -1;
	}

	// create output stream
	videoOutput* output = videoOutput::Create("display://0");
	if( !output ) {
		std::cerr << "Error: Failed to create output stream" << std::endl;
		delete input;
		return -2;
	}

	// Read one frame to get resolution
	uchar3* image = NULL;
	if( !input->Capture(&image, 1000) ) {
		std::cerr << "Error: failed to capture first video frame" << std::endl;
		delete output;
		delete input;
		return -3;
	}

	// processing loop: GPU buffers are allocated once and reused for every frame
	cv::cuda::GpuMat d_Mat_HSV(input->GetHeight(), input->GetWidth(), CV_8UC3);
	std::vector<cv::cuda::GpuMat> d_hsv(3);
	double prev = (double) cv::getTickCount();
	while( 1 ) {
		// capture next image
		if( !input->Capture(&image, 1000) ) {
			std::cerr << "Error: failed to capture video frame" << std::endl;
			if( !input->IsStreaming() )
				break;
			continue;
		}

		// log timing (seconds elapsed since the previous frame)
		double cur = (double) cv::getTickCount();
		double delta = (cur - prev) / cv::getTickFrequency();
		std::cout<<"delta=" << delta << std::endl;
		prev = cur;

		// Some OpenCV processing: wrap the captured GPU buffer without copying it
		cv::cuda::GpuMat frame_in(input->GetHeight(), input->GetWidth(), CV_8UC3, image);
		cv::cuda::cvtColor(frame_in, d_Mat_HSV, cv::COLOR_RGB2HSV);
		cv::cuda::split(d_Mat_HSV, d_hsv);
		cv::cuda::threshold(d_hsv[0], d_hsv[0], 100, 255, cv::THRESH_BINARY);
		cv::cuda::merge(d_hsv, d_Mat_HSV);
		cv::cuda::cvtColor(d_Mat_HSV, frame_in, cv::COLOR_HSV2RGB);

		// Display result (frame_in wraps the capture buffer, modified in place)
		output->Render((uchar3*) frame_in.data, input->GetWidth(), input->GetHeight());
		if( !output->IsStreaming() )
			break;
		if( !input->IsStreaming() )
			break;
	}

	delete input;
	delete output;
	return 0;
}

I built against opencv-4.4.0-pre installed in /usr/local/opencv-4.4.0-pre, so:

g++ -std=c++11 -Wall -I/usr/local/opencv-4.4.0-pre/include/opencv4 -I/usr/local/cuda/targets/aarch64-linux/include test-jetson-utils-opencv.cpp -L/usr/local/opencv-4.4.0-pre/lib -lopencv_core -lopencv_cudaarithm -lopencv_cudaimgproc -ljetson-utils -o test-jetson-utils-opencv

My camera can only run at 21fps with this resolution, but it seems to work fine.


Trying with the default preinstalled OpenCV from JP_4.4_GA:
git clone
cd jetson-inference
git submodule update --init
mkdir build
cd build
cmake ..
make -j8
sudo make install

g++ -std=c++11 -Wall -I/usr/local/opencv-4.3.0-dev/include/opencv4 -I/usr/local/cuda/targets/aarch64-linux/include test-jetson-utils-opencv.cpp -L/usr/local/opencv-4.3.0-dev/lib -lopencv_core -lopencv_cudaarithm -lopencv_cudaimgproc -ljetson-utils -o test-jetson-utils-opencv
./test-jetson-utils-opencv: error while loading shared libraries: cannot open shared object file: No such file or directory

Previously, before installing jetson-inference, I used to build examples with the command below, which still works:

g++ -o simple_opencv -Wall -std=c++11 simple_opencv.cpp $(pkg-config --cflags --libs opencv4)
GST_ARGUS: Creating output stream
CONSUMER: Waiting until producer is connected...

It seems some path is missing.

It seems you’ve installed OpenCV into /usr/local/opencv-4.3.0-dev, which is not a default path for libraries. Try:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/opencv-4.3.0-dev/lib

and retry.
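
To check which directory actually holds a library before editing LD_LIBRARY_PATH, a small script such as the following can help. It is a minimal sketch: `find_shared_library` is a hypothetical helper, and the directory and library names in the usage lines are only examples taken from this thread.

```python
import glob
import os


def find_shared_library(name, directories):
    """Return the directories that contain a file matching lib<name>.so*."""
    hits = []
    for directory in directories:
        pattern = os.path.join(directory, f"lib{name}.so*")
        if glob.glob(pattern):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    # Directories listed in LD_LIBRARY_PATH, plus the custom install prefix.
    search = [d for d in os.environ.get("LD_LIBRARY_PATH", "").split(":") if d]
    search.append("/usr/local/opencv-4.3.0-dev/lib")
    print(find_shared_library("opencv_cudaimgproc", search))
```

An empty result means none of the listed directories holds the library, so the loader error would persist until the right directory is added.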


Does the image quality seem reasonable when using the default AGX CSI sensor with the code above?

The binary threshold on hue has no purpose other than making this example. The result may depend on the lighting and on the colors of objects.
You could comment it out, or comment out the whole OpenCV processing, to check.

I suspect that OpenCV 4.3.0-dev got installed by the jetson-inference package, since before installing it this was a default stock 4.4 GA OS with the preinstalled OpenCV. Or perhaps I installed that OpenCV version myself somehow a while ago. I shall try another day when there is more sun, and also on another device [NX] that has the stock OS from 4.4.GP.
Thank you very much!

Hi @Andrey1984, I no longer install OpenCV libraries in the jetson-inference install script. And even when it did, it would have pulled OpenCV 3.2 (which is the version in the Ubuntu Bionic apt repo).

Thank you for letting me know!
The hypothesis seems to have turned out to be a ‘false positive’.


If you are using the OV5693 sensor, you would also need to adjust the resolution and framerate for this sensor through nvarguscamerasrc, if not already done.

Yes, adjusted.
Different lighting gives different images:
some of them looked like segmentation;
some of them had a clear picture;
it depends on the lighting.

Yes, Argus may try to optimize digital gain, exposure and white balance. This may push colors out of the expected range in low-light conditions.
I don’t know of any software option to set these short of writing a dirty patch, so I’d advise keeping the lighting high, unless someone can tell how to do it.
Segmentation-like images are expected for a binary threshold on H. It should set the color to be either green or red.
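
The red or green outcome can be illustrated on the CPU with plain Python. The sketch assumes OpenCV's 8-bit convention of storing hue as degrees divided by two, and it assumes that the out-of-range hue value 255 wraps around the color circle when converted back to RGB; this wrap-around is an assumption and was not checked against the CUDA converter. `thresholded_hue_to_rgb` is an illustrative name, and saturation and value are fixed at maximum for simplicity.

```python
import colorsys


def thresholded_hue_to_rgb(h8, thresh=100, maxval=255):
    """Apply a binary threshold to an 8-bit hue and convert the result to RGB.

    h8 is a hue in OpenCV's 8-bit convention (degrees / 2, so 0..179).
    Saturation and value are fixed at 1.0 to isolate the effect on hue.
    """
    # THRESH_BINARY: values above thresh become maxval, the rest become 0.
    h_out = maxval if h8 > thresh else 0
    # Assumption: a hue beyond 360 degrees wraps around the color circle.
    degrees = (h_out * 2) % 360
    r, g, b = colorsys.hsv_to_rgb(degrees / 360.0, 1.0, 1.0)
    return tuple(round(c * 255) for c in (r, g, b))


if __name__ == "__main__":
    print(thresholded_hue_to_rgb(30))   # hue at or below the threshold
    print(thresholded_hue_to_rgb(120))  # hue above the threshold
```

With these assumptions, hues at or below the threshold map to red and hues above it map to a green with a cyan tint. In the real example S and V are kept from the image, so the actual shades vary with the scene.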

Hi @dusty_nv, @Honey_Patouceul
I used GStreamer + OpenCV to decode an RTSP stream from Python code, like this:

gstream_elements = (
    'rtspsrc location=rtsp latency=300 ! '
    'rtph264depay ! h264parse ! '
    'omxh264dec ! '
    'video/x-raw(memory:NVMM),format=(string)NV12 ! '
    'nvvidconv ! video/x-raw , format=(string)BGRx ! '
    'videoconvert ! '
    'appsink sync=0')
cv2.VideoCapture(gstream_elements, cv2.CAP_GSTREAMER)

As you know, that’s not a very efficient way of decoding, because the decoded frames are copied from the NVMM buffer to a CPU buffer, which makes the Jetson use more memory for decoding.
Q1- I previously tested DeepStream for multi-stream, and it did not use any extra memory for decoding, because this SDK uses the NVMM buffer directly for GPU processing. Is there a way to use OpenCV + GStreamer in Python without the CPU buffer copy? I compiled OpenCV 4.1.1 with CUDA support.
If there is no way with OpenCV + GStreamer, is there another solution for decoding the streams without copying to a CPU buffer, especially from Python code?

Q2- If I want to connect a USB Coral TPU for other processing, do I have to bring the decoded frames from the NVMM buffer into a CPU buffer?

Q3- In this diagram, batching of frames is done on the CPU. Does that mean that even in the DeepStream solution the decoded frames are copied from the NVMM buffer to a CPU buffer in order to gather a batch? If so, what is the difference between this solution and the OpenCV + GStreamer solution? In both cases the decoded frames are brought from the NVMM buffer into a CPU buffer, right?

Q4- For the scaling and cropping part of the diagram, I can also use the nvvidconv plugin in GStreamer + OpenCV to do these operations. Does this plugin use the VIC hardware when used in GStreamer + OpenCV?

Q5- Is it possible to access the NVMM buffer from the CPU?

Finally, I am looking for the best Python solution for multi-stream decoding that avoids copying twice from the NVMM buffer to a CPU buffer. I want to use the frames with both the USB Coral TPU and the Jetson GPU.

Q6- The nvivafilter plugin has pre/post-processing. How does it work? Can it do custom pre/post-processing, and which function do I use for that?

You would have to output RGBA frames from this plugin.

This plugin is like the nvvidconv plugin, right?

This post is a duplicate of
