Real-time CLAHE processing of video, framerate issue. Gstreamer + nvivafilter + OpenCV

Hi all,
I have a solution with four 5 MP cameras streaming from a Xavier 32GB. I have implemented CLAHE through nvivafilter; however, I’m not able to reach the target of 24 FPS per camera. I benchmark 13-14 FPS, so I’m looking into how to improve this. Without the nvivafilter element I manage 24 FPS with no problem.

Relevant part of my pipeline looks like the following when configured:

gst-launch-1.0 v4l2src device=/dev/video0 ! 'video/x-raw, format=BGRx' ! nvvidconv ! 'video/x-raw(memory:NVMM), format=NV12' ! nvivafilter customer-lib-name="libnviva_clahe.so" cuda-process=true ! 'video/x-raw(memory:NVMM), format=(string)RGBA' ! nvvidconv ! 'video/x-raw(memory:NVMM), format=NV12' !
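(Side note on measurement: a sketch of how the framerate can be read out with gst-launch, using fpsdisplaysink wrapping a fakesink — this sink choice is illustrative only, not my actual output branch:)

gst-launch-1.0 -v v4l2src device=/dev/video0 ! 'video/x-raw, format=BGRx' ! nvvidconv ! 'video/x-raw(memory:NVMM), format=NV12' ! nvivafilter customer-lib-name="libnviva_clahe.so" cuda-process=true ! 'video/x-raw(memory:NVMM), format=(string)RGBA' ! nvvidconv ! fpsdisplaysink text-overlay=false video-sink=fakesink

With -v, gst-launch prints fpsdisplaysink’s rendered/dropped/current-fps messages.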

The relevant part of the nvivafilter gpu_process is the following:

static cv::Ptr<cv::cuda::CLAHE> clahe;
static cv::cuda::GpuMat gpuframe_3channel(height, width, CV_8UC3);
static std::vector<cv::cuda::GpuMat> yuv_planes(3);

if (!clahe) {
    clahe = cv::cuda::createCLAHE(2.5, cv::Size(6, 6));
}
// Wrap the RGBA buffer handed over by nvivafilter (no copy)
cv::cuda::GpuMat d_mat(height, width, CV_8UC4, pdata);
// Convert to packed YUV, equalize the luma plane only, convert back
cv::cuda::cvtColor(d_mat, gpuframe_3channel, CV_BGR2YUV, 3);
cv::cuda::split(gpuframe_3channel, yuv_planes);
clahe->apply(yuv_planes[0], yuv_planes[0]);
cv::cuda::merge(yuv_planes, gpuframe_3channel);
cv::cuda::cvtColor(gpuframe_3channel, d_mat, CV_YUV2BGR, 4);

I’m looking for ways to optimize this pipeline. For example, I have to go through NV12->RGBA->YUV->RGBA->NV12 to make this work. It would be much better to do NV12->YUV->NV12, but I can’t figure out how to get NV12 into OpenCV and back again… Other suggestions on how to improve the framerate would be much appreciated! :)

Not extensively tested, so check carefully yourself…

In gpu_process(), you would have something like:

...
  if (eglFrame.frameType == CU_EGL_FRAME_TYPE_PITCH) {
    if (eglFrame.eglColorFormat == CU_EGL_COLOR_FORMAT_ABGR) {
       cv_process_RGBA(eglFrame.frame.pPitch[0], eglFrame.width, eglFrame.height);
    } else if (eglFrame.eglColorFormat == CU_EGL_COLOR_FORMAT_YUV420_SEMIPLANAR) {
       cv_process_NV12(eglFrame.frame.pPitch, eglFrame.width, eglFrame.height);
    } else
       printf ("Invalid eglcolorformat %d\n", eglFrame.eglColorFormat);
  }
...

and have a previously defined cv_process_NV12() function similar to this:
[EDIT: The following is buggy. Skip to post #5 for the correct implementation.]

static std::vector<cv::cuda::GpuMat> uv(2);
static void cv_process_NV12(void** pPitch, int32_t width, int32_t height) {
    cv::cuda::GpuMat d_Mat_Y(height, width, CV_8UC1, pPitch[0]);

    // U and V are interleaved
    cv::cuda::GpuMat d_Mat_UV(height/2, width/2, CV_8UC2, pPitch[1]);
    cv::cuda::split(d_Mat_UV, uv);
    cv::cuda::GpuMat d_Mat_U = uv[0];
    cv::cuda::GpuMat d_Mat_V = uv[1];

    // Some process... here just setting a kind of blue
    d_Mat_Y.setTo(127);
    d_Mat_U.setTo(210);
    d_Mat_V.setTo(32);

    // reinterleave U&V
    cv::cuda::merge(uv, d_Mat_UV);

 
    // Final check
    if (d_Mat_Y.data != (uchar*) pPitch[0])
        std::cerr << "Error: reallocated buffer for d_Mat_Y" << std::endl;
    if (d_Mat_UV.data != (uchar*) pPitch[1])
        std::cerr << "Error: reallocated buffer for d_Mat_UV" << std::endl;
}

Since your case only uses Y, you may remove the chroma handling entirely and get a nice speedup.
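A minimal sketch of that Y-only variant (function name is mine, untested — and note that the stride issue discovered further down in this thread applies here as well):

static cv::Ptr<cv::cuda::CLAHE> clahe;

static void cv_process_NV12_Y_only(void** pPitch, int32_t width, int32_t height) {
    if (!clahe)
        clahe = cv::cuda::createCLAHE(2.5, cv::Size(6, 6));

    // Wrap the luma plane in place; chroma in pPitch[1] stays untouched,
    // so no split/merge and no color conversions are needed at all
    cv::cuda::GpuMat d_Mat_Y(height, width, CV_8UC1, pPitch[0]);
    clahe->apply(d_Mat_Y, d_Mat_Y);
}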

You would have to set NV12 format in OUTPUT caps of nvivafilter for this:

gst-launch-1.0 nvarguscamerasrc ! 'video/x-raw(memory:NVMM), width=1280, height=720' ! nvvidconv ! nvivafilter customer-lib-name=./lib-gst-custom-opencv_cudaprocess.so cuda-process=true ! 'video/x-raw(memory:NVMM), format=NV12' ! nvegltransform ! nveglglessink

Thank you for a swift and good response. First, for anybody else wanting to try the posted code: the uv_planes vector was not declared, but except for that it worked nicely. I’m able to apply CLAHE on the Y plane; however, I’m seeing artifacts. Running the code below I expect a gray image, but I get a gray image with vertical lines that don’t appear when I don’t run cv_process_nv12. The artifacts get amplified by the CLAHE processing and destroy the image.

static void cv_process_nv12(void **pPitch, int32_t width, int32_t height)
{
    cv::cuda::GpuMat d_Mat_Y(height, width, CV_8UC1, pPitch[0]);

    // U and V are interleaved
    std::vector<cv::cuda::GpuMat> uv_planes(2);
    cv::cuda::GpuMat d_Mat_UV(height/2, width/2, CV_8UC2, pPitch[1]);
    cv::cuda::split(d_Mat_UV, uv_planes);
    cv::cuda::GpuMat d_Mat_U = uv_planes[0];
    cv::cuda::GpuMat d_Mat_V = uv_planes[1];

    // Set image to gray
    d_Mat_U.setTo(0x80);
    d_Mat_V.setTo(0x80);
    // reinterleave U&V
    cv::cuda::merge(uv_planes, d_Mat_UV);
}

Running cv_process_nv12: [image: gray frame with vertical line artifacts]

Not running cv_process_nv12: [image: clean frame]

My answer was misleading, I’m afraid. Testing with a solid color didn’t show the mess I see when drawing only a rectangle. The pixel layout may be a bit more complex than I thought.
I’ll try to investigate this later and post here if I can solve it.
Someone else with better knowledge of the NV12 format in EGL frames may also advise.

Also note that in my example the vector of GpuMats is declared as static just above the function. For performance (and this applies to your filter as well), avoid allocating in the processing loop: declare buffers as static variables, allocate them once in the init function, and just use them in the process function.
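A minimal sketch of that pattern (the init hook name is hypothetical — adapt it to your library's entry points):

static std::vector<cv::cuda::GpuMat> uv(2);

// Hypothetical one-time init, called before the first frame
static void my_filter_init(int32_t width, int32_t height) {
    uv[0].create(height / 2, width / 2, CV_8UC1);   // Cb buffer
    uv[1].create(height / 2, width / 2, CV_8UC1);   // Cr buffer
}

// gpu_process() then only wraps the pPitch pointers each frame; since
// sizes and types already match, split()/merge() reuse these buffers
// instead of reallocating.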

I think I got it… It seems rows are padded to a 256-byte stride. This should be better (so far, not tested much):

static std::vector<cv::cuda::GpuMat> uv(2);
static void cv_process_NV12(void** pPitch, int32_t width, int32_t height) {

    //printf ("cv_process_NV12  %d x %d\n", width, height);

    // Round the width up to the next multiple of the 256-byte stride
    const int stride = 256;
    int num_strides = ((int)width)/stride;
    int use_width = num_strides*stride;
    if (use_width < (int)width)
        use_width += stride;
    int use_height = height;

    cv::cuda::GpuMat d_Mat_Y(use_height, use_width, CV_8UC1, pPitch[0]);

    // U and V are interleaved
    cv::cuda::GpuMat d_Mat_UV(use_height/2, use_width/2, CV_8UC2, pPitch[1]);
    cv::cuda::split(d_Mat_UV, uv);
    cv::cuda::GpuMat d_Mat_Cb = uv[0];
    cv::cuda::GpuMat d_Mat_Cr = uv[1];

    // Some process... here just setting a kind of blue in a top-left square
    cv::Rect Yroi(0, 0, 100, 100);
    cv::Rect UVroi(0, 0, 50, 50);

    d_Mat_Y(Yroi).setTo(100);
    d_Mat_Cr(UVroi).setTo(0);
    d_Mat_Cb(UVroi).setTo(255);

    // reinterleave U&V
    cv::cuda::merge(uv, d_Mat_UV);

    // Final check: ensure OpenCV did not reallocate the wrapped buffers
    if (d_Mat_Y.data != (uchar*) pPitch[0])
        std::cerr << "Error: reallocated buffer for d_Mat_Y" << std::endl;
    if (d_Mat_UV.data != (uchar*) pPitch[1])
        std::cerr << "Error: reallocated buffer for d_Mat_UV" << std::endl;
}

Be aware that there may be a (black?) border on the right side of the image because of the stride padding.
If this fools your filter, you can get your original Y mat with:

cv::Rect Yroi(0, 0, width, height);  // cv::Rect takes (x, y, width, height)
d_Mat_Original_Y = d_Mat_Y(Yroi);
// Apply your filter on the original region now
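An alternative I have not tested in this thread: rather than rounding the width up, forward the actual pitch that the CUeglFrame reports (eglFrame.pitch, in bytes) and pass it as the explicit step argument of the GpuMat constructors. Width and height then stay the real image dimensions and no ROI is needed:

// Assumes pitch == eglFrame.pitch (line stride in bytes); for NV12 the
// chroma plane typically shares the luma pitch
cv::cuda::GpuMat d_Mat_Y(height, width, CV_8UC1, pPitch[0], pitch);
cv::cuda::GpuMat d_Mat_UV(height/2, width/2, CV_8UC2, pPitch[1], pitch);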