Pipeline lags with nvargus while using NVMM

Hello,

I designed a pipeline that needs to run at a minimum of 30 fps when reading from 2 IMX477 cameras. To achieve this, I make sure all frames are read and written in NVMM memory.

Yet, surprisingly, my pipeline runs fine (i.e. at an average of 31.5 fps) when reading source frames from mp4 files, but is slower when I read from the cameras using nvarguscamerasrc (27.5 fps on average, with many visible lags). I expected the opposite, since reading from mp4 requires decoding and then copying frames from CPU to GPU memory, while nvargus writes input frames directly to NVMM.

Worse, I noticed that the observed lags with nvargus (sometimes up to 4 seconds!) are caused by my plugin receiving, and hence outputting, the same input frames several times. As a result, the lags are much more visible than the measured fps alone would suggest.

Is there something wrong in my nvargus-based pipeline?

Below are the pipelines used:

  • To read from mp4 (test pipeline):
gst-launch-1.0 -e  \
filesrc location=$path1 ! qtdemux ! h264parse ! omxh264dec ! nvvidconv ! video/x-raw\(memory:NVMM\),format=RGBA,width=3024,height=2280 ! mix. \
filesrc location=$path2 ! qtdemux ! h264parse ! omxh264dec ! nvvidconv ! video/x-raw\(memory:NVMM\),format=RGBA,width=3024,height=2280 ! mix. \
my_plugin name=mix ! \
video/x-raw\(memory:NVMM\),format=RGBA,width=6048,height=2280 ! \
nvvidconv ! video/x-raw\(memory:NVMM\),format=RGBA,width=1512,height=570 ! \
autovideosink sync=false
  • To read from cameras:
gst-launch-1.0 -e  \
nvarguscamerasrc sensor_id=0 ! video/x-raw\(memory:NVMM\),format=NV12,width=3024,height=2280,framerate=30/1 ! nvvidconv flip-method=2 ! video/x-raw\(memory:NVMM\),format=RGBA ! mix. \
nvarguscamerasrc sensor_id=1 ! video/x-raw\(memory:NVMM\),format=NV12,width=3024,height=2280,framerate=30/1 ! nvvidconv flip-method=2 ! video/x-raw\(memory:NVMM\),format=RGBA ! mix. \
my_plugin name=mix ! \
video/x-raw\(memory:NVMM\),format=RGBA,width=6048,height=2280 ! \
nvvidconv ! video/x-raw\(memory:NVMM\),format=RGBA,width=1512,height=570 ! \
autovideosink sync=false

Additional notes in case it helps:

  • my_plugin consumes 26.5 ms/frame on average, with peaks up to 31 ms. This is close to the theoretical limit for running at 30 fps. Yet, since I consistently reach 30 fps from mp4 files, my understanding is that the problem lies elsewhere.
  • The final downscaling to 1512*570 px in both pipelines above is only there to render on screen for debugging purposes; it will be replaced with another plugin later.
  • The reported fps are computed by my own plugin, based on timestamps and a frame counter incremented each time the pipeline calls it.

Here is my configuration:

  • Hardware: Jetson Nano Developer Kit, SoC: Tegra 210
  • JetPack: 4.6.1 (note we cannot update the JetPack, since our camera drivers are not supported by newer JetPack versions), L4T 32.7.1, Ubuntu 18.04.6 LTS
  • GStreamer version: 1.14.5
  • 2 identical cameras: IMX477-160, 12.3MPixels

Thanks.

hello ronan.danno1,

it looks like you have several video converters in your gst pipeline, which might be the root cause of the observed lags.
please try the pipeline below to render the preview stream to the display monitor directly.
for example,
$ gst-launch-1.0 nvarguscamerasrc sensor-id=0 ! 'video/x-raw(memory:NVMM),width=1920, height=1080, framerate=30/1, format=NV12' ! nvoverlaysink -ev

Hello JerryChang,

Thanks for your reply. The pipeline you propose works smoothly, even if I set width=3024, height=2280.

However, the pipeline reading from mp4 files has just as many calls to nvvidconv, and works well.
I can’t figure out why it is slower with nvargus than from mp4 (note my mp4 files also have a 3024*2280 resolution).

Is this something generally expected, or is it a problem specific to my case?

hello ronan.danno1,

you may narrow down the issue by using fpsdisplaysink to report the frame rate, to confirm which plugin causes the delay.
here’s an example that disables the preview and shows the frame rate only:
$ gst-launch-1.0 nvarguscamerasrc sensor-id=0 ! 'video/x-raw(memory:NVMM),width=1920, height=1080, framerate=30/1, format=NV12' ! nvvidconv ! 'video/x-raw(memory:NVMM),format=I420' ! fpsdisplaysink text-overlay=0 video-sink=fakesink sync=0 -v

another thought is…
do these two cameras have a hardware synchronization pin?
it looks like you have a left/right frame-combination use case; is it an out-of-sync scenario, such that it buffers while waiting for the left (or right) frame to arrive before rendering the output frame?

Hello JerryChang,

Thanks for your answer.

do these two cameras have a hardware synchronization pin?
it looks like you have a left/right frame-combination use case; is it an out-of-sync scenario, such that it buffers while waiting for the left (or right) frame to arrive before rendering the output frame?

No, they don’t. However, if I replace my_plugin with nvcompositor (the code of which I used to create my_plugin), I don’t experience such issue.

you may narrow down the issue by using fpsdisplaysink to report the frame-rate, to confirm which plugin cause the delay.

Thanks for the tip! The weird thing is that, even though my fps remains good (>26 fps, which should look visually OK), my_plugin appears to receive the same input images several times, and thus generates the same output image several times, even though its fps looks correct.

I could reproduce the problem with a basic CUDA kernel:

__global__ void cuda_process_kernel(
	unsigned char* pimg1Data, int32_t img1Width, int32_t img1Height, int32_t img1Pitch,
	unsigned char* pimg2Data, int32_t img2Width, int32_t img2Height, int32_t img2Pitch,
	unsigned char* pOutData, int32_t outWidth, int32_t outHeight, int32_t outPitch)
{
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;

	if (x >= outWidth || y >= outHeight)
		return;

	// copy image 1 left-hand side to output image left-hand side
	//if (x > img1Width / 3 && x < 2*img1Width/3 && y > img1Height/3 && y < 2* img1Height / 3) // Runs OK when I handle just a small part of the image
	if (x < outWidth / 2) // lags begin to appear when I uncomment this line and comment the previous one, although I still have 30fps.
	{
	    int xSrc = x * img1Width / outWidth;
	    int ySrc = y * img1Height / outHeight;
	    pOutData[y * outPitch + x * 4 + 0] = pimg1Data[ySrc * img1Pitch + xSrc * 4 + 0];
	    pOutData[y * outPitch + x * 4 + 1] = pimg1Data[ySrc * img1Pitch + xSrc * 4 + 1];
	    pOutData[y * outPitch + x * 4 + 2] = pimg1Data[ySrc * img1Pitch + xSrc * 4 + 2];
	    pOutData[y * outPitch + x * 4 + 3] = pimg1Data[ySrc * img1Pitch + xSrc * 4 + 3];
	}
	// copy image 2 right-hand side to output image right-hand side --> it is really worse if I uncomment this block
/*	else
	{
	    int xSrc = x * img2Width / outWidth;
	    int ySrc = y * img2Height / outHeight;
	    pOutData[y * outPitch + x * 4 + 0] = pimg2Data[ySrc * img2Pitch + xSrc * 4 + 0];
	    pOutData[y * outPitch + x * 4 + 1] = pimg2Data[ySrc * img2Pitch + xSrc * 4 + 1];
	    pOutData[y * outPitch + x * 4 + 2] = pimg2Data[ySrc * img2Pitch + xSrc * 4 + 2];
	    pOutData[y * outPitch + x * 4 + 3] = pimg2Data[ySrc * img2Pitch + xSrc * 4 + 3];
	}*/
}

This kernel is launched in function ProcessFrame as follows:

CUresult EGLImageToFrame(EGLImageKHR egl_image, CUgraphicsResource* pResource, CUeglFrame* pegl_frame, CUgraphicsMapResourceFlags mapFlags)
{
	CUresult cuResult;

	cuResult = cuGraphicsEGLRegisterImage(pResource, egl_image, mapFlags);
	if (cuResult != CUDA_SUCCESS) {
		//g_printerr("cuGraphicsEGLRegisterImage failed: %d: %s\n", cuResult, cudaGetErrorString(cuResult));
		HandleError(cuResult, __FILE__, __LINE__, false);
		return cuResult;
	}

	cuResult = HANDLE_ERROR_NO_ABORT(cuGraphicsResourceGetMappedEglFrame(pegl_frame, *pResource, 0, 0));
	if (cuResult != CUDA_SUCCESS) {
		g_printerr("cuGraphicsResourceGetMappedEglFrame failed.\n");
		return cuResult;
	}

	return cuResult;
}

CUresult ProcessFrame(gint dmabuf_fd1, gint dmabuf_fd2, gint out_dmabuf_fd)
{
	auto start = std::chrono::high_resolution_clock::now();
	CUresult res;

	// Get EGL images from dmabuf fds

	EGLImageKHR image1 = NvEGLImageFromFd(egl_display, dmabuf_fd1);
	if (image1 == EGL_NO_IMAGE_KHR)
	{
		g_printerr("NvEGLImageFromFd failed for image 1\n");
		return CUDA_ERROR_UNKNOWN;
	}

	EGLImageKHR image2 = NvEGLImageFromFd(egl_display, dmabuf_fd2);
	if (image2 == EGL_NO_IMAGE_KHR)
	{
		g_printerr("NvEGLImageFromFd failed for image 2\n");
		return CUDA_ERROR_UNKNOWN;
	}

	EGLImageKHR out_image = NvEGLImageFromFd(egl_display, out_dmabuf_fd);
	if (out_image == EGL_NO_IMAGE_KHR)
	{
		g_printerr("NvEGLImageFromFd failed for output image\n");
		return CUDA_ERROR_UNKNOWN;
	}

	// Get EGL frames from EGLImageKHR images
	cudaFree(0); // Required to use CUDA context in this thread

	CUeglFrame eglFrame1, eglFrame2, eglFrameOut;
	CUgraphicsResource pResource1 = NULL, pResource2 = NULL, pResourceOut = NULL;

	if ((res = EGLImageToFrame(image1, &pResource1, &eglFrame1, CU_GRAPHICS_MAP_RESOURCE_FLAGS_READ_ONLY)) != CUDA_SUCCESS)
	{
		g_printerr("EGLImageToFrame failed for image 1\n");
		return res;
	}

	if ((res = EGLImageToFrame(image2, &pResource2, &eglFrame2, CU_GRAPHICS_MAP_RESOURCE_FLAGS_READ_ONLY)) != CUDA_SUCCESS)
	{
		g_printerr("EGLImageToFrame failed for image 2\n");
		return res;
	}

	if ((res = EGLImageToFrame(out_image, &pResourceOut, &eglFrameOut, CU_GRAPHICS_MAP_RESOURCE_FLAGS_WRITE_DISCARD)) != CUDA_SUCCESS)
	{
		g_printerr("EGLImageToFrame failed for output image\n");
		return res;
	}

	CUresult cuResult = HANDLE_ERROR_NO_ABORT(cuCtxSynchronize());
	if (cuResult != CUDA_SUCCESS) {
		g_printerr("cuCtxSynchronize failed.\n");
		return cuResult;
	}

	// Check that all input frames are of the same size
	if (eglFrame1.width != eglFrame2.width || eglFrame1.height != eglFrame2.height)
	{
		g_printerr("Input images have different sizes: %d*%d vs %d*%d\n", eglFrame1.width, eglFrame1.height, eglFrame2.width, eglFrame2.height);
		return CUDA_ERROR_INVALID_VALUE;
	}

	// Process the frame
	if (eglFrame1.frameType == CU_EGL_FRAME_TYPE_PITCH &&
		eglFrame2.frameType == CU_EGL_FRAME_TYPE_PITCH &&
		eglFrameOut.frameType == CU_EGL_FRAME_TYPE_PITCH)
	{
		if (eglFrame1.eglColorFormat == CU_EGL_COLOR_FORMAT_ABGR &&
			eglFrame2.eglColorFormat == CU_EGL_COLOR_FORMAT_ABGR &&
			eglFrameOut.eglColorFormat == CU_EGL_COLOR_FORMAT_ABGR)
		{
			/* Launch CUDA kernel */
			// Define the execution configuration
			int outWidth = eglFrameOut.width;
			int outHeight = eglFrameOut.height;

			dim3 blockDim(16, 16);
			dim3 gridDim((outWidth + blockDim.x - 1) / blockDim.x, (outHeight + blockDim.y - 1) / blockDim.y);

			// Launch the kernel
			cuda_process_kernel<<<gridDim, blockDim>>>(
				(unsigned char*)eglFrame1.frame.pPitch[0], eglFrame1.width, eglFrame1.height, eglFrame1.pitch,
				(unsigned char*)eglFrame2.frame.pPitch[0], eglFrame2.width, eglFrame2.height, eglFrame2.pitch,
				(unsigned char*)eglFrameOut.frame.pPitch[0], eglFrameOut.width, eglFrameOut.height, eglFrameOut.pitch);

		}
		else
			g_printerr("Invalid eglcolorformat\n");
	}
	else
		g_printerr("Invalid frame type\n");

	if (cuCtxSynchronize() != CUDA_SUCCESS)
		g_printerr("cuCtxSynchronize failed\n");

	if (cuGraphicsUnregisterResource(pResource1) != CUDA_SUCCESS)
		g_printerr("cuGraphicsUnregisterResource for image 1 failed\n");

	if (cuGraphicsUnregisterResource(pResource2) != CUDA_SUCCESS)
		g_printerr("cuGraphicsUnregisterResource for image 2 failed\n");

	if (cuGraphicsUnregisterResource(pResourceOut) != CUDA_SUCCESS)
		g_printerr("cuGraphicsUnregisterResource for output image failed\n");

	// Release EGL images
	NvDestroyEGLImage(egl_display, image1);
	NvDestroyEGLImage(egl_display, image2);
	NvDestroyEGLImage(egl_display, out_image);

	// Measure time
        // ... (reported times and fps match the ones reported by `fpsdisplaysink`)
	return CUDA_SUCCESS;
}

It runs at 30 fps with the cameras, yet there are visible lags.
Can there be something wrong in my CUDA kernel, or in the way I launch it?

hello ronan.danno1,

I’m not a CUDA expert, but it seems you could use cuGraphicsEGLRegisterImage and cuGraphicsResourceGetMappedEglFrame to map the eglFrame to a GPU address.
and… did you add pthreads for a multi-threaded scenario?

Hello JerryChang,

Thanks for your answer.

I’m not a CUDA expert, but it seems you could use cuGraphicsEGLRegisterImage and cuGraphicsResourceGetMappedEglFrame to map the eglFrame to a GPU address.

Yes, this is what my EGLImageToFrame function actually does.

did you add pthreads for a multi-threaded scenario?

No, I do not use several concurrent threads to handle several frames in my code. My understanding is that I should not need to? I actually merely modified the do_nvcomposite() function from nvcompositor, replacing the call to NvBufferComposite with a call to my ProcessFrame function.

I don’t know if the following tests can help:
If I modify my kernel and uncomment the line:

if (x > img1Width / 3 && x < 2 * img1Width/3 && y > img1Height/3 && y < 2 * img1Height / 3)

and comment out the following one to make it “lighter”, I don’t experience lags. Yet, if I then add the following line:
std::this_thread::sleep_for(std::chrono::milliseconds(5));
after the cuCtxSynchronize() call to simulate a longer computation, my pipeline runs at 20 fps, yet the displayed frames seem to be updated at roughly 2 fps…
I can’t figure out what’s going on…

hello ronan.danno1,

could you please narrow down the issue: please try a simple buffer copy from your 1st camera.
for example, skip the left/right frame combination and just copy the camera frames to a buffer and render them to the display.

Hello JerryChang,

I can indeed reproduce it if I skip the kernel launch in my ProcessFrame function and instead use cudaMemcpy2D(), then sleep for 10 ms to simulate some processing, as follows:


	// Process the frame
	if (eglFrame1.frameType == CU_EGL_FRAME_TYPE_PITCH &&
		eglFrame2.frameType == CU_EGL_FRAME_TYPE_PITCH &&
		eglFrameOut.frameType == CU_EGL_FRAME_TYPE_PITCH)
	{
		if (eglFrame1.eglColorFormat == CU_EGL_COLOR_FORMAT_ABGR &&
			eglFrame2.eglColorFormat == CU_EGL_COLOR_FORMAT_ABGR &&
			eglFrameOut.eglColorFormat == CU_EGL_COLOR_FORMAT_ABGR)
		{
			// copy left image to output image, without launching a CUDA kernel
			cudaMemcpy2D((unsigned char*)eglFrameOut.frame.pPitch[0], eglFrameOut.pitch,
				(unsigned char*)eglFrame1.frame.pPitch[0], eglFrame1.pitch,
				eglFrame1.width * 4, eglFrame1.height, cudaMemcpyDeviceToDevice);
		}
		else
			g_printerr("Invalid eglcolorformat\n");
	}
	else
		g_printerr("Invalid frame type\n");

	if (cuCtxSynchronize() != CUDA_SUCCESS)
		g_printerr("cuCtxSynchronize failed\n");

	// sleep for 10 ms to simulate processing time. Displayed frames are updated roughly at 2 fps, while the plugin processes at 28 fps
	// If I comment this line, the display matches the processing speed, i.e. 30 or 31 fps
	std::this_thread::sleep_for(std::chrono::milliseconds(10));

	if (cuGraphicsUnregisterResource(pResource1) != CUDA_SUCCESS)
		g_printerr("cuGraphicsUnregisterResource for image 1 failed\n");
(etc...)

Without the sleep_for call, or if I set it to sleep for 5 ms, the display is smooth (i.e. updated at 30 fps, consistent with what I measure).
Yet, when I set it to sleep for 10 ms, there is a dramatic change: while the measured rate is still about 30 fps (28 fps, more precisely), the display shows many lags, which visually results in ~2 fps.
Is such behavior to be expected?

I could reproduce the problem with two IMX219 cameras.
I’ll prepare a stand-alone project so it can be easily reproduced.

please note that, according to the tests mentioned in comment #3 and comment #5,
since this cannot be reproduced by launching the two cameras via nvarguscamerasrc directly,
this is an issue in your CUDA code implementation.

Hi JerryChang,

Sorry for my delay.
Yes, I agree this would not occur if my plugin ran faster, which I’ve also been striving to achieve for a few weeks.

fpsdisplaysink shows the pipeline runs at 27-28 fps. This is lower than the 30 fps sent by the cameras, yet not far off. I would therefore expect the pipeline to skip 2-3 frames every second and still produce a visually smooth result. Yet, the result looks like the attached video (lagged_video_no_queue.mp4).

I tried to add a queue max-size-buffers=1 leaky=downstream after nvarguscamerasrc as follows, but this is not much better (see lagged_video_with_queue.mp4):

gst-launch-1.0 -e  \
nvarguscamerasrc sensor-id=0 ! video/x-raw\(memory:NVMM\),format=NV12,width=1920,height=1080,framerate=30/1 ! queue max-size-buffers=1 leaky=downstream ! \
nvvidconv ! video/x-raw\(memory:NVMM\),format=RGBA ! \
nvcompositor ! \          # <-- /!\ this is a recompiled version of nvcompositor, in which I added a g_usleep() to simulate processing time
video/x-raw\(memory:NVMM\),format=RGBA,width=6048,height=2280 ! \
nvvidconv ! video/x-raw\(memory:NVMM\),format=RGBA,width=1512,height=570 ! \ # this additional conversion adds load on the GPU, but is necessary for display
fpsdisplaysink text-overlay=0 video-sink=autovideosink sync=0 -v

Note: To make sure this is not related to some specificity of my own plugin implementation, I reproduced the problem with a “customized” nvcompositor (I mostly added a g_usleep), which you can find here if needed:
custom_nvcompositor.zip (37.5 KB)

Is there any chance to tweak my pipeline to have a “smooth” display?

It looks like the videos were not correctly uploaded. Here they are: