Gstreamer lag increases with frame rate when recording from a Raspberry Pi camera (nvarguscamerasrc and nvivafilter)

Thanks for clarifying that you have optimised the stack. That said, something is buffering like heck in my gstreamer pipeline. I will present how I am testing latency and then provide results.

I am using the nvivafilter gstreamer plugin to run my own CUDA C++ code; this CUDA code is based on the nvsample_cudaprocess.cu source file provided by Nvidia.

My main() function sets up a gstreamer pipeline as follows. Note that there are two pipelines that I am testing: one of them (str2) saves the video to disk while the other (str1) does not. I find that the frame rate is much reduced and the latency much increased when I save video to disk at 120fps. At 30fps I do not see any frame-rate reduction or added latency beyond the expected one frame. If only that were the case at 120fps…

char str1[] = "nvarguscamerasrc exposurecompensation=1 gainrange='8 16' ! video/x-raw(memory:NVMM), width=(int)1280, height=(int)720, format=(string)NV12, framerate=(fraction)120/1 ! nvivafilter cuda-process=true pre-process=false post-process=false customer-lib-name=libnvsample_cudaprocess.so ! video/x-raw(memory:NVMM), format=(string)NV12 ! fakesink";
char str2[] = " nvarguscamerasrc exposurecompensation=1 gainrange='8 16' ! video/x-raw(memory:NVMM), width=(int)1280, height=(int)720, format=(string)NV12, framerate=(fraction)120/1 ! nvivafilter cuda-process=true pre-process=false post-process=false customer-lib-name=libnvsample_cudaprocess.so ! video/x-raw(memory:NVMM), format=(string)NV12 ! omxh264enc ! qtmux ! filesink location=test.mov";
GstBus *bus = NULL;
GstElement *pipeline  = NULL;
GstMessage *msg  = NULL;

pipeline = gst_parse_launch(strX, NULL);   /* strX is either str1 or str2 */
bus = gst_element_get_bus(pipeline);
gst_element_set_state(pipeline, GST_STATE_PLAYING);
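
For reference, str2 can also be run straight from a shell with gst-launch-1.0 (same elements and properties, just re-quoted for the shell; -e makes qtmux finalise the file on Ctrl-C):

gst-launch-1.0 -e nvarguscamerasrc exposurecompensation=1 gainrange="8 16" \
  ! 'video/x-raw(memory:NVMM), width=1280, height=720, format=NV12, framerate=120/1' \
  ! nvivafilter cuda-process=true pre-process=false post-process=false customer-lib-name=libnvsample_cudaprocess.so \
  ! 'video/x-raw(memory:NVMM), format=NV12' \
  ! omxh264enc ! qtmux ! filesink location=test.mov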

The pseudo code for the .so file that runs the CUDA code is similar to this:


#define FLASH_FRAME   100

void gpu_process(void)
{
   static int frameCount = 0;
   
   if(frameCount == FLASH_FRAME)
   {
      // Start a thread that toggles the GPIO to flash a bright white LED against a wall. Save the system time in microseconds using gettimeofday().
   }
   
   // Run CUDA code to analyse the video frame to see if the image has increased in brightness.

   if(frameCount > FLASH_FRAME)
   {
       // Check if the CUDA code discovered an increase in brightness.
       // If it did, save the system time in microseconds using gettimeofday() and measure the delay and print this plus the average FPS since the first time the function was called. Quit the program.
   }
   
   frameCount++;
}
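
To make the pseudo code above a little more concrete, here is a rough C++ sketch of the measurement. toggle_gpio_high() and analyse_frame_brightness() are placeholders for my real GPIO and CUDA code, not actual APIs:

#include <cstdio>
#include <pthread.h>
#include <sys/time.h>

#define FLASH_FRAME   100

static struct timeval flashTime;        /* when the GPIO was toggled           */
static struct timeval firstFrameTime;   /* when gpu_process() was first called */

static void *flash_thread(void *arg)
{
    gettimeofday(&flashTime, NULL);
    /* toggle_gpio_high();   placeholder: drive the GPIO to fire the white LED */
    return NULL;
}

void gpu_process(void)   /* simplified; the real function receives the frame from nvivafilter */
{
    static int frameCount = 0;

    if (frameCount == 0)
        gettimeofday(&firstFrameTime, NULL);

    if (frameCount == FLASH_FRAME)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, flash_thread, NULL);   /* fire the flash asynchronously */
    }

    /* Run the CUDA analysis of the current frame (placeholder). */
    bool brighter = false;   /* = analyse_frame_brightness(frame); */

    if (frameCount > FLASH_FRAME && brighter)
    {
        struct timeval now;
        gettimeofday(&now, NULL);

        double latencyMs = (now.tv_sec  - flashTime.tv_sec)  * 1000.0 +
                           (now.tv_usec - flashTime.tv_usec) / 1000.0;
        double elapsedS  = (now.tv_sec  - firstFrameTime.tv_sec) +
                           (now.tv_usec - firstFrameTime.tv_usec) / 1e6;

        printf("flash-to-frame latency: %.1f ms, average fps: %.1f\n",
               latencyMs, frameCount / elapsedS);
        /* quit the program here */
    }

    frameCount++;
}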

As I said before, at 30fps everything is fine: both str1 and str2 deliver a real 30fps with about 33ms of delay from the GPIO toggling the flash to the flash appearing in the video. I trust the CUDA algorithm to detect the brightness change correctly because I also draw a small rectangle with CUDA when I toggle the GPIO, so I can check the saved footage for that rectangle as well as for the increased brightness caused by the flash (usually visible in the next frame at 30fps).

The problem is that at 120fps I get very disappointing results. I am monitoring the CPU during these tests and no single core ever goes above 60% utilisation. I see significant latency (i.e. lag) when using str2 (encoding and saving H264 video) at 120fps, and if I wait longer to run the flash test, i.e. increase FLASH_FRAME, the latency increases too. str2 results in only 50-60 frames per second being processed, not the 120 (or thereabouts) that I would expect, yet the saved video plays back at 113fps, which is confusing.

It is as if a traffic jam is building up as time goes on. As you can see earlier in the thread, I have tried many different things to reduce this latency, but at 120fps I struggle to get good results, which is really disappointing and makes my end product (a computer-vision drone application) ineffective.

The results are thus:
- if FLASH_FRAME = 1 (i.e. the first frame), I see about 100ms latency
- if FLASH_FRAME = 100, I see about 200ms latency
- for FLASH_FRAME > 100 there is no further increase in latency

i.e. in the worst case, there are 200ms' worth of frames being buffered somewhere in the pipeline.

The test environment is more complicated than described, though. In actuality, I first have a CUDA algorithm running that is based on the 12_camera_v4l2_cuda example Nvidia provides with L4T (it grabs frames from a USB webcam and processes them with CUDA). The gstreamer pipeline is running at the same time, but I use a boolean flag to prevent the CUDA code in the gstreamer pipeline from running until the 12_camera_v4l2_cuda code has properly stopped, roughly as sketched below. Yes, I am running jetson_clocks.
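
The hand-over is nothing clever, essentially just a flag like this (a simplified sketch using std::atomic; my real code may differ):

#include <atomic>

/* Set to true only once the 12_camera_v4l2_cuda part has fully stopped. */
static std::atomic<bool> gst_cuda_enabled{false};

/* Called from the USB-webcam/V4L2 side after it has shut down cleanly. */
void enable_gstreamer_cuda(void)
{
    gst_cuda_enabled.store(true);
}

/* At the top of gpu_process():
   if (!gst_cuda_enabled.load())
       return;    // skip all CUDA work until the hand-over
*/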

Sorry for the long post, but I hope it describes in detail the problem I am experiencing.

Best regards and thanks for reading.

I think this is a key piece of the puzzle. The saved video is definitely about 110fps; I know this because I filmed a stopwatch. HOWEVER, only about 57fps are passed to the gpu_process() function of libnvsample_cudaprocess.so by nvivafilter.

I believe (one of) the problem(s) is with nvivafilter which appears to be only passing half the frames to my CUDA code.

I say “one of” because while this would increase the latency, it doesn’t account for such a large delay as 200ms.

Edit: it is entirely possible that the latency could also be accounted for in nvivafilter, a gstreamer element that I do not have the source code to.

Have you tried profiling your CUDA kernel with nvprof?


“Unified memory profiling is not supported on the Jetson due to the different memory management system.”

i.e. nvprof will not work on Jetson.
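
Timing sections of GPU work by hand does still work on Jetson though, e.g. by wrapping the launches in CUDA events. A minimal sketch (not necessarily how I obtained the numbers further down):

#include <cstdio>
#include <cuda_runtime.h>

/* Wrap a stretch of GPU work in CUDA events and print the elapsed time.
   launch() is whatever function launches the kernel(s) under test. */
static float time_gpu_section(void (*launch)(void))
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    launch();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);     /* wait for the GPU work to finish */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU section took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}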

Thanks for the suggestion though.

At a gstreamer-requested frame rate of 120fps while saving H264 to disk, I did some more digging and found that although my algorithm (CPU+GPU) takes only about 6.5ms, there are other things to account for. The call in gpu_process() to cuCtxSynchronize() takes 8.2ms, which is a massive delay.

But guess what? If I do not save H264 to disk and request 120fps from gstreamer, I see that the call to cuCtxSynchronize() takes 0ms!!! Clearly the H264 saving is affecting cuCtxSynchronize().

gpu_process() is my CPU+GPU algorithm based on Nvidia sample code (nvsample_cudaprocess.cu) that processes the frame passed to it by the nvivafilter gstreamer element.

All in all, gpu_process() is taking up to 17ms, which by itself limits the achievable frame rate to roughly 60fps (1s / 17ms ≈ 59 frames). Frame capture itself at 120fps only takes 1.8ms.

This raises two questions regarding the 120fps gstreamer pipeline that processes frames with CUDA (nvivafilter) and saves the video to disk as H264:

  1. As always, why is there an almost 200ms lag, caused mainly by nvivafilter, between the camera capturing a frame and that frame being passed to the gpu_process() CUDA code?
  2. Is there a way to reduce the time taken by cuCtxSynchronize() in gpu_process()? This time is massively increased when saving H264 but almost zero when not saving H264.

Hi,
Are you able to try the 720p 60fps mode? On Jetson Nano + Raspberry Pi camera V2 we enable 720p60. Due to stability and performance concerns, the 120fps mode is not enabled in the sensor modes.


Hi, I just did 2 tests at 60fps.

The first test did not save H264: just CUDA processing to detect the LED flash, ending in fakesink (str1 above but with the framerate caps changed to 60/1). My code (CPU+GPU) in gpu_process() took 5.6ms. Everything else in gpu_process() took 0.7ms, with only 0.1ms for cuCtxSynchronize(). A frame rate of 60fps was achieved with a real-life-to-CUDA latency of 22ms, which is basically one frame of lag.
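
Concretely, the only change for the 60fps tests was the framerate field in the caps of str1/str2:

... ! video/x-raw(memory:NVMM), width=(int)1280, height=(int)720, format=(string)NV12, framerate=(fraction)60/1 ! ...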

The second test was the same as above but did save H264 to disk (i.e. str2). My code took 5.8ms (similar to the first test), but suddenly cuCtxSynchronize() takes 11.7ms, which reduces the frame rate to 52fps. Not only that, but the latency jumps up to 187ms.

Edit: to be clear, I am referring to the second call to cuCtxSynchronize() which is after the CUDA code has run. The first cuCtxSynchronize() call in gpu_process() is very fast.
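
For reference, the layout of gpu_process() in nvsample_cudaprocess.cu is roughly the following (from memory, with error checking stripped), which is what I mean by the first and second cuCtxSynchronize():

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <cuda.h>
#include <cudaEGL.h>

static void gpu_process(EGLImageKHR image, void **usrptr)
{
    CUgraphicsResource pResource = NULL;
    CUeglFrame eglFrame;

    /* Map the EGLImage handed over by nvivafilter into CUDA. */
    cuGraphicsEGLRegisterImage(&pResource, image, CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
    cuGraphicsResourceGetMappedEglFrame(&eglFrame, pResource, 0, 0);

    cuCtxSynchronize();   /* first sync: always fast in my tests */

    /* my CUDA kernels run here on eglFrame.frame.pPitch[...] */

    cuCtxSynchronize();   /* second sync: ~0ms normally, ~12ms while encoding/saving H264 */

    cuGraphicsUnregisterResource(pResource);
}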

As I said before, I believe the latency is caused by nvivafilter, and clearly there is some interaction between the saving of H264* and cuCtxSynchronize().

Thanks.

*disk access is not the cause as discussed previously

Hi, just posting a follow up since I have some interesting information.

My CUDA code was drawing a small rectangle onto the incoming video as well as analysing the data for the LED flash. I thought that maybe the act of writing to the captured camera buffer was the cause of the latency.

Well, it turns out I was right.

By removing any writes to the buffer and only reading from it, the second cuCtxSynchronize() in gpu_process() takes basically 0ms, even at 120fps.

This is a massive discovery for me… it is annoying that I cannot write to the buffer because I am doing computer vision and wish to annotate the video so that the user can debug what the algorithm was doing, but at least I get saved video with minimal lag.

I’d still welcome expert feedback from Nvidia on whether it is possible to write to the buffer with low latency and how to increase the speed of the second cuCtxSynchronize() in gpu_process().

Edit: my CUDA code was split into two kernels that run in series. The first kernel analysed the video for the flash and coloured some pixels (Y plane only) as it read each pixel. The second kernel simply drew the outline of a rectangle of about 100x100 pixels in all three planes, Y, U and V. By removing the second kernel I was able to see the vast improvements mentioned above. I can still write to pixels in the first kernel without affecting latency, which is weird. Maybe it is an issue with writing to all colour planes?
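
For completeness, the removed second kernel was essentially just a pitch-linear write of a rectangle outline, launched once per plane, along these lines (a simplified reconstruction, not my exact kernel):

/* Draw the 1-pixel outline of a w x h rectangle into one pitch-linear plane. */
__global__ void draw_rect_outline(unsigned char *plane, int pitch,
                                  int x0, int y0, int w, int h,
                                  unsigned char value)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h)
        return;

    if (x == 0 || y == 0 || x == w - 1 || y == h - 1)
        plane[(size_t)(y0 + y) * pitch + (x0 + x)] = value;
}

/* e.g. a 100x100 outline in the Y plane at (100, 100):
   draw_rect_outline<<<dim3(13, 13), dim3(8, 8)>>>(yPlane, yPitch, 100, 100, 100, 100, 255); */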