NvBufferMemSyncForCpu() issue?


In our system the MIPI inputs of the Tegra TX2 are fed by FPGAs which transfer 4:2:2 raw video frames and some extra data in 8 additional lines. The video input is handled through V4L2. The additional information contains:

  • frame counter (increments by 1 every frame)
  • timestamp of the frame
  • audio data (optional)

In the first version of our video processing SW we used L4T 28.1 and USERPTR V4L2 input buffers. The pipeline is the following:

V4L2 USERPTR buffer -> memcpy to one or more VIC output planes (MMAP) -> VIC convert to 4:2:0 + scale -> VIC capture plane (MMAP), CUDA -> ENC output plane (DMABUF) -> ENC capture plane (MMAP)

This worked as expected, except that we ran into a memory leak issue with libtegrav4l2.so. Therefore we moved to L4T 28.2. The modified pipeline:

V4L2 DMABUF -> map buffer to user space; sync to CPU; read additional data; sync to device; unmap buffer -> NvBufferTransform() + CUDA -> ENC output plane (DMABUF) -> ENC capture plane (MMAP)

This pipeline solves the memory leak issue and basically works, except for two issues, the second of which is critical:

  • The first 20 frames received after VIDIOC_DQBUF are all zeros. The number of input buffers queued to V4L2 is exactly 20. This happens with both the 28.1 pipeline and the new 28.2 pipeline when running on L4T 28.2, but does not happen when the 28.1 pipeline runs on L4T 28.1. If I increase the number of input buffers to 30, I get 30 zero frames. If this were the only issue, I could live with it.
  • From time to time there is a jump in the received frame counter and timestamp. This happens with the 28.2 pipeline on L4T 28.2, but does not happen with the 28.1 pipeline on L4T 28.2.
    Based on the second point, I would guess that the cause is related to cache invalidation when I read the V4L2 DMABUF, even though I call NvBufferMemSyncForCpu() after mapping the current input DMABUF to a user-space pointer. Mapping with NvBufferMem_Read or NvBufferMem_Read_Write makes no difference. It also makes no difference if I map all input buffers once during allocation (instead of mapping and unmapping at every frame).

I logged the V4L2 buffer index, V4L2 sequence counter, received frame counter and received timestamp. On L4T 28.2 with 28.2 pipeline it looks like this:

0	0	0	0
19	19	0	0
0	20	1	64233594815
1	21	22	64237101245
2	22	23	64237268061
3	23	24	64237434580
4	24	25	64237601396
5	25	26	64237767916
6	26	7	64234601378
7	27	8	64234767897
0	40	21	64236934725
1	41	22	64237101245
2	42	23	64237268061
6	1786	1787	64531269875
7	1787	1788	64531436394
8	1788	1789	64531603210
6	1806	1825	64537603247
7	1807	1788	64531436394
8	1808	1789	64531603210
9	1809	1790	64531769730
10	1810	1791	64531936546
11	1811	1792	64532103065
12	1812	1811	64535269899
13	1813	1812	64535436419

From the log:

  • The V4L2 buffer index is continuous in the range 0…19; this is fine.
  • V4L2 sequence counter increments by 1 every frame, this is also fine.
  • The received frame counter can jump backwards and forwards. This cannot be an issue in the FPGA transmitter (it cannot count down, and the same FPGA works with 28.1). When it jumps backwards, the data is the same as what was received previously, e.g. at sequence counter 1807 the data (received frame counter, timestamp) is the same as at sequence counter 1787. Both are read from input buffer 7. The situation is the same with forward jumps, e.g. sequence counters 21 and 41.
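The observations above can also be checked mechanically. The following is only a minimal sketch in Python, assuming the log is available as text with four whitespace-separated integer columns in the order given above (buffer index, sequence counter, received frame counter, received timestamp). The names Row, parse_log, find_repeated_payloads, find_counter_jumps and LOG_EXCERPT are hypothetical and are not part of the application or of any NVIDIA tool. The rows in LOG_EXCERPT are copied from the log above.

```python
from typing import Dict, List, NamedTuple, Tuple


class Row(NamedTuple):
    buf_index: int      # V4L2 buffer index
    sequence: int       # V4L2 sequence counter
    frame_counter: int  # frame counter embedded by the FPGA
    timestamp: int      # timestamp embedded by the FPGA


def parse_log(text: str) -> List[Row]:
    """Parse rows of four whitespace-separated integers; skip other lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 4:
            rows.append(Row(*(int(f) for f in fields)))
    return rows


def find_repeated_payloads(rows: List[Row]) -> List[Tuple[int, int]]:
    """Return (sequence, earlier_sequence) pairs where a buffer delivered the
    same non-zero (frame_counter, timestamp) payload as on an earlier dequeue."""
    seen: Dict[Tuple[int, int, int], int] = {}
    repeats = []
    for row in rows:
        if row.frame_counter == 0 and row.timestamp == 0:
            continue  # all-zero start-up frames are a separate symptom
        key = (row.buf_index, row.frame_counter, row.timestamp)
        if key in seen:
            repeats.append((row.sequence, seen[key]))
        else:
            seen[key] = row.sequence
    return repeats


def find_counter_jumps(rows: List[Row]) -> List[int]:
    """Return sequences where the V4L2 sequence advanced by exactly 1 between
    adjacent log rows but the received frame counter did not."""
    jumps = []
    for prev, cur in zip(rows, rows[1:]):
        if (cur.sequence - prev.sequence == 1
                and cur.frame_counter - prev.frame_counter != 1):
            jumps.append(cur.sequence)
    return jumps


# Rows copied from the log above (excerpt only, gaps are not filled in).
LOG_EXCERPT = """
0 20 1 64233594815
1 21 22 64237101245
2 22 23 64237268061
0 40 21 64236934725
1 41 22 64237101245
2 42 23 64237268061
6 1786 1787 64531269875
7 1787 1788 64531436394
8 1788 1789 64531603210
6 1806 1825 64537603247
7 1807 1788 64531436394
8 1808 1789 64531603210
"""

rows = parse_log(LOG_EXCERPT)
print(find_repeated_payloads(rows))
print(find_counter_jumps(rows))
```

On this excerpt every repeated payload shows up exactly 20 sequence numbers after its first occurrence and in the same buffer index, which matches the number of queued input buffers. A check like this only shows that the CPU read identical bytes from a buffer on two dequeues; it cannot tell whether the cause is stale CPU cache contents or a buffer that was not rewritten.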

As the OV5693 does not support a test pattern that changes per frame, I have no idea how I could easily reproduce the issue on the Jetson dev board. Our release schedule is getting very tight, so any help is much appreciated.

One more related question (just out of curiosity). In 12_camera_v4l2_cuda, NvBufferMemSyncForDevice() is called between VIDIOC_DQBUF and NvBufferTransform(). Why is this synchronization required when the CSI camera input DMAs to memory and the VIC reads from the same memory? Do the CSI interface and/or the VIC have their own caches that have to be flushed or invalidated?

Thanks and regards.

Did you apply the patch from the link below?


NvBufferMemSyncForDevice should be a memory sync between the CPU and GPU; it shouldn't be related to VI/CSI DMA.

Hi ShaneCCC,

I have the “[MMAPI]Cannot run NvVideoDecoder in loop/Memory leak in NvVideoEncoder” patch applied, as it is required to fix our memory leak issue.

Could you apply the patch below and add "--set-ctrl low_latency_mode=1" to your capture command line?

2366750_Oct02_latency-improvements.tar.gz (20.5 KB)

Hi ShaneCCC,

I guess you were referring to this one:

I will give it a try.

Right, the patch was from Jerry.
Could you also run jetson_clocks.sh during the test?