In our system the MIPI inputs of the Tegra TX2 are fed by FPGAs which transfer 4:2:2 raw video frames and some extra data in 8 additional lines. The video input is handled through V4L2. The additional information contains:
- frame counter (increments by 1 every frame)
- timestamp of the frame
- audio data (optional)
In the first version of our video processing SW we used L4T 28.1 and USERPTR V4L2 input buffers. The pipeline is the following:
V4L2 USERPTR buffer -> memcpy to 1 or multiple VIC output planes (MMAP) -> VIC convert to 4:2:0 + scale -> VIC capture plane (MMAP), CUDA -> ENC output plane (DMABUF) -> ENC capture plane (MMAP)
This worked as expected, except that we ran into a memory leak issue with libtegrav4l2.so. Therefore we moved to L4T 28.2. The modified pipeline:
V4L2 DMABUF -> map buffer to user space; sync to CPU; read additional data; sync to device; unmap buffer -> NvBufferTransform() + CUDA -> ENC output plane (DMABUF) -> ENC capture plane (MMAP)
This pipeline solves the memory leak issue and basically works, except for two issues, the second of which is critical:
- The first 20 frames received after VIDIOC_DQBUF are all zeros. The number of input buffers queued to V4L2 is exactly 20, and if I increase it to 30, I get 30 zero frames. This happens with both the 28.1 pipeline and the new 28.2 pipeline running on L4T 28.2, but not with the 28.1 pipeline running on L4T 28.1. If this were the only issue, I could live with it.
- From time to time there is a jump in the received frame counter and timestamp. This happens with the 28.2 pipeline on L4T 28.2, but does not happen with the 28.1 pipeline on L4T 28.2.
Based on the second point, I would guess that the cause is related to cache invalidation when I read the V4L2 DMABUF, even though I call NvBufferMemSyncForCpu() after mapping the current input DMABUF to a user pointer. Mapping with NvBufferMem_Read or NvBufferMem_Read_Write makes no difference. It also makes no difference if I map all input buffers once during allocation instead of mapping and unmapping on every frame.
I logged the V4L2 buffer index, V4L2 sequence counter, received frame counter and received timestamp. On L4T 28.2 with 28.2 pipeline it looks like this:
0 0 0 0
...
19 19 0 0
0 20 1 64233594815
1 21 22 64237101245
2 22 23 64237268061
3 23 24 64237434580
4 24 25 64237601396
5 25 26 64237767916
6 26 7 64234601378
7 27 8 64234767897
...
0 40 21 64236934725
1 41 22 64237101245
2 42 23 64237268061
...
6 1786 1787 64531269875
7 1787 1788 64531436394
8 1788 1789 64531603210
...
6 1806 1825 64537603247
7 1807 1788 64531436394
8 1808 1789 64531603210
9 1809 1790 64531769730
10 1810 1791 64531936546
11 1811 1792 64532103065
12 1812 1811 64535269899
13 1813 1812 64535436419
From the log:
- The V4L2 buffer index is continuous in the range 0…19, this is fine.
- V4L2 sequence counter increments by 1 every frame, this is also fine.
- The received frame counter can jump backwards and forwards. This cannot be an issue in the FPGA transmitter (it cannot count down, and the same FPGA works with 28.1). When it jumps backwards, the data is the same as what was received previously: e.g. at sequence counter 1807 the data (received frame counter, timestamp) is the same as at sequence counter 1787, and both are read from input buffer 7. The situation is the same with forward jumps, e.g. sequence counters 21 and 41.
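To make these observations easy to check on longer captures, here is a small offline sketch in Python that scans a log in the four-column format above (buffer index, V4L2 sequence, received frame counter, received timestamp). The function names (parse_rows, find_repeated_payloads, find_counter_jumps) are made up for this illustration and are not part of any NVIDIA or V4L2 API. It assumes one row per line and skips elided "..." lines; counter jumps are only evaluated between rows whose V4L2 sequence numbers are adjacent.

```python
from collections import defaultdict

# Excerpt of the log above: buffer index, V4L2 sequence, frame counter, timestamp.
LOG = """\
0 0 0 0
...
19 19 0 0
0 20 1 64233594815
1 21 22 64237101245
2 22 23 64237268061
3 23 24 64237434580
4 24 25 64237601396
5 25 26 64237767916
6 26 7 64234601378
7 27 8 64234767897
...
0 40 21 64236934725
1 41 22 64237101245
2 42 23 64237268061
...
6 1786 1787 64531269875
7 1787 1788 64531436394
8 1788 1789 64531603210
...
6 1806 1825 64537603247
7 1807 1788 64531436394
8 1808 1789 64531603210
9 1809 1790 64531769730
10 1810 1791 64531936546
11 1811 1792 64532103065
12 1812 1811 64535269899
13 1813 1812 64535436419
"""


def parse_rows(text):
    """Return (buffer, sequence, counter, timestamp) tuples; non-data lines are ignored."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 4 and all(f.isdigit() for f in fields):
            rows.append(tuple(int(f) for f in fields))
    return rows


def find_repeated_payloads(rows):
    """Map (counter, timestamp) -> [(sequence, buffer), ...] for payloads seen more than once."""
    seen = defaultdict(list)
    for buf, seq, counter, ts in rows:
        if counter == 0 and ts == 0:
            continue  # all-zero startup frames are a separate issue
        seen[(counter, ts)].append((seq, buf))
    return {key: hits for key, hits in seen.items() if len(hits) > 1}


def find_counter_jumps(rows):
    """Return (sequence, previous_counter, counter) where the counter did not advance by 1."""
    jumps = []
    for (_, s0, c0, _), (_, s1, c1, _) in zip(rows, rows[1:]):
        if s1 == s0 + 1 and c1 != c0 + 1:  # only compare adjacent sequence numbers
            jumps.append((s1, c0, c1))
    return jumps


if __name__ == "__main__":
    rows = parse_rows(LOG)
    for (counter, ts), hits in find_repeated_payloads(rows).items():
        print("payload", counter, ts, "seen at (sequence, buffer):", hits)
    for seq, prev, cur in find_counter_jumps(rows):
        print("jump at sequence", seq, ":", prev, "->", cur)
```

On the excerpt, every repeated payload is reported on the same buffer index both times, which is what one would expect if a buffer's contents are read either before or after the frame that belongs to that sequence number.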
As the OV5693 does not support a test pattern that changes on every frame, I have no idea how I could easily reproduce the issue on the Jetson dev board. Our release schedule is getting really tight, so any help is much appreciated.
One more related question (just out of curiosity). In 12_camera_v4l2_cuda, NvBufferMemSyncForDevice() is called between VIDIOC_DQBUF and NvBufferTransform(). Why is this synchronization required when the CSI camera input DMAs to memory and the VIC reads from the same memory? Do the CSI interface and/or the VIC have their own caches that have to be flushed/invalidated?
Thanks and regards.