TX2 is extremely slow processing video frames from the v4l2 interface (onboard camera)

rong1129 · October 28, 2017, 7:04am

We’ve identified an interesting performance problem. When we ran the following function on each incoming frame stored in a V4L2 mmap-ed buffer (V4L2_MEMORY_MMAP-based capture), a single 1080p frame (1920x1080x2 bytes) took a whopping ~311ms to convert. This only happens when the code runs on a Cortex core, regardless of core frequency; if we set the CPU affinity to the Denver cores, the same function takes only ~27ms per frame (see the command line log at the end).

Further tests show it’s to do with directly using the mmap-ed buffer in this function as ‘src’. If we copy the mmap-ed buffer to a pre-allocated user-space buffer and use that buffer in the convert function, the conversion time immediately drops to 16ms on any Cortex core.
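The copy-first variant described above can be sketched roughly as follows. This is only an illustration, not the actual test program: `convert_via_bounce_buffer`, `uyvy_to_bgr24` and `clip_u8` are hypothetical names, `clip_u8` stands in for the library's CLIP macro, and the shifts from the original function are written as equivalent multiplications. The `mapped` pointer is assumed to be the address returned by mmap() for a dequeued V4L2 buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Clamp to 0..255; hypothetical stand-in for the CLIP macro. */
static unsigned char clip_u8(int v)
{
    return (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Same arithmetic as the conversion function quoted in this post. */
static void uyvy_to_bgr24(const unsigned char *src, unsigned char *dest,
                          int width, int height, int stride)
{
    while (--height >= 0) {
        for (int j = 0; j + 1 < width; j += 2) {
            int u = src[0] - 128;
            int v = src[2] - 128;
            int u1 = (u * 129) >> 6;          /* blue offset  */
            int rg = (u * 3 + v * 6) >> 3;    /* green offset */
            int v1 = (v * 3) >> 1;            /* red offset   */

            *dest++ = clip_u8(src[1] + u1);
            *dest++ = clip_u8(src[1] - rg);
            *dest++ = clip_u8(src[1] + v1);
            *dest++ = clip_u8(src[3] + u1);
            *dest++ = clip_u8(src[3] - rg);
            *dest++ = clip_u8(src[3] + v1);
            src += 4;
        }
        src += stride - width * 2;
    }
}

/*
 * Copy the frame out of the mmap-ed capture buffer into ordinary
 * user-space memory first, then convert from the copy.
 * 'bounce' must hold at least stride * height bytes and should be
 * allocated once, outside the capture loop.
 */
static void convert_via_bounce_buffer(const unsigned char *mapped,
                                      unsigned char *bounce,
                                      unsigned char *dest,
                                      int width, int height, int stride)
{
    assert(mapped != NULL && bounce != NULL && dest != NULL);
    memcpy(bounce, mapped, (size_t)stride * (size_t)height);
    uyvy_to_bgr24(bounce, dest, width, height, stride);
}
```

The only difference from the slow path is the single sequential memcpy(); the per-byte reads in the conversion loop then hit normal cached memory instead of the driver's mapping.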

Furthermore, this only happens with the onboard camera, and therefore with the NVidia V4L2 and nvhost drivers. If we use a USB camera, the conversion time is on par with the copy version.

We see the same issue in both L4T R27.1 and R28.1. Does anyone know what might cause it? We are speculating about cache invalidation, but there’s no obvious conclusion to draw here, as cache invalidation should happen in both the copy and copy-less versions.

BTW, the function is found in the v4l-utils library, which is used by OpenCV for interfacing with V4L2 cameras. We initially noticed this issue when running an OpenCV-based capture.

static void v4lconvert_uyvy_to_bgr24(const unsigned char *src, unsigned char *dest,
                int width, int height, int stride)
{
        int j;

        while (--height >= 0) {
                for (j = 0; j + 1 < width; j += 2) {
                        int u = src[0];
                        int v = src[2];
                        int u1 = (((u - 128) << 7) +  (u - 128)) >> 6;
                        int rg = (((u - 128) << 1) +  (u - 128) +
                                        ((v - 128) << 2) + ((v - 128) << 1)) >> 3;
                        int v1 = (((v - 128) << 1) +  (v - 128)) >> 1;

                        *dest++ = CLIP(src[1] + u1);
                        *dest++ = CLIP(src[1] - rg);
                        *dest++ = CLIP(src[1] + v1);

                        *dest++ = CLIP(src[3] + u1);
                        *dest++ = CLIP(src[3] - rg);
                        *dest++ = CLIP(src[3] + v1);
                        src += 4;
                }
                src += stride - width * 2;
        }
}

The command line log:

nvidia@tegra-ubuntu:~$ sudo nvpmodel -m 0
nvidia@tegra-ubuntu:~$ sudo ~/jetson_clocks.sh
nvidia@tegra-ubuntu:~$ taskset -c 1 ./uyvy2bgr24
frames 30
avg read time 0ms
avg conv time 27ms
avg write time 0ms
nvidia@tegra-ubuntu:~$ taskset -c 0 ./uyvy2bgr24
frames 30
avg read time 0ms
avg conv time 311ms
avg write time 0ms

snarky · October 28, 2017, 4:10pm

I don’t think the function specifics matter much. It would be more interesting to see the code that opens the device and maps the memory, to be honest. It’s much more likely, as you say, that something’s going on with the memory coherency – perhaps each byte fetch ends up being a separate atomic memory transaction or something like that.

Also, what is the performance if you use the read() interface instead of mmap()?

If you have a work-around by using memcpy() or by assigning affinity, it sounds to me as if that’ll be the fastest way forward, because if it’s memory bus related, there may not be a software fix, and if there is a software fix, it may require a new kernel version … But I'm just speculating here; the nvidia folks ought to know more about what’s going on, and I’m super curious to hear!
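For the affinity route, a program can pin itself rather than rely on taskset. Below is a minimal sketch using the Linux sched_setaffinity() call; `pin_to_cpu` is a hypothetical helper name, and which CPU index maps to which core type is an assumption to verify on the board (the log above used `taskset -c 1` for the fast run).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes cpu_set_t, CPU_SET and sched_setaffinity in glibc */
#endif
#include <sched.h>

/*
 * Pin the calling thread to a single CPU.
 * Returns 0 on success, -1 on failure (out-of-range index or
 * sched_setaffinity() error, in which case errno is set).
 */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;

    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* pid 0 means "the calling thread". */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Calling `pin_to_cpu(1)` before the capture loop would be the in-process equivalent of `taskset -c 1`, assuming CPU 1 is online and allowed for the process.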

rong1129 · October 28, 2017, 8:17pm

This issue exists in the opencv video capture function (we tested both 3.1.0 and 3.3.0), which internally uses the v4l-utils library for accessing v4l2 capture devices. Opencv requests mmap-ed capture and frame conversion to bgr24 in one call to v4l-utils. To further isolate the problem, we also took a simple v4l2 capture example (v4l2 capture example · GitHub) and added the conversion function above; the same problem exists. BTW, the NVidia driver doesn’t seem to support direct read().

We think this problem is worth looking into, as it affects the straight out-of-box opencv capture function and any application that uses libv4l2.

Honey_Patouceul · October 28, 2017, 9:47pm

Do you get the same timings on A57 cores other than CPU0? I'm not sure, but I think CPU0 has to handle interrupts from the interrupt controller.
