TX2 is extremely slow processing video frames from the v4l2 interface (onboard camera)

We’ve identified an interesting performance problem. When we ran the following function on each incoming frame stored in a V4L2 mmap-ed buffer (using V4L2_MEMORY_MMAP-based capture), a single 1080p frame (1920x1080x2 bytes) took a whopping ~311ms to convert. This only happens when the code runs on a Cortex-A57 core, regardless of core frequency. If we set the CPU affinity to the Denver cores, the same function takes only ~27ms per frame (see the command line log at the end).

Further tests show that the slowdown comes from using the mmap-ed buffer directly as ‘src’ in this function. If we first copy the mmap-ed buffer to a pre-allocated user-space buffer and pass that buffer to the convert function, the conversion time immediately drops to 16ms on any Cortex core.
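For anyone who wants to try the copy workaround, here is a minimal sketch of what we mean. The helper name stage_frame and the staging buffer are made up for illustration, and the buffer is assumed to be allocated once (stride * height bytes) outside the capture loop.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy one frame out of the mmap-ed V4L2 buffer into an ordinary
 * user-space staging buffer and return a pointer to the copy.
 * stage_frame is a hypothetical helper, not part of v4l-utils.
 * `staging` must hold at least stride * height bytes.
 */
static const unsigned char *stage_frame(const unsigned char *mmap_src,
                                        unsigned char *staging,
                                        int stride, int height)
{
        /* One bulk copy replaces the byte-by-byte reads from the mapping. */
        memcpy(staging, mmap_src, (size_t)stride * (size_t)height);
        return staging;
}
```

In the capture loop the converter is then called on the copy instead of the mapping, e.g. v4lconvert_uyvy_to_bgr24(stage_frame(buf, staging, stride, height), bgr, width, height, stride), where buf and bgr stand for the dequeued buffer and the output image.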

Furthermore, this only happens with the onboard camera, and thus with the NVIDIA V4L2 and nvhost drivers. If we use a USB camera, the conversion time is on par with the copy version.

We see the same issue in both L4T R27.1 and R28.1. Does anyone know what might be causing it? We are speculating about cache invalidation, but there’s no obvious conclusion to draw here, as cache invalidation should happen in both the copy and copy-less versions.

BTW, the function is found in the v4l-utils library, which is used by OpenCV for interfacing with V4L2 cameras. We initially noticed this issue when running an OpenCV-based capture.

static void v4lconvert_uyvy_to_bgr24(const unsigned char *src, unsigned char *dest,
                int width, int height, int stride)
{
        int j;

        while (--height >= 0) {
                for (j = 0; j + 1 < width; j += 2) {
                        int u = src[0];
                        int v = src[2];
                        int u1 = (((u - 128) << 7) +  (u - 128)) >> 6;
                        int rg = (((u - 128) << 1) +  (u - 128) +
                                        ((v - 128) << 2) + ((v - 128) << 1)) >> 3;
                        int v1 = (((v - 128) << 1) +  (v - 128)) >> 1;

                        *dest++ = CLIP(src[1] + u1);
                        *dest++ = CLIP(src[1] - rg);
                        *dest++ = CLIP(src[1] + v1);

                        *dest++ = CLIP(src[3] + u1);
                        *dest++ = CLIP(src[3] - rg);
                        *dest++ = CLIP(src[3] + v1);
                        src += 4;
                }
                src += stride - width * 2;
        }
}

The command line log:

nvidia@tegra-ubuntu:~$ sudo nvpmodel -m 0
nvidia@tegra-ubuntu:~$ sudo ~/jetson_clocks.sh
nvidia@tegra-ubuntu:~$ taskset -c 1 ./uyvy2bgr24
frames 30
avg read time 0ms
avg conv time 27ms
avg write time 0ms
nvidia@tegra-ubuntu:~$ taskset -c 0 ./uyvy2bgr24
frames 30
avg read time 0ms
avg conv time 311ms
avg write time 0ms

I don’t think the function specifics matter much. It would be more interesting to see the code that opens the device and maps the memory, to be honest. It’s much more likely, as you say, that something’s going on with the memory coherency – perhaps each byte fetch ends up being a separate atomic memory transaction or something like that.

Also, if you capture using the read() interface instead of mmap(), what is the performance?

If you have a work-around by using memcpy() or by assigning affinity, it sounds to me as if that’ll be the fastest way forward, because if it’s memory bus related, there may not be a software fix, and if there is a software fix, it may require a new kernel version … But just speculating here; the nvidia folks ought to know more about what’s going on, and I’m super curious to hear!

This issue exists in the OpenCV video capture function (we tested both 3.1.0 and 3.3.0), which internally uses the v4l-utils library for accessing V4L2 capture devices. OpenCV requests mmap-ed capture and frame conversion to bgr24 in one call to v4l-utils. To further isolate the problem, we also tried a simple V4L2 capture example found here https://gist.github.com/maxlapshin/1253534#file-capture_raw_frames-c, and added the conversion function above. The same problem exists there. BTW, the NVIDIA driver doesn’t seem to support direct read().

We think this problem is worth looking into, as it affects the straight out-of-the-box OpenCV capture function and any application that uses libv4l2.

Do you get the same timings on A57 cores other than CPU0? Not sure, but I think CPU0 has to handle interrupts from the interrupt controller.
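
If it is easier than rerunning with taskset, the test program could pin itself to each core in turn. A minimal sketch, with pin_to_cpu as a made-up helper name. I am assuming the usual TX2 numbering where CPU1 and CPU2 are the Denver cores and CPU0, CPU3, CPU4 and CPU5 are the A57 cores, which matches the log above for CPU0 and CPU1 but should be double-checked on the board.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* needed for CPU_SET and sched_setaffinity */
#endif
#include <sched.h>

/*
 * Restrict the calling thread to a single CPU, like `taskset -c <cpu>`.
 * Returns 0 on success, -1 on failure (errno is set by the kernel call).
 * pin_to_cpu is a hypothetical helper name.
 */
static int pin_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        /* pid 0 means "the calling thread". */
        return sched_setaffinity(0, sizeof(set), &set);
}
```

Calling pin_to_cpu(3), pin_to_cpu(4) and pin_to_cpu(5) before the conversion loop and comparing the conv times with the CPU0 result would show whether the slowdown is specific to CPU0 or common to all A57 cores. Note that the offline cores have to be enabled first (nvpmodel -m 0, as in the log), otherwise the call fails.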