We’ve identified an interesting performance problem. When we run the following function on each incoming frame stored in a V4L2 mmap-ed buffer (V4L2_MEMORY_MMAP-based capture), a single 1080p frame (1920x1080x2 bytes) takes a whopping ~311ms to convert. This happens only when the code runs on a Cortex core, regardless of core frequency; if we set the CPU affinity to a Denver core, the same function takes only ~27ms per frame (see the command line log at the end).
Further tests show that the slowdown comes from using the mmap-ed buffer directly as ‘src’ in this function. If we first copy the mmap-ed buffer to a pre-allocated user-space buffer and pass that buffer to the convert function, the conversion time immediately drops to 16ms on any Cortex core.
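For reference, the copy-based workaround amounts to something like the sketch below. The helper name stage_frame and its parameters are made up for illustration and are not part of v4l-utils; it assumes the staging buffer was allocated once up front with at least stride * height bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of v4l-utils): copy one UYVY frame out of
 * the mmap-ed V4L2 buffer into an ordinary, pre-allocated user-space buffer
 * and return a pointer the converter can read from instead. */
static const unsigned char *stage_frame(const unsigned char *mmap_buf,
                                        unsigned char *staging,
                                        size_t stride, size_t height)
{
    assert(mmap_buf != NULL && staging != NULL);
    /* One bulk copy of the whole frame, row padding included. */
    memcpy(staging, mmap_buf, stride * height);
    return staging;
}
```

The converter is then called with the returned pointer as ‘src’ instead of the mmap-ed address, with everything else unchanged.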
Furthermore, this only happens with the onboard camera, and thus with the NVIDIA V4L2 and nvhost drivers. If we use a USB camera, the conversion time is on par with the copy version.
We see the same issue in both L4T R27.1 and R28.1. Does anyone know what might be causing it? We suspect cache invalidation, but there is no obvious conclusion to draw, since cache invalidation should happen in both the copy and the copy-less versions.
BTW, the function is found in the v4l-utils library, which is used by OpenCV for interfacing with V4L2 cameras. We initially noticed this issue when running an OpenCV-based capture.
static void v4lconvert_uyvy_to_bgr24(const unsigned char *src, unsigned char *dest,
        int width, int height, int stride)
{
    int j;

    while (--height >= 0) {
        for (j = 0; j + 1 < width; j += 2) {
            int u = src[0];
            int v = src[2];
            int u1 = (((u - 128) << 7) + (u - 128)) >> 6;
            int rg = (((u - 128) << 1) + (u - 128) +
                    ((v - 128) << 2) + ((v - 128) << 1)) >> 3;
            int v1 = (((v - 128) << 1) + (v - 128)) >> 1;

            *dest++ = CLIP(src[1] + u1);
            *dest++ = CLIP(src[1] - rg);
            *dest++ = CLIP(src[1] + v1);

            *dest++ = CLIP(src[3] + u1);
            *dest++ = CLIP(src[3] - rg);
            *dest++ = CLIP(src[3] + v1);
            src += 4;
        }
        src += stride - width * 2;
    }
}
The command line log:
nvidia@tegra-ubuntu:~$ sudo nvpmodel -m 0
nvidia@tegra-ubuntu:~$ sudo ~/jetson_clocks.sh
nvidia@tegra-ubuntu:~$ taskset -c 1 ./uyvy2bgr24
frames 30
avg read time 0ms
avg conv time 27ms
avg write time 0ms
nvidia@tegra-ubuntu:~$ taskset -c 0 ./uyvy2bgr24
frames 30
avg read time 0ms
avg conv time 311ms
avg write time 0ms
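
The source of the uyvy2bgr24 test program is not included here. As a rough sketch only, per-frame figures like "avg conv time" could be collected with a monotonic clock along these lines; elapsed_ms and time_call_ms are hypothetical helper names, not taken from the actual benchmark.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: whole milliseconds elapsed between two timestamps. */
static long elapsed_ms(struct timespec start, struct timespec end)
{
    long long ns = (long long)(end.tv_sec - start.tv_sec) * 1000000000LL
                 + (end.tv_nsec - start.tv_nsec);
    return (long)(ns / 1000000LL);
}

/* Hypothetical helper: time a single call, e.g. one frame conversion.
 * CLOCK_MONOTONIC is used so wall-clock adjustments do not skew the result. */
static long time_call_ms(void (*fn)(void *), void *arg)
{
    struct timespec t0, t1;

    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ms(t0, t1);
}
```

Summing the per-frame results and dividing by the frame count would give an average comparable to the numbers in the log.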