History lesson time!
Because of the way NTSC color TV was introduced as a hack on top of black-and-white, color information in video (“chroma” in vernacular – but that term also has a different technical definition) is recorded at half the bandwidth (resolution) of the brightness (“luma” in vernacular) information.
Also, because of the way the NTSC color system works, video was split into a higher-bandwidth brightness signal and a two-dimensional lower-bandwidth “color delta” signal. This is often referred to as YUV, Y’U’V’, YCbCr, and other similar names. (Again, there are vernacular uses of these names, as well as well-defined technical terms that overlap – too much for this post.) These are different from the typical RGB formats used by computer displays.
This separation is actually alright, because it turns out that our eyes are much more sensitive to green light than to red / blue light. Ask two evolutionary biologists in the same room why, and you may get a number of interesting takes on this :-) (Or they might both agree, which would be boring.)
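To make that luma / color-difference split concrete, here is a minimal sketch of a BT.601-style RGB to Y’CbCr conversion for full-range 8-bit samples (the exact coefficients and ranges differ between standards and implementations, so this is illustrative only). Note the heavy weight on green in the luma term:

#include <stdint.h>

/* Illustrative full-range BT.601-style RGB -> Y'CbCr for 8-bit samples.
 * Real pipelines differ in coefficients (601 vs 709), ranges (full swing
 * vs 16-235 "studio" swing), and gamma handling. */
static void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                         uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    double yf = 0.299 * r + 0.587 * g + 0.114 * b;  /* note the big green weight */
    *y  = (uint8_t)(yf + 0.5);
    *cb = (uint8_t)(0.564 * (b - yf) + 128.5);      /* blue-difference channel */
    *cr = (uint8_t)(0.713 * (r - yf) + 128.5);      /* red-difference channel */
}

The Cb/Cr difference channels are what get subsampled below; Y keeps full resolution.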
The “half bandwidth” version is typically found as YUV422, where “4” really means “4 MHz bandwidth” and “22” really means “2 MHz bandwidth for each color sub-signal.” The MHz figures come from the bandwidth allocation in the NTSC signal; modern high-resolution displays use much more than that, but the ratios still persist.
Turns out, NTSC could only fit 6 MHz in a broadcast signal, so the color got further sub-sampled to the 411 format – 1 MHz each for U/V (or Cb/Cr in computer color space.) Turns out, subsampling four horizontal pixels to a single color pixel doesn’t look that great, so computer people came up with a similar allocation that instead subsamples a 2x2 pixel area and still uses the same bandwidth; this is colloquially known as the “420” format. (See also: JPEG, and many other computer image representations.)
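As a toy sketch of what the 2x2 (“420”) subsampling amounts to, assuming planar 8-bit chroma with even dimensions (real encoders filter more carefully and handle odd sizes):

#include <stddef.h>
#include <stdint.h>

/* Toy 4:2:0 chroma subsampling: the luma plane is kept as-is, while each
 * chroma plane is averaged over 2x2 blocks, halving its resolution in both
 * directions. 4:2:2 would average horizontal pairs only; 4:1:1 would average
 * runs of four pixels in a row. */
static void subsample_chroma_420(const uint8_t *chroma, size_t width, size_t height,
                                 uint8_t *out /* (width/2) * (height/2) bytes */)
{
    for (size_t y = 0; y < height; y += 2) {
        for (size_t x = 0; x < width; x += 2) {
            unsigned sum = chroma[y * width + x]
                         + chroma[y * width + x + 1]
                         + chroma[(y + 1) * width + x]
                         + chroma[(y + 1) * width + x + 1];
            out[(y / 2) * (width / 2) + (x / 2)] = (uint8_t)(sum / 4);
        }
    }
}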
OK, so to generate these signals, a typical sensor (the chip that the camera uses to see light) will arrange its three-colored pixels in an order like:
…
…
But, it turns out, manufacturing that very finely spaced sensor becomes harder and harder as chip sizes go down and megapixels go up. (I’m not a big fan of tons of megapixels; I’d rather have bigger pixels with less noise and higher dynamic range, but that doesn’t sound as sexy in marketing materials, so I lose on that.)
So, enter the Bayer pattern of sampling only two colors per row of pixels:
…
…
The camera sensor then looks at each intersection of the pixel boundaries, and uses the two green pixels, one red pixel, and one blue pixel bordering that intersection to calculate the final color of the output pixel. This is approximately the same color information as a typical 422 signal, but it trades the horizontal-only chroma subsampling of 422 for a 2x2 subsampling closer to the 420 format described above.
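A toy nearest-neighbor version of that, assuming an RGGB layout and 8-bit samples (real demosaicers interpolate per input pixel and do edge-aware filtering to avoid color fringing), might look like:

#include <stddef.h>
#include <stdint.h>

/* Toy demosaic of an RGGB Bayer image: each 2x2 cell (R G / G B) becomes one
 * output RGB pixel, averaging the two greens. Output is (width/2) x (height/2)
 * interleaved RGB. */
static void debayer_rggb_2x2(const uint8_t *bayer, size_t width, size_t height,
                             uint8_t *rgb /* 3 * (width/2) * (height/2) bytes */)
{
    for (size_t y = 0; y < height; y += 2) {
        for (size_t x = 0; x < width; x += 2) {
            uint8_t r  = bayer[y * width + x];
            uint8_t g1 = bayer[y * width + x + 1];
            uint8_t g2 = bayer[(y + 1) * width + x];
            uint8_t b  = bayer[(y + 1) * width + x + 1];

            size_t o = 3 * ((y / 2) * (width / 2) + (x / 2));
            rgb[o + 0] = r;
            rgb[o + 1] = (uint8_t)(((unsigned)g1 + g2) / 2);
            rgb[o + 2] = b;
        }
    }
}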
So, the problem is, the “raw” output of a Bayer sensor is not compatible with the YUV422 formats that some programs are hard-coded to expect. An external camera (such as a USB camera) that is expected to be compatible with a wide variety of pre-existing software will contain additional processing that turns the Bayer sensor data into whatever format the host application wants (typically YCbCr or RGB.) Embedded cameras (such as those in cell phones, and here the Jetson camera) do not add that processing on the camera board, but instead leave it to the host.
Once that processing is done in custom host hardware (maybe the image processing units on the Jetson, if I remember correctly?), it turns out to be harder to write a video4linux driver that gives what appears to be “raw” access to the camera yet supports using the offload hardware to de-Bayer the image. Hence why I think the current NVIDIA driver doesn’t expose YUV/YCbCr/RGB in the “raw” v4l2 driver.
There are two solutions to this problem:
- Do fancy footwork in the driver to provide the output of the image processing unit as a "raw" v4l2 video stream with flexible formats. This probably has all kinds of interesting internal resource allocation problems for the driver developer, and NVIDIA has probably decided to spend its scarce engineering time on other features that are also important.
- Use a software Bayer -> YUV converter, and burn CPU cycles to make the software compatible. This helps users who "must make it work," but it is a terrible way of building an embedded system for a systems integrator. Given that the target for Jetson is embedded systems where the vendor controls the installable software, burning CPU cycles (and thus power) on this is probably not a priority for NVIDIA.
Now enter video4linux2 (v4l2), the video API for Linux that’s been around for a long time. It has two parts: a driver API that lets you query/define formats and set up streams of video data buffers to capture, and a helper library that takes care of some of the arcane low-level gruntwork of talking to the drivers. (Personally, I find the driver API fine to use “raw,” but many applications also use the user-level library.)
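For flavor, here is roughly what the format-negotiation part of the driver API looks like when used “raw”; the device path and the requested pixel format are just examples, and error handling is trimmed:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/videodev2.h>

int main(void)
{
    int fd = open("/dev/video0", O_RDWR);   /* example device node */
    if (fd < 0) { perror("open"); return 1; }

    struct v4l2_capability cap;
    if (ioctl(fd, VIDIOC_QUERYCAP, &cap) == 0)
        printf("driver: %s, card: %s\n",
               (const char *)cap.driver, (const char *)cap.card);

    /* Ask for 640x480 packed 4:2:2 (YUYV). The driver is allowed to adjust
     * any of these fields to what it can actually deliver, so check the
     * struct after the ioctl returns. */
    struct v4l2_format fmt;
    memset(&fmt, 0, sizeof(fmt));
    fmt.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    fmt.fmt.pix.width = 640;
    fmt.fmt.pix.height = 480;
    fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_YUYV;
    fmt.fmt.pix.field = V4L2_FIELD_NONE;
    if (ioctl(fd, VIDIOC_S_FMT, &fmt) < 0)
        perror("VIDIOC_S_FMT");
    else
        printf("got %ux%u, fourcc 0x%08x\n",
               fmt.fmt.pix.width, fmt.fmt.pix.height, fmt.fmt.pix.pixelformat);

    close(fd);
    return 0;
}

This negotiation step is exactly where an application that hard-codes YUYV will fall over against a Bayer-only driver; that is what the converter described next works around.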
There exists a generic software format converter for v4l2. It works for applications that talk to the v4l2 drivers through the v4l2 API, but depending on how the application uses the device, it may or may not be compatible.
You can enable this for a particular run of a particular program by starting the program with:
LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libv4l/v4l2convert.so /usr/local/bin/skype
(or whatever/wherever your program is)
This tells the system to pre-load the “v4l2convert.so” module into the process; the module hijacks the v4l connections it can see and attempts to match up the format expectations through software conversion.
Sometimes this works, and sometimes not.
End history lesson.