L4T kernel 35.4.1 patches

Hi,

I’ve fixed some bugs in the kernel related to initialization and error handling for multiple MIPI cameras accessed via V4L2: patches-20230130.tar.gz (10.5 KB)

Before these patches, I observed various errors, kernel memory corruption (memory access violations, etc.), and failure to recover from errors. With the patches applied, I left a board streaming from multiple cameras overnight with a script that randomly turns off one camera to exercise the error handling code, and in the morning it was still working.

To avoid any doubt, I am releasing these patches under the terms of the GPL, just like the code they apply to (as required by the GPL). I would appreciate it if NVIDIA could merge these fixes so that other people and I don’t have to continually rebase them with every new release from NVIDIA.

This is a rebased and slightly updated version of my patches from “R32.1 kernel bug fixes”. The functional changes are:

  • I merged two patches that both touched the same code
  • I removed several patches, and parts of others, where the underlying code has since changed in a way that avoids the problem
  • I added the last patch, which fixes a new bug

My TODO comment in “Avoid using TEMP_CHANNEL_ID for transactions with responses” should be easily addressed by somebody more familiar with the rest of that code.

Hi,
Thanks for sharing. We will check the patches and see whether they can be included in a future release.

hello brian100,

I’ve gone through those patches, and they seem to be related to error handling for the multi-camera use case via V4L2.
We would also like to check this locally with developer kits. Is it possible to share your test approach, for instance how you reproduce the error conditions and how you validate the fixes?

Yes, you are correct, that is what these patches are for. I mainly validated the fixes with some tests that exercise certain parts of the system which produced bad results before (not producing camera images, kernel panics, system hangs, etc), which no longer cause those problems afterwards. None of them are deterministic, but they caused problems relatively quickly before and I have run them for longer periods afterwards without problems.

Some of the errors are triggered by userspace opening and/or closing multiple cameras at the same time. I would write a simple program that opens as many cameras as you’ve got at the same time, and then start and stop it repeatedly. My userspace code for that is in Rust and tied up with other dependencies, so it’s hard to share in a way that’s usable for you. Modifying any generic V4L2 example to open multiple cameras is pretty straightforward though.
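
As a rough illustration of the idea (a Python sketch, not the Rust code I actually used), the following opens a list of device nodes from separate threads at nearly the same moment, then closes them together, and repeats. The function name and the device paths in the usage note are assumptions for illustration. It only opens and closes the nodes; a real test would also need the usual V4L2 ioctls to start and stop streaming, which are left out here.

```python
import os
import threading


def open_close_together(paths, rounds=1):
    """Open all paths at nearly the same time, then close them together.

    Returns (successful_opens, errors) accumulated over all rounds.
    """
    opened = []   # one entry per successful open()
    errors = []   # (path, exception) for each failed open()
    if not paths:
        return 0, errors

    def worker(path, barrier):
        fd = None
        try:
            barrier.wait()                  # release all threads at once
            fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
            opened.append(path)
            barrier.wait()                  # hold until every device is open
        except OSError as exc:
            errors.append((path, exc))
            barrier.abort()                 # do not leave the other threads waiting
        except threading.BrokenBarrierError:
            pass                            # another thread failed; just clean up
        finally:
            if fd is not None:
                os.close(fd)

    for _ in range(rounds):
        barrier = threading.Barrier(len(paths))
        threads = [threading.Thread(target=worker, args=(p, barrier))
                   for p in paths]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
    return len(opened), errors
```

On a system where the cameras show up as /dev/video0 and /dev/video1 (an assumption; adjust for your board), something like open_close_together(["/dev/video0", "/dev/video1"], rounds=100) would exercise the concurrent open and close paths.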

Most of the other errors are triggered by the error recovery logic. Modifying a camera driver to never actually start streaming is one way to trigger those. Combining that with the previously-mentioned program that opens a bunch of V4L2 devices at the same time and then letting it attempt error recovery is likely to trigger problems. Another thing that I’ve done is starting everything normally and then turning streaming off on one camera to induce error recovery. This is my shell snippet to do the latter:

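# Repeatedly turn streaming off on one camera, then sleep for a random interval.
while true ; do
    date    # log when each stop was issued
    # Write 3 bytes to the device at 0x51 on I2C bus 1; on my cameras this turns streaming off.
    i2ctransfer -y -f 1 w3@0x51 0x02 0x02 1
    # Sleep for a random 0 to 10 seconds.
    python3 -c 'import time; import random; time.sleep(random.uniform(0, 10))'
done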

That turns streaming off on my cameras, and then waits a random amount of time. You’ll need to modify the I2C and register addresses for any other system.

When I first wrote the patches, I tested the error recovery by unplugging GMSL cables while streaming, and then plugging them back in. With my current system I don’t have cables that are electrically safe to unplug though.

The last patch is pretty easy to validate by looking at the error messages about NULL channels which are present before but not after. I did not find anything to track the resources leaked by filp_open without filp_close directly, but logically I’m sure it would cause some problem eventually.
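
One way to make that before/after comparison less manual is to count matching lines in the kernel log. This is only a sketch: the function name is made up, and the regular expression in the usage note is a guess at what the messages look like, since the exact message text is not quoted here. Check your own dmesg output and adjust the pattern.

```python
import re


def count_log_matches(log_text, pattern):
    """Count kernel log lines that match a case-insensitive regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    return sum(1 for line in log_text.splitlines() if regex.search(line))
```

For example, capture the log with subprocess.run(["dmesg"], capture_output=True, text=True).stdout and pass it in with a pattern such as r"null.*channel" (an assumed pattern), once on an unpatched kernel and once on a patched one. The count should drop to zero with the last patch applied.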
