Kernel panic when closing a stuck camera

Hi Guys,

I’ve encountered a kernel panic that happens with high probability when closing a stuck camera stream.

You can reproduce it by unplugging a streaming camera so that it enters the abnormal timeout state, and then closing the terminal or killing the process.
The use case is similar to a camera or cable accidentally breaking.

The dmesg is here: panic_by_close_stopped_video.txt (13.4 KB)

I posted this issue before, but there was no solution:

I also found similar issues without a proper solution:

Please help find a solution, thanks.

Apply these changes to verify.

0001-capture-ivc-fix-multi-cam-race-condition.patch (2.4 KB)
0001-vi5-continue-captures-even-after-corr-errors.patch (2.3 KB)
0001-camera-vi5-fix-stream-on-off-memory-leakage.patch (6.4 KB)

Hi Shane,

It still happens; please see the dmesg logs:
panic_dmesg1.txt (5.5 KB)
panic_dmesg2.txt (22.3 KB)

Try the change below to disable recovery.

diff --git a/drivers/media/platform/tegra/camera/vi/vi5_fops.c b/drivers/media/platform/tegra/camera/vi/vi5_fops.c
index 1d6def596..8d414b654 100644
--- a/drivers/media/platform/tegra/camera/vi/vi5_fops.c
+++ b/drivers/media/platform/tegra/camera/vi/vi5_fops.c
@@ -591,6 +591,7 @@ static int vi5_channel_error_recover(struct tegra_channel *chan,
       if (queue_error)
               vb2_queue_error(&chan->queue);

+       return -1;
       /* reset nvcsi stream */
       csi_subdev = tegra_channel_find_linked_csi_subdev(chan);
       if (!csi_subdev) {

Hi Shane,

I’ve disabled the error recovery, and the kernel panic log can be found in the attachment below.
One thing I noticed is that if I close any 5MP camera, all the other 3MP or 5MP cameras get stuck. When I then close the stuck cameras, it triggers a kernel panic. However, this did not happen before I applied the patches you provided above.

close_affect_other_port.txt (94.3 KB)

Please check the patches from the link below.

L4T kernel 35.4.1 patches - Jetson & Embedded Systems / Jetson Orin Nano - NVIDIA Developer Forums

Hi, author of those patches here. Yes I agree that they look like they’ll fix the problem in this thread, specifically the first patch 0001-Fix-error-recovery-for-tegra_channel_kthread_capture.patch looks very relevant.

Thanks for your contribution. I’ll give it a try soon.

Hi @bsilver16384

It seems better than before, but I can still reproduce the kernel panic.
The dmesg can be found here:
kernel-panic-after-10-patches.txt (5.0 KB)

The test scripts I used to reproduce the issue are attached below:
loop-v4l2.zip (633 Bytes)

Hi @ting.chang,

Unfortunately that crash does not look familiar to me, and I’m not working on anything related at the moment so I can’t go looking myself. I recommend adding prints to tegra_channel_close to figure out which pointer is NULL, and then figure out what’s supposed to happen in that situation. I’m definitely suspicious of how that interacts with the vi5_channel_error_recover function that my first patch changes, my guess is it’s violating some invariant that tegra_channel_close relies on, but I don’t see anything just looking at it.

If you’re unfamiliar with kernel development: 00000000000004f8 is a small address so it’s probably the result of dereferencing NULL. pr_err("vi: %p\n", vi); is where I’d start, and then figure out between which lines of code it crashes, and which pointer exactly is NULL. It may be in a different function that’s being inlined so it doesn’t show up in the stacktrace. Also it may be a use-after-free, which you can narrow down by adding prints where the relevant memory is allocated and freed to keep track of whether the lifetime where it’s allocated overlaps with the place it’s being used.
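To illustrate why a small fault address such as 00000000000004f8 suggests a NULL struct pointer: reading a member through a NULL pointer makes the CPU access address zero plus that member's offset. The user-space sketch below only shows the arithmetic. The struct `fake_channel`, its padding, and the helper names are hypothetical and do not reflect the real `tegra_channel` layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout for illustration only; NOT the real struct tegra_channel. */
struct fake_channel {
    char padding[0x4f8];  /* stand-in for whatever members come first */
    void *vi;             /* the member the crashing code would read */
};

/* Compile-time check that the invented member really sits at offset 0x4f8. */
static_assert(offsetof(struct fake_channel, vi) == 0x4f8,
              "vi is expected at offset 0x4f8 in this sketch");

/* Address the CPU would touch for base->vi, computed without dereferencing. */
uintptr_t member_fault_address(uintptr_t base)
{
    return base + offsetof(struct fake_channel, vi);
}

/* Print the would-be fault address when the base pointer is NULL (0). */
void print_null_fault_address(void)
{
    printf("fault address: %016lx\n",
           (unsigned long)member_fault_address(0));
}
```

Working backwards in the kernel means finding which member of which suspect struct sits at the faulting offset; a tool such as `pahole` can list member offsets if the kernel was built with debug info.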

Hi @ShaneCCC

Do you have any plans to fix this issue?

Thank you.

Does using Ctrl+C to terminate v4l2-ctl also reproduce the issue?

Yes, initially we found this issue occasionally when using Ctrl+C to close the terminal. We then used scripts to kill the process in order to reproduce the issue quickly.

Presumably the problem doesn’t occur if capture succeeds without entering the recovery path.

If any one of the camera streams is in the recovery path, closing either a normal or an abnormal stream can trigger the kernel panic.

We tried stopping the sensor stream via sysfs to simulate a camera unplug and enter recovery mode, then killed the v4l2-ctl process about 10 times, but were unable to reproduce the problem.

Hi @ShaneCCC

10 times is not enough.
Can you use my scripts below to test for 15 minutes?
loop-v4l2.zip (633 Bytes)

Once the script is running, unplug the camera, or use any other method, to make the cameras enter recovery mode.
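The attached scripts are not reproduced here. As a rough illustration of the start-then-kill loop discussed in this thread, a sketch in C might look like the following. The function names `run_then_kill` and `repeat_kill_test`, the timing parameter, and the command line mentioned afterwards are all assumptions, not the contents of loop-v4l2.zip.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Start argv[0] with its arguments, let it run for run_ms milliseconds,
 * then send SIGKILL. Returns 1 if the child died from SIGKILL, 0 if it
 * had already exited on its own, and -1 on error. */
int run_then_kill(char *const argv[], long run_ms)
{
    assert(argv != NULL && argv[0] != NULL);
    assert(run_ms >= 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);  /* only reached if exec failed */
    }

    struct timespec ts = { run_ms / 1000, (run_ms % 1000) * 1000000L };
    nanosleep(&ts, NULL);

    kill(pid, SIGKILL);

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) ? 1 : 0;
}

/* Repeat the start-then-kill cycle test_count times. Returns how many
 * runs ended in SIGKILL, or -1 if any run hit an error. */
int repeat_kill_test(char *const argv[], long run_ms, int test_count)
{
    int killed = 0;
    for (int i = 0; i < test_count; i++) {
        int r = run_then_kill(argv, run_ms);
        if (r < 0)
            return -1;
        killed += r;
    }
    return killed;
}
```

For a camera test, `argv` could be something like `{"v4l2-ctl", "-d", "/dev/video0", "--stream-mmap", NULL}`, where the device path is an assumption and one such loop would run per camera.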

I will try it.
Could you also try setting STREAM_COUNT to 100?

I’ve tested it with STREAM_COUNT=100. In my test, the panic still happened at TEST_COUNT=82.

Hi @ShaneCCC

Did you successfully reproduce the issue?