VI5 kthread_stop Kernel Panic - Same Crash Pattern as Thread #343830

We’re running into a kernel panic that matches the crash pattern reported by zhang.pei.xing in Thread #343830:

https://forums.developer.nvidia.com/t/when-accessing-multiple-non-existent-cameras-the-system-will-crash/343830

Our crash signature:

WARNING: CPU: 4 PID: 4389 at kernel/kthread.c:83 kthread_stop+0x58/0x290

Unable to handle kernel paging request at virtual address 0000ffffbab130e0

Internal error: Oops: 96000006 [#1] PREEMPT SMP

Call trace:

kthread_stop+0xa4/0x290

vi5_channel_stop_kthreads+0x44/0x60

vi5_channel_stop_streaming+0x128/0x140

Kernel panic - not syncing: Oops: Fatal exception

Same function, same offset, same error code. In post #19, zhang.pei.xing confirmed that the root cause was identified and a fix was already available internally at NVIDIA. Could we get access to that patch?

The kernel panic we’re seeing is caused by a race condition in the RTCPU firmware. When multiple cameras enter error recovery concurrently, the RTCPU returns the wrong channel_id in channel_setup_resp, causing recovery to fail and ultimately crashing the kernel. This is the same root cause zhang.pei.xing identified in the forum thread above.

We’re on Xavier NX, JetPack 5.1.3, kernel 5.10.192-tegra #279, running 8 cameras over FPD-Link III.

Any help would be greatly appreciated. Thank you!

===================================================

======Kernel log=====================================

Message from syslogd@vision at Feb 23 21:06:32 …
kernel:[16076.942816] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[16076.939741] snd_soc_tegra210_adsp(E) snd_soc_tegra210_ahub(E) snd_soc_tegra_utils(E) snd_soc_simple_card_utils(E) nvadsp(E) tegra210_adma(E) snd_hda_codec(E) snd_hda_core(E) spi_tegra114(E) loop(E) binfmt_misc(E) ina3221(E) pwm_fan(E) nvgpu(E) nvmap(E) ip_tables(E) x_tables(E) [last unloaded: mtd]
[16076.939924] CPU: 2 PID: 10266 Comm: symbot_server-1 Tainted: G W E 5.10.192-tegra #279
[16076.939940] Hardware name: Symbot (DT)
[16076.939953] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[16076.939970] pc : kthread_stop+0x58/0x290
[16076.939979] lr : kthread_stop+0x4c/0x290
[16076.939985] sp : ffff80001cfdba40
[16076.939991] x29: ffff80001cfdba40 x28: ffff243a75273600
[16076.940033] x27: 0000000040045613 x26: 0000000000000000
[16076.940048] x25: ffff80001cfdbd08 x24: ffff2439c31cc1f8
[16076.940063] x23: ffff243a41540000 x22: ffff80001cfdbd08
[16076.940078] x21: ffff243a8a01d8b0 x20: ffff2439c31cc550
[16076.940108] x19: ffff243a8a01d880 x18: 0000000000000010
[16076.940122] x17: 0000000000000000 x16: ffffb8bcef0e50f0
[16076.940137] x15: ffff243a41540570 x14: ffffffffffffffff
[16076.940151] x13: ffff80009cfdb667 x12: ffff80001cfdb66f
[16076.940173] x11: 0000000000000001 x10: 0000000000000ab0
[16076.940194] x9 : ffff80001cfdba30 x8 : 612d657375203b30
[16076.940209] x7 : 206e6f206e6f6974 x6 : c0000000ffffefff
[16076.940224] x5 : ffff243b3fde1978 x4 : ffffb8bcf0eb7ba8
[16076.940238] x3 : 0000000000000001 x2 : ffff243b3fde1980
[16076.940252] x1 : 0000000000000000 x0 : 0000000000408004
[16076.940267] Call trace:
[16076.940276] kthread_stop+0x58/0x290
[16076.940288] vi5_channel_stop_kthreads+0x44/0x60
[16076.940302] vi5_channel_stop_streaming+0x128/0x140
[16076.940318] tegra_channel_stop_streaming+0x3c/0x70
[16076.940328] __vb2_queue_cancel+0x40/0x220
[16076.940337] vb2_core_streamoff+0x34/0xd0
[16076.940346] vb2_streamoff+0x34/0x80
[16076.940354] vb2_ioctl_streamoff+0x58/0x70
[16076.940364] v4l_streamoff+0x40/0x50
[16076.940372] __video_do_ioctl+0x188/0x400
[16076.940380] video_usercopy+0x280/0x7e0
[16076.940388] video_ioctl2+0x40/0x100
[16076.940395] v4l2_ioctl+0x68/0x90
[16076.940406] __arm64_sys_ioctl+0xac/0xf0
[16076.940417] el0_svc_common.constprop.0+0x80/0x1d0
[16076.940426] do_el0_svc+0x38/0xc0
[16076.940439] el0_svc+0x1c/0x30
[16076.940453] el0_sync_handler+0xa8/0xb0
[16076.940472] el0_sync+0x16c/0x180
[16076.940481] ---[ end trace 8cf42039d1202bf8 ]---
[16076.940680] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[16076.940956] Mem abort info:
[16076.941065] ESR = 0x96000004
[16076.941209] EC = 0x25: DABT (current EL), IL = 32 bits
[16076.941674] SET = 0, FnV = 0
[16076.941830] EA = 0, S1PTW = 0
[16076.941951] Data abort info:
[16076.942033] ISV = 0, ISS = 0x00000004
[16076.942157] CM = 0, WnR = 0
[16076.942269] user pgtable: 4k pages, 48-bit VAs, pgdp=000000017fa23000
[16076.942478] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[16076.942816] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[16076.943027] Modules linked in: veth(E) xt_nat(E) xt_tcpudp(E) xt_conntrack(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) xt_addrtype(E) iptable_filter(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) br_netfilter(E) lzo_rle(E) lzo_compress(E) zram(E) overlay(E) snd_soc_tegra186_asrc(E) snd_soc_tegra210_ope(E) snd_soc_tegra186_arad(E) snd_soc_tegra186_dspk(E) snd_soc_tegra210_iqc(E) snd_soc_tegra210_afc(E) snd_soc_tegra210_mvc(E) snd_soc_tegra210_dmic(E) snd_soc_tegra210_amx(E) snd_soc_tegra210_adx(E) snd_soc_tegra210_admaif(E) snd_soc_tegra210_i2s(E) snd_soc_tegra_pcm(E) snd_soc_tegra210_mixer(E) snd_soc_tegra210_sfc(E) aes_ce_blk(E) crypto_simd(E) cryptd(E) aes_ce_cipher(E) ghash_ce(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) snd_soc_spdif_tx(E) snd_soc_tegra_machine_driver(E) leds_gpio(E) max77620_thermal(E) ramoops(E) reed_solomon(E) realtek(E) snd_hda_codec_hdmi(E) tegra_bpmp_thermal(E) userspace_alert(E) snd_hda_tegra(E)

hello twang6,

is it a must to stay on JP-5.1.3?
is it possible to move forward to the latest JP-5 release version, which includes some fixes?

BTW, please also see Topic 305007 for the bug fixes that address a kernel panic when recovery has failed.

Hi Jerry,

Thank you for the feedback! Unfortunately, we need to stay on JP 5.1.3 since some of our customization is based on it and it’s already mass deployed. We’ve tried the two patches from Thread #305007, but they didn’t resolve the issue for us. The fix (referenced in post #19 of Thread #343830) appears to be different from those patches – could you help us get access to that one?

Thank you!

let me attach the patch file for testing, 0001-Camera-fix-kernel-warning-after-VI-timeout.patch (1.8 KB)

Hi Jerry,

We reviewed this patch and it appears to address the buffer state WARNING during timeout, which is helpful. However, our main issue is the kernel panic caused by kthread_stop being called on an already-exited thread in vi5_channel_stop_kthreads after error recovery fails. This crash path is not covered by this patch.

From our observation:

The crash doesn’t happen because of wrong buffer states; it happens after error recovery fails. The patch touches the top of the chain, but the crash happens at the bottom, and we believe there are three untouched failure points between them.

At a minimum, the IVC message handling needs to be fixed so that the RTCPU returns the correct channel_id when multiple cameras recover concurrently. These issues need to be fixed in the RTCPU firmware.

Result: Error recovery succeeds → kthread doesn’t exit → camera resumes streaming → no crash at all.

We also need one additional fix to prevent the kernel panic when the kthread exits.

File: vi5_fops.c, function tegra_channel_kthread_capture_dequeue, line 942:

// BEFORE (current code):

if (err) {
        dev_err(chan->vi->dev,
                "fatal: error recovery failed\n");
        break;  // kthread exits, pointer still points to freed memory
}

// AFTER (proposed fix):

if (err) {
        dev_err(chan->vi->dev,
                "fatal: error recovery failed\n");
        mutex_lock(&chan->stop_kthread_lock);
        chan->kthread_capture_dequeue = NULL;  // tell stop_kthreads "I'm already dead"
        mutex_unlock(&chan->stop_kthread_lock);
        break;  // kthread exits safely
}

Result: STREAMOFF → vi5_channel_stop_kthreads sees NULL → skips kthread_stop → no panic. The camera is dead, but the system stays alive.

We’d really appreciate your help on this. Thank you!

hello twang6,

just for confirmation,
you’ll need to apply 3 patches: 2 patches from Topic 305007, and also the 1 patch from post #5.

Hi Jerry,

The patch seems to be working – no more kernel panic. But we’re now seeing a new issue.

From our analysis, the patch bypasses the error recovery path for timeouts by using CAPTURE_TIMEOUT instead of CAPTURE_ERROR. This means the camera never actually recovers from the error state – it just keeps re-queuing buffers and retrying. The underlying RTCPU channel ID issue is still there, we’re just avoiding the code path that triggers it.

The concern is that since the recovery/cleanup path is never called, resources may not be properly freed. We’re now hitting OOM (out of memory) after about 3 hours of operation, with 57 OOM-killer invocations in the log.

Could this be related? If the timeout path re-queues buffers indefinitely without ever cleaning up, that could explain the memory growth. Would appreciate your thoughts on this. Thank you!

hello twang6,

IIRC, there’s a patch to address a vi5_channel_error_recover memory leak;
please see below for reference.

@@ -603,6 +603,9 @@ rel_buf:
        vi5_release_buffer(chan, buf);
 }
 

+static void vi5_unit_get_device_handle(struct platform_device *pdev,
+                       uint32_t csi_stream_id, struct device **dev);
+
 static int vi5_channel_error_recover(struct tegra_channel *chan,
        bool queue_error)
 {
@@ -620,6 +623,25 @@ static int vi5_channel_error_recover(struct tegra_channel *chan,
                        dev_err(&chan->video->dev, "vi capture release failed\n");
                        goto done;
                }
+
+               /* Release capture requests */
+               if (chan->request[vi_port] != NULL) {
+                       dma_free_coherent(chan->tegra_vi_channel[vi_port]->rtcpu_dev,
+                       chan->capture_queue_depth * sizeof(struct capture_descriptor),
+                       chan->request[vi_port], chan->request_iova[vi_port]);
+               }
+               chan->request[vi_port] = NULL;
+
+                       /* Release embedded data buffers */
+               if (chan->emb_buf_size > 0) {
+                       struct device *vi_unit_dev;
+
+                       vi5_unit_get_device_handle(chan->vi->ndev, chan->port[0], &vi_unit_dev);
+                       dma_free_coherent(vi_unit_dev, chan->emb_buf_size,
+                               chan->emb_buf_addr, chan->emb_buf);
+                       chan->emb_buf_size = 0;
+               }
+
                vi_channel_close_ex(chan->vi_channel_id[vi_port],
                                        chan->tegra_vi_channel[vi_port]);

We’ve already applied both patches from Topic #305007 (including the memory leak fix). The OOM is happening with all 3 patches applied. Could the issue be that the timeout path (CAPTURE_TIMEOUT) never calls recovery at all, so resources accumulate over time without cleanup? Thanks!

hello twang6,

please share the steps you’ve used to reproduce the OOM.
BTW, please be aware of this announcement: JetPack 5 upcoming End-of-Life notice.

Hi Jerry,

Based on our latest testing, the patches are working great – the VI5 driver now gracefully handles hardware errors without propagating them to higher software layers.

Thank you very much for the help!