VI5 kthread_stop Kernel Panic - Same Crash Pattern as Thread #343830

We’re running into a kernel panic that matches the crash pattern reported by zhang.pei.xing in Thread #343830:

https://forums.developer.nvidia.com/t/when-accessing-multiple-non-existent-cameras-the-system-will-crash/343830

Our crash signature:

WARNING: CPU: 4 PID: 4389 at kernel/kthread.c:83 kthread_stop+0x58/0x290

Unable to handle kernel paging request at virtual address 0000ffffbab130e0

Internal error: Oops: 96000006 [#1] PREEMPT SMP

Call trace:

kthread_stop+0xa4/0x290

vi5_channel_stop_kthreads+0x44/0x60

vi5_channel_stop_streaming+0x128/0x140

Kernel panic - not syncing: Oops: Fatal exception

Same function, same offset, same error code. In post #19, zhang.pei.xing confirmed that the root cause was identified and a fix was already available internally at NVIDIA. Could we get access to that patch?

The kernel panic we’re seeing is caused by a race condition in the RTCPU firmware. When multiple cameras enter error recovery concurrently, the RTCPU returns the wrong channel_id in channel_setup_resp, causing recovery to fail and ultimately crashing the kernel. This is the same root cause zhang.pei.xing identified in the forum thread above.

We’re on Xavier NX, JetPack 5.1.3, kernel 5.10.192-tegra #279, running 8 cameras over FPD-Link III.

Any help would be greatly appreciated. Thank you!

===================================================

======Kernel log=====================================

Message from syslogd@vision at Feb 23 21:06:32 …
kernel:[16076.942816] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[16076.939741] snd_soc_tegra210_adsp(E) snd_soc_tegra210_ahub(E) snd_soc_tegra_utils(E) snd_soc_simple_card_utils(E) nvadsp(E) tegra210_adma(E) snd_hda_codec(E) snd_hda_core(E) spi_tegra114(E) loop(E) binfmt_misc(E) ina3221(E) pwm_fan(E) nvgpu(E) nvmap(E) ip_tables(E) x_tables(E) [last unloaded: mtd]
[16076.939924] CPU: 2 PID: 10266 Comm: symbot_server-1 Tainted: G W E 5.10.192-tegra #279
[16076.939940] Hardware name: Symbot (DT)
[16076.939953] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[16076.939970] pc : kthread_stop+0x58/0x290
[16076.939979] lr : kthread_stop+0x4c/0x290
[16076.939985] sp : ffff80001cfdba40
[16076.939991] x29: ffff80001cfdba40 x28: ffff243a75273600
[16076.940033] x27: 0000000040045613 x26: 0000000000000000
[16076.940048] x25: ffff80001cfdbd08 x24: ffff2439c31cc1f8
[16076.940063] x23: ffff243a41540000 x22: ffff80001cfdbd08
[16076.940078] x21: ffff243a8a01d8b0 x20: ffff2439c31cc550
[16076.940108] x19: ffff243a8a01d880 x18: 0000000000000010
[16076.940122] x17: 0000000000000000 x16: ffffb8bcef0e50f0
[16076.940137] x15: ffff243a41540570 x14: ffffffffffffffff
[16076.940151] x13: ffff80009cfdb667 x12: ffff80001cfdb66f
[16076.940173] x11: 0000000000000001 x10: 0000000000000ab0
[16076.940194] x9 : ffff80001cfdba30 x8 : 612d657375203b30
[16076.940209] x7 : 206e6f206e6f6974 x6 : c0000000ffffefff
[16076.940224] x5 : ffff243b3fde1978 x4 : ffffb8bcf0eb7ba8
[16076.940238] x3 : 0000000000000001 x2 : ffff243b3fde1980
[16076.940252] x1 : 0000000000000000 x0 : 0000000000408004
[16076.940267] Call trace:
[16076.940276] kthread_stop+0x58/0x290
[16076.940288] vi5_channel_stop_kthreads+0x44/0x60
[16076.940302] vi5_channel_stop_streaming+0x128/0x140
[16076.940318] tegra_channel_stop_streaming+0x3c/0x70
[16076.940328] __vb2_queue_cancel+0x40/0x220
[16076.940337] vb2_core_streamoff+0x34/0xd0
[16076.940346] vb2_streamoff+0x34/0x80
[16076.940354] vb2_ioctl_streamoff+0x58/0x70
[16076.940364] v4l_streamoff+0x40/0x50
[16076.940372] __video_do_ioctl+0x188/0x400
[16076.940380] video_usercopy+0x280/0x7e0
[16076.940388] video_ioctl2+0x40/0x100
[16076.940395] v4l2_ioctl+0x68/0x90
[16076.940406] __arm64_sys_ioctl+0xac/0xf0
[16076.940417] el0_svc_common.constprop.0+0x80/0x1d0
[16076.940426] do_el0_svc+0x38/0xc0
[16076.940439] el0_svc+0x1c/0x30
[16076.940453] el0_sync_handler+0xa8/0xb0
[16076.940472] el0_sync+0x16c/0x180
[16076.940481] ---[ end trace 8cf42039d1202bf8 ]---
[16076.940680] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[16076.940956] Mem abort info:
[16076.941065] ESR = 0x96000004
[16076.941209] EC = 0x25: DABT (current EL), IL = 32 bits
[16076.941674] SET = 0, FnV = 0
[16076.941830] EA = 0, S1PTW = 0
[16076.941951] Data abort info:
[16076.942033] ISV = 0, ISS = 0x00000004
[16076.942157] CM = 0, WnR = 0
[16076.942269] user pgtable: 4k pages, 48-bit VAs, pgdp=000000017fa23000
[16076.942478] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[16076.942816] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[16076.943027] Modules linked in: veth(E) xt_nat(E) xt_tcpudp(E) xt_conntrack(E) xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) xt_addrtype(E) iptable_filter(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) libcrc32c(E) br_netfilter(E) lzo_rle(E) lzo_compress(E) zram(E) overlay(E) snd_soc_tegra186_asrc(E) snd_soc_tegra210_ope(E) snd_soc_tegra186_arad(E) snd_soc_tegra186_dspk(E) snd_soc_tegra210_iqc(E) snd_soc_tegra210_afc(E) snd_soc_tegra210_mvc(E) snd_soc_tegra210_dmic(E) snd_soc_tegra210_amx(E) snd_soc_tegra210_adx(E) snd_soc_tegra210_admaif(E) snd_soc_tegra210_i2s(E) snd_soc_tegra_pcm(E) snd_soc_tegra210_mixer(E) snd_soc_tegra210_sfc(E) aes_ce_blk(E) crypto_simd(E) cryptd(E) aes_ce_cipher(E) ghash_ce(E) sha2_ce(E) sha256_arm64(E) sha1_ce(E) snd_soc_spdif_tx(E) snd_soc_tegra_machine_driver(E) leds_gpio(E) max77620_thermal(E) ramoops(E) reed_solomon(E) realtek(E) snd_hda_codec_hdmi(E) tegra_bpmp_thermal(E) userspace_alert(E) snd_hda_tegra(E)

hello twang6,

is it a must to stay on JP-5.1.3?
is it possible to move forward to the latest JP-5 release version, which includes some fixes?

BTW, please also see Topic 305007 for the bug fixes that address a kernel panic when recovery has failed.

Hi Jerry,

Thank you for the feedback! Unfortunately, we need to stay on JP 5.1.3 since some of our customization is based on it and it’s already mass deployed. We’ve tried the two patches from Thread #305007, but they didn’t resolve the issue for us. The fix (referenced in post #19 of Thread #343830) appears to be different from those patches – could you help us get access to that one?

Thank you!

let me attach the patch file for testing, 0001-Camera-fix-kernel-warning-after-VI-timeout.patch (1.8 KB)

Hi Jerry,

We reviewed this patch and it appears to address the buffer state WARNING during timeout, which is helpful. However, our main issue is the kernel panic caused by kthread_stop being called on an already-exited thread in vi5_channel_stop_kthreads after error recovery fails. This crash path is not covered by this patch.

From our observation:

The crash doesn’t happen because of wrong buffer states; it happens after error recovery fails. The patch touches the top of the chain, but the crash happens at the bottom, and we believe there are three untouched failure points between them.

At a minimum, the IVC message handling needs to be fixed so that the RTCPU returns the correct channel_id when multiple cameras recover concurrently. These issues need to be fixed in the RTCPU firmware.

Result: Error recovery succeeds → kthread doesn’t exit → camera resumes streaming → no crash at all.

We also need one additional fix to prevent the kernel panic when the kthread exits.

File: vi5_fops.c, function tegra_channel_kthread_capture_dequeue, line 942:

// BEFORE (current code):

if (err) {
        dev_err(chan->vi->dev,
                "fatal: error recovery failed\n");
        break;  // kthread exits, pointer still points to freed memory
}

// AFTER (proposed fix):

if (err) {
        dev_err(chan->vi->dev,
                "fatal: error recovery failed\n");
        mutex_lock(&chan->stop_kthread_lock);
        chan->kthread_capture_dequeue = NULL;  // tell stop_kthreads "I'm already dead"
        mutex_unlock(&chan->stop_kthread_lock);
        break;  // kthread exits safely
}

Result: STREAMOFF → vi5_channel_stop_kthreads sees NULL → skips kthread_stop → no panic. The camera is dead, but the system stays alive.

We’d really appreciate your help on this. Thank you!

hello twang6,

just for confirmation,
you’ll need to apply 3 patches: 2 patches from Topic 305007, and also the 1 patch from post #5.

Hi Jerry,

The patch seems to be working – no more kernel panic. But we’re now seeing a new issue.

From our analysis, the patch bypasses the error recovery path for timeouts by using CAPTURE_TIMEOUT instead of CAPTURE_ERROR. This means the camera never actually recovers from the error state – it just keeps re-queuing buffers and retrying. The underlying RTCPU channel ID issue is still there, we’re just avoiding the code path that triggers it.

The concern is that since the recovery/cleanup path is never called, resources may not be properly freed. We’re now hitting OOM (out of memory) after about 3 hours of operation, with 57 OOM-killer invocations in the log.

Could this be related? If the timeout path re-queues buffers indefinitely without ever cleaning up, that could explain the memory growth. Would appreciate your thoughts on this. Thank you!

hello twang6,

IIRC, there’s a patch to address a vi5_channel_error_recover memory leak;
please see below for reference.

@@ -603,6 +603,9 @@ rel_buf:
        vi5_release_buffer(chan, buf);
 }
 

+static void vi5_unit_get_device_handle(struct platform_device *pdev,
+                       uint32_t csi_stream_id, struct device **dev);
+
 static int vi5_channel_error_recover(struct tegra_channel *chan,
        bool queue_error)
 {
@@ -620,6 +623,25 @@ static int vi5_channel_error_recover(struct tegra_channel *chan,
                        dev_err(&chan->video->dev, "vi capture release failed\n");
                        goto done;
                }
+
+               /* Release capture requests */
+               if (chan->request[vi_port] != NULL) {
+                       dma_free_coherent(chan->tegra_vi_channel[vi_port]->rtcpu_dev,
+                       chan->capture_queue_depth * sizeof(struct capture_descriptor),
+                       chan->request[vi_port], chan->request_iova[vi_port]);
+               }
+               chan->request[vi_port] = NULL;
+
+                       /* Release embedded data buffers */
+               if (chan->emb_buf_size > 0) {
+                       struct device *vi_unit_dev;
+
+                       vi5_unit_get_device_handle(chan->vi->ndev, chan->port[0], &vi_unit_dev);
+                       dma_free_coherent(vi_unit_dev, chan->emb_buf_size,
+                               chan->emb_buf_addr, chan->emb_buf);
+                       chan->emb_buf_size = 0;
+               }
+
                vi_channel_close_ex(chan->vi_channel_id[vi_port],
                                        chan->tegra_vi_channel[vi_port]);

We’ve already applied both patches from Topic #305007 (including the memory leak fix). The OOM is happening with all 3 patches applied. Could the issue be that the timeout path (CAPTURE_TIMEOUT) never calls recovery at all, so resources accumulate over time without cleanup? Thanks!

hello twang6,

please share the steps you’ve used to reproduce the OOM.
BTW, please be aware of this announcement: JetPack 5 upcoming End-of-Life notice.

Hi Jerry,

Based on our latest testing, the patches are working great – the VI5 driver now gracefully handles hardware errors without propagating them to higher software layers.

Thank you very much for the help!