Jetpack 5.1.3 kernel panic - tegra-camrtc-capture-vi tegra-capture-vi: fatal: error recovery failed

Hi NVIDIA Team,

We are currently running a large number of production kits at customer sites using the tegra-video VI5 driver on a custom Xavier NX carrier board with 8x AR0234 cameras. Our current baseline is JetPack 5.1.3. (We briefly tested JetPack 5.1.6 and observed a similar issue. Also, migrating our entire application stack, dependencies, and customer deployments to 5.1.6 is not feasible at this stage.)

We have recreated our customer use case by simulating uncorrectable (uncorr) and correctable (corr) MIPI errors on all 8 cameras. This initially produced kernel panics and other failures, so we analyzed the root cause. We successfully patched vi5_fops.c to prevent the kernel panics, but we are now encountering a secondary issue where the VI subsystem locks up and requires a full system reboot to recover.

Please find attached the dmesg log and the vi5_fops patch we have added:
dmesg_log_jp5_1_3.txt (62.7 KB)

kernel_vi5_fops_patch.txt (10.3 KB)

We would like you to review our driver fixes and provide guidance on the subsequent IVC bus lockup.

  1. Driver Stability Fixes Applied (vi5_fops.c)
    Before our patches, the rapid teardown collisions caused multiple kernel panics (Data Aborts and SLUB double-frees) inside the Nvidia driver. We applied the following fixes to vi5_fops.c to ensure Linux kernel stability regardless of user-space behavior.

    a. VFS Lifecycle & Double-Free Fix: In vi5_channel_error_recover and vi5_channel_stop_streaming, the driver was manually calling vi_channel_close_ex() and kfree(). We removed these manual calls. Instead, we now explicitly call filp_close(), which allows the kernel VFS’s delayed_fput to trigger vi_channel_release safely in the background, ensuring memory is freed exactly once.

    b. Kthread Use-After-Free Fix: If the capture kthread encountered an IVC timeout, it exited early. When GStreamer subsequently stopped the pipeline, kthread_stop() was called on a freed pointer, causing an Oops (96000004). We added get_task_struct() at thread creation and put_task_struct() after kthread_stop to preserve the task_struct memory safely.

    c. DMA & VB2 Buffer Safety: We ensured chan->request[vi_port] = NULL is explicitly set after dma_free_coherent to prevent IOMMU double-frees. We also modified the dequeue drain loop to use vb2_buffer_done(…, VB2_BUF_STATE_ERROR) directly, rather than touching hardware registers during a reset state.

  2. The VI/IVC Lockup (Post-Patch)
    With the kernel panics resolved, our 8-camera system easily survives continuous hard-fault error injections. However, the system eventually experiences the following unrecoverable failure after 1 to 2 hours:

May 6 16:08:27 tegra-ubuntu kernel: [32090.571427] nvmap_alloc_handle: PID 454392: gst-launch-1.0: WARNING: All NvMap Allocations must have a tag to identify the subsystem allocating memory.Please pass the tag to the API call NvRmMemHanldeAllocAttr() or relevant.

May 7 06:58:11 tegra-ubuntu kernel: [85474.052053] tegra194-vi5 15c10000.vi: IVC control submit failed
May 7 06:58:11 tegra-ubuntu kernel: [85474.052403] tegra-camrtc-capture-vi tegra-capture-vi: vi capture setup failed
May 7 06:58:11 tegra-ubuntu kernel: [85474.052842] tegra194-vi5 15c10000.vi: IVC control submit failed
May 7 06:58:11 tegra-ubuntu kernel: [85474.053011] tegra194-vi5 15c10000.vi: csi_stream_release: failed to disable nvcsi tpg on stream 0 virtual channel 1

Once this state is reached, all VI capture setup and release calls fail permanently. We are currently forced to perform a system reboot to restore camera functionality.
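
For monitoring in the field, the lockup signature above can be detected from user space before deciding to reboot. The following is only a minimal sketch: the function names and the `threshold` parameter are hypothetical, and the message fragments are copied from the log excerpt above rather than from any documented NVIDIA interface.

```python
import re

# Message fragments copied from the dmesg excerpt in this post (not an official list).
LOCKUP_SIGNATURES = (
    "IVC control submit failed",
    "vi capture setup failed",
)

# Matches the "[85474.052053]" style kernel timestamp.
TIMESTAMP_RE = re.compile(r"\[\s*(\d+\.\d+)\]")


def find_lockup_events(lines):
    """Return a list of (timestamp_or_None, signature) for matching log lines."""
    events = []
    for line in lines:
        for sig in LOCKUP_SIGNATURES:
            if sig in line:
                match = TIMESTAMP_RE.search(line)
                ts = float(match.group(1)) if match else None
                events.append((ts, sig))
                break  # count each line at most once
    return events


def vi_looks_locked_up(lines, threshold=2):
    """Hypothetical heuristic: treat `threshold` or more matches as a lockup."""
    return len(find_lockup_events(lines)) >= threshold
```

In practice the lines could come from `dmesg` or the journal; the threshold is an arbitrary choice and would need tuning against real logs.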

Questions for the NVIDIA Team:
To ensure our production architecture is fully robust on JetPack 5.1.3, we request your guidance on the following:

Driver Patch Validation: Can you confirm if our kernel fixes (relying on filp_close() for VFS teardown and utilizing get_task_struct for the kthreads) align with the intended architecture for JetPack 5.1.3? Are these changes safe for production?

RTCPU Soft Reset: When the IVC bus enters this state (error -512, i.e. -ERESTARTSYS, or "IVC control submit failed"), is there any debugfs node, sysfs trigger, or kernel API that allows Linux to initiate a soft reset of the camera coprocessor (RTCPU)? We would prefer a targeted reset mechanism rather than rebooting the entire Xavier NX.

Thank you for your time and expertise.

Hello Yugesh,

Let me confirm one thing first.
Here are some bug fixes for JP-5.1.3/r35.5.0. Did you apply all of them for your use case?

Hi JerryChang,

Thank you for getting back to us.

To confirm, yes, we have already applied all of those baseline patches to our JP 5.1.3 / r35.5.0 build:

0001-capture-ivc-fix-multi-cam-race-condition.patch

0001-vi5_fops-fix-mem-leak.patch

0001-kernel-camera-remove-csi-enent-sync-for-recover.patch

Furthermore, we also applied the subsequent 0002-vi5_fops-fix-mem-leak.patch provided by Nvidia here: Vi_capture_release causes kernel panic - #8 by JerryChang

Even with all of these official patches applied, we were still hitting race conditions, which is why we implemented the additional filp_close and get_task_struct safeguards detailed in our previous message.

Regarding the RCE Firmware Update
You mentioned trying the firmware update from Topic293597_Jun05_rce-fw.7z. However, I noticed the file provided inside is camera-rtcpu-t234-rce.img and the flash command targets the jetson-agx-orin-devkit.

As mentioned in our architecture description, our custom carrier board is based on the Xavier NX (T194 silicon), not Orin (T234).

Could you please provide the equivalent camera-rtcpu-t194-rce.img for the Xavier architecture? Once we have the correct t194 binary, we will test the -k A_rce-fw partition flash and report back.

New Enqueue Panic Discovered & Fixed
While waiting, we continued our stress testing for 6 hours and discovered a secondary, silent failure inside vi_capture_setup() that causes a NULL pointer dereference. We want to share this with you as it appears to be a bug in the lower-level Nvidia code.

If the RTCPU runs out of memory, vi_capture_setup() correctly prints an error to dmesg, but mistakenly returns 0 (success) instead of an error code:

dmesg:
[23897.944493] tegra194-vi5 15c10000.vi: vi_capture_setup: memoryinfo ringbuffer alloc failed
[23897.944873] tegra-camrtc-capture-vi tegra-capture-vi: err_rec: successfully reset the capture channel

Because it returns 0, the tegra_channel_kthread_capture_enqueue thread wakes up and executes vi5_setup_surface(). Because the setup failed internally, capture_data is NULL, causing an immediate Data Abort:

dmesg:
[45644.519400] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[45644.519728] Mem abort info:
[45644.519950] ESR = 0x96000044
[45644.520067] EC = 0x25: DABT (current EL), IL = 32 bits
[45644.520599] SET = 0, FnV = 0
[45644.520734] EA = 0, S1PTW = 0
[45644.520867] Data abort info:
[45644.520990] ISV = 0, ISS = 0x00000044
[45644.521129] CM = 0, WnR = 1
[45644.521253] user pgtable: 4k pages, 48-bit VAs, pgdp=000000013c418000
[45644.521472] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[45644.521722] Internal error: Oops: 96000044 [#1] PREEMPT SMP
[45644.521902] Modules linked in: fuse lzo_rle lzo_compress zram ramoops reed_solomon realtek loop snd_soc_tegra186_asrc snd_soc_tegra210_ope snd_soc_tegra186_arad snd_soc_tegra210_mvc snd_soc_tegra210_afc snd_soc_tegra210_iqc snd_soc_tegra210_dmic snd_soc_tegra186_dspk nvgpu snd_soc_tegra210_mixer snd_soc_tegra210_admaif snd_soc_tegra210_adx snd_soc_tegra210_amx snd_soc_tegra210_i2s snd_soc_tegra_pcm snd_soc_tegra210_sfc aes_ce_blk crypto_simd cryptd aes_ce_cipher ghash_ce sha2_ce sha256_arm64 sha1_ce input_leds snd_soc_tegra_machine_driver leds_gpio snd_soc_spdif_tx pwm_fan snd_hda_codec_hdmi snd_soc_tegra210_adsp max77620_thermal ina3221 snd_hda_tegra snd_soc_tegra_utils snd_hda_codec snd_soc_simple_card_utils tegra_bpmp_thermal snd_soc_tegra210_ahub nvadsp tegra210_adma userspace_alert snd_hda_core spi_tegra114 binfmt_misc nvmap ip_tables x_tables [last unloaded: mtd]
[45644.576317] CPU: 0 PID: 2763 Comm: vi-output, ar02 Not tainted 5.10.192 #195
[45644.583643] Hardware name: Symbot (DT)
[45644.587328] pstate: 40c00009 (nZcv daif +PAN +UAO -TCO BTYPE=--)
[45644.593628] pc : tegra_channel_kthread_capture_enqueue+0x1ec/0x4c0
[45644.599920] lr : tegra_channel_kthread_capture_enqueue+0x1e8/0x4c0
[45644.606217] sp : ffff80002b5e3d30
[45644.609373] x29: ffff80002b5e3d30 x28: 0000000000000000
[45644.615142] x27: 0000000000000001 x26: ffff33ca0312a080
[45644.620655] x25: 00000000000004b0 x24: 0000000000000f00
[45644.626168] x23: ffff80002b5e3e28 x22: ffff33ca0312aa14
[45644.631424] x21: ffff33ca0312aad8 x20: 0000000000000000
[45644.637193] x19: ffff33ca3b4d2c00 x18: 0000007ff1000000
[45644.642703] x17: 0000000000000000 x16: 0000000000000000
[45644.648069] x15: 000000000000001e x14: 0000000000000000
[45644.653296] x13: 0000000000000000 x12: 0000000000000000
[45644.658891] x11: 0000000000000000 x10: 0000000000000000
[45644.664148] x9 : 0000000000000000 x8 : 0000000000000000
[45644.669830] x7 : 0000000000000000 x6 : ffff8000b30f1180
[45644.674998] x5 : 000000004e0f0040 x4 : 0000000000000000
[45644.680422] x3 : 0000000000000000 x2 : ffffffffffffffc0
[45644.685781] x1 : ffffd8dc0f6bee80 x0 : ffff8000b30f1000
[45644.691354] Call trace:
[45644.693574] tegra_channel_kthread_capture_enqueue+0x1ec/0x4c0
[45644.699673] kthread+0x148/0x170
[45644.702675] ret_from_fork+0x10/0x18
[45644.704457] tegra194-vi5 15c10000.vi: capture status timed out
[45644.706589] Code: a9082fe9 f9004bec 97e52303 a9482fe9 (a9007f9f)
[45644.712223] tegra-camrtc-capture-vi tegra-capture-vi: vi-output, ar0234 9-0066: uncorr_err: request timed out after 2500 ms
[45644.718575] ---[ end trace 856e2b813743a52f ]---
[45644.734079] video device name=vi-output, ar0234 9-0066
[45644.735895] Kernel panic - not syncing: Oops: Fatal exception
[45644.739361] Err cam Name: vi-output, ar0234 9-0066
[45644.744477] SMP: stopping secondary CPUs
[45644.744509] Kernel Offset: 0x58dbfe210000 from 0xffff800010000000
[45644.744513] PHYS_OFFSET: 0xffffcc3700000000
[45644.744521] CPU features: 0x48240002,03802a30
[45644.744525] Memory Limit: none
[45644.770460] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
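
As a cross-check on the trace above, the ESR value can be decoded by hand using the AArch64 ESR_ELx field layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0; for data aborts, WnR is ISS bit 6 and DFSC is ISS bits 5:0). The sketch below is illustrative only: `decode_esr` is a hypothetical helper, and the fault-name table covers just the translation-fault codes.

```python
# Subset of Data Fault Status Codes; only translation faults are listed here.
DFSC_NAMES = {
    0x04: "translation fault, level 0",
    0x05: "translation fault, level 1",
    0x06: "translation fault, level 2",
    0x07: "translation fault, level 3",
}


def decode_esr(esr):
    """Split an AArch64 ESR_ELx value into the fields printed in the Oops."""
    ec = (esr >> 26) & 0x3F       # exception class
    il = (esr >> 25) & 0x1        # instruction length bit
    iss = esr & 0x1FFFFFF         # instruction-specific syndrome
    info = {"EC": ec, "IL": il, "ISS": iss}
    if ec in (0x24, 0x25):        # data abort from lower EL / current EL
        info["WnR"] = (iss >> 6) & 0x1   # 1 = fault caused by a write
        info["DFSC"] = iss & 0x3F
        info["fault"] = DFSC_NAMES.get(info["DFSC"], "other")
    return info


if __name__ == "__main__":
    print(decode_esr(0x96000044))
```

For 0x96000044 this gives EC = 0x25 and WnR = 1, matching the "DABT (current EL)" and "WnR = 1" lines in the log: a write through a NULL pointer. The earlier kthread_stop() Oops (96000004) decodes the same way except that WnR = 0, i.e. a read.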

The RTCPU Watchdog / Runtime PM Deadlock (New Issue)
After applying the enqueue fix above, the kernel no longer panics during our stress tests. However, we are now facing a Runtime PM deadlock where the RTCPU crashes, but the kernel watchdog fails to reset it because the power domain is suspended:

[Fri May 8 04:25:06 2026] tegra186-cam-rtcpu bc00000.rtcpu: cannot reboot while suspended
[Fri May 8 04:25:06 2026] tegra186-cam-rtcpu bc00000.rtcpu: Alert: Camera RTCPU gone bad! restoring it immediately!!
[Fri May 8 04:25:06 2026] tegra186-cam-rtcpu bc00000.rtcpu: cannot reboot while suspended
[Fri May 8 04:25:06 2026] systemd-journald[259]: /dev/kmsg buffer overrun, some messages lost.
[Fri May 8 04:25:06 2026] tegra186-cam-rtcpu bc00000.rtcpu: Alert: Camera RTCPU gone bad! restoring it immediately!!
[Fri May 8 04:25:06 2026] tegra186-cam-rtcpu bc00000.rtcpu: cannot reboot while suspended

When the watchdog subsequently tries to reset the crashed RTCPU, it is blocked by the power management subsystem, leaving the cameras permanently locked up until a hard reboot.

To ensure our production architecture is fully robust on JetPack 5.1.3, we request your guidance on the following:

Driver Patch Validation: Can you confirm if our kernel fixes (relying on filp_close() for VFS teardown, utilizing get_task_struct for the kthreads, and the capture_data == NULL check) align with the intended architecture for JetPack 5.1.3? Are these changes safe for production?

RTCPU Soft Reset / RPM Deadlock: Is there a known patch for JetPack 5.1.3 that fixes this "cannot reboot while suspended" deadlock in camrtc-capture.c? We need the "Camera RTCPU gone bad!" recovery mechanism to successfully execute its soft reset even if the VI power domain has runtime-suspended.

Thank you for your time!

Hi JerryChang,

I am following up on this message. While waiting for your feedback, we attempted to test the provided firmware and encountered the following issues:

We attempted to flash the provided Orin firmware (camera-rtcpu-t234-rce.img from Topic293597_Jun05_rce-fw.7z) onto our Xavier NX (T194).

As expected, the architecture mismatch caused the RCE to fail to boot entirely. The kernel reported a zeroed SHA1 hash, indicating the firmware was rejected:

Existing Xavier NX (T194) firmware dmesg log:
[ 4.094588] tegra186-cam-rtcpu bc00000.rtcpu: firmware version cpu=rce cmd=6 sha1=571b1d9f5b93980c19e16eaf250cfa6deac6f03e

New Orin (T234) firmware dmesg log:
[ 20.097775] tegra186-cam-rtcpu bc00000.rtcpu: firmware version cpu=rce cmd=0 sha1=0000000000000000000000000000000000000000
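
This check can be automated after a flash. The sketch below parses the "firmware version" line shown in the two logs above and flags an all-zero SHA1. The helper name `parse_rtcpu_firmware` and the `sha1_zeroed` field are hypothetical, and treating a zeroed hash as "firmware did not load" is our interpretation of the logs, not documented behaviour.

```python
import re

# Pattern derived from the two dmesg lines quoted above.
FW_RE = re.compile(r"firmware version cpu=(\w+) cmd=(\d+) sha1=([0-9a-f]+)")


def parse_rtcpu_firmware(line):
    """Return the parsed firmware fields, or None if the line does not match."""
    match = FW_RE.search(line)
    if match is None:
        return None
    cpu, cmd, sha1 = match.groups()
    return {
        "cpu": cpu,
        "cmd": int(cmd),
        "sha1": sha1,
        # Assumption: an all-zero hash means the RCE firmware did not come up.
        "sha1_zeroed": set(sha1) == {"0"},
    }
```

A post-flash script could run this over the boot log and refuse to start the camera pipeline when `sha1_zeroed` is true.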

Because the RTCPU failed to load, attempting to stream from the cameras failed immediately, as expected. Please find the attached dmesg log:

dmesg_rce.txt (176.2 KB)

Could you please provide the correct camera-rtcpu-t194-rce.img compiled specifically for the Xavier NX architecture? Once we have the correct T194 binary, we will test it and report back.

Hi @Yugesh

Let’s handle this issue through another support channel.

Thanks