Hi NVIDIA Team,
We are currently running a large number of production kits at customer sites utilizing the tegra-video VI5 driver on a custom Xavier NX carrier board with 8x AR0234 cameras. Our current baseline is JetPack 5.1.3. (We briefly tested JetPack 5.1.6 and observed a similar issue. Also, migrating our entire application stack, dependencies, and customer deployments to 5.1.6 is not feasible at this stage).
We reproduced our customers' use case by injecting uncorrectable and correctable MIPI errors across all 8 cameras. This initially caused kernel panics and other issues, so we analyzed the root cause. We successfully patched vi5_fops.c to prevent the kernel panics, but we are now encountering a secondary issue where the VI subsystem locks up and requires a full system reboot to recover.
Please find attached the dmesg log and the vi5_fops.c patch we have applied:
dmesg_log_jp5_1_3.txt (62.7 KB)
kernel_vi5_fops_patch.txt (10.3 KB)
We would like you to review our driver fixes and provide guidance on the subsequent IVC bus lockup.
-
Driver Stability Fixes Applied (vi5_fops.c)
Before our patches, the rapid teardown collisions caused multiple kernel panics (Data Aborts and SLUB double-frees) inside the NVIDIA driver. We applied the following fixes to vi5_fops.c to ensure Linux kernel stability regardless of user-space behavior.
a. VFS Lifecycle & Double-Free Fix: In vi5_channel_error_recover and vi5_channel_stop_streaming, the driver was manually calling vi_channel_close_ex() and kfree(). We removed these manual calls. Instead, we now explicitly call filp_close(), which allows the kernel VFS's delayed_fput to trigger vi_channel_release safely in the background, ensuring memory is freed exactly once.
b. Kthread Use-After-Free Fix: If the capture kthread encountered an IVC timeout, it exited early. When GStreamer subsequently stopped the pipeline, kthread_stop() was called on a freed pointer, causing an Oops (96000004). We added get_task_struct() at thread creation and put_task_struct() after kthread_stop to preserve the task_struct memory safely.
c. DMA & VB2 Buffer Safety: We ensured chan->request[vi_port] = NULL is explicitly set after dma_free_coherent to prevent IOMMU double-frees. We also modified the dequeue drain loop to use vb2_buffer_done(…, VB2_BUF_STATE_ERROR) directly, rather than touching hardware registers during a reset state.
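To make the intent of fixes (b) and (c) easier to review, here is a minimal user-space sketch of the two idioms. It is a single-threaded model only, not the code from our patch: `struct worker`, `struct channel`, `NUM_PORTS`, and all function names are hypothetical stand-ins, and `free()` stands in for `dma_free_coherent()`. The reference count mimics what get_task_struct() / put_task_struct() do for a task_struct.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted kernel object such as task_struct. */
struct worker {
    int refs;    /* object is freed when this drops to 0 */
    int exited;  /* set when the worker returns early, e.g. on a timeout */
};

static int g_worker_frees;   /* free counters make a double free observable */
static int g_request_frees;

static struct worker *worker_create(void)
{
    struct worker *w = calloc(1, sizeof(*w));
    assert(w != NULL);
    w->refs = 1;             /* reference owned by the worker itself */
    return w;
}

/* Analogous to get_task_struct(): the creator pins the object. */
static void worker_get(struct worker *w)
{
    w->refs++;
}

/* Analogous to put_task_struct(): the last reference frees the object. */
static void worker_put(struct worker *w)
{
    assert(w->refs > 0);
    if (--w->refs == 0) {
        g_worker_frees++;
        free(w);
    }
}

/* The worker bails out early and drops its own reference. */
static void worker_exit_early(struct worker *w)
{
    w->exited = 1;
    worker_put(w);
}

/* Analogous to kthread_stop() followed by put_task_struct(). */
static int worker_stop(struct worker *w)
{
    int exited = w->exited;  /* safe: the creator's reference is still held */
    worker_put(w);
    return exited;
}

#define NUM_PORTS 2          /* hypothetical port count for the model */

struct channel {
    void *request[NUM_PORTS];
};

/* Free a request slot once; the NULL store turns repeat calls into no-ops. */
static void channel_free_request(struct channel *chan, int port)
{
    if (chan->request[port] == NULL)
        return;
    free(chan->request[port]);   /* stands in for dma_free_coherent() */
    g_request_frees++;
    chan->request[port] = NULL;
}
```

In this model, the creator calls worker_get() right after worker_create(), so an early worker_exit_early() leaves the object alive until worker_stop() drops the last reference. Likewise, calling channel_free_request() twice on the same port frees the buffer only once.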
-
The VI/IVC Lockup (Post-Patch)
With the kernel panics resolved, our 8-camera system easily survives continuous hard-fault error injections. However, the system eventually experiences the following unrecoverable failure after 1 to 2 hours:
May 6 16:08:27 tegra-ubuntu kernel: [32090.571427] nvmap_alloc_handle: PID 454392: gst-launch-1.0: WARNING: All NvMap Allocations must have a tag to identify the subsystem allocating memory.Please pass the tag to the API call NvRmMemHanldeAllocAttr() or relevant.
…
May 7 06:58:11 tegra-ubuntu kernel: [85474.052053] tegra194-vi5 15c10000.vi: IVC control submit failed
May 7 06:58:11 tegra-ubuntu kernel: [85474.052403] tegra-camrtc-capture-vi tegra-capture-vi: vi capture setup failed
May 7 06:58:11 tegra-ubuntu kernel: [85474.052842] tegra194-vi5 15c10000.vi: IVC control submit failed
May 7 06:58:11 tegra-ubuntu kernel: [85474.053011] tegra194-vi5 15c10000.vi: csi_stream_release: failed to disable nvcsi tpg on stream 0 virtual channel 1
Once this state is reached, all VI capture setup and release calls fail permanently. We are currently forced to perform a system reboot to restore camera functionality.
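As a stopgap, a user-space watchdog could at least detect this state from the kernel log before deciding to reboot. The following is a minimal sketch under that assumption: the signature strings are copied from the log excerpt above, while the function names and the idea of counting matches against a threshold are hypothetical and not part of any NVIDIA tooling.

```c
#include <stddef.h>
#include <string.h>

/* Signature substrings taken from the kernel log lines quoted above. */
static const char *const k_lockup_signatures[] = {
    "IVC control submit failed",
    "vi capture setup failed",
};

/* Returns 1 if the kernel log line contains one of the lockup signatures. */
static int is_vi_lockup_line(const char *line)
{
    size_t n = sizeof(k_lockup_signatures) / sizeof(k_lockup_signatures[0]);
    size_t i;

    for (i = 0; i < n; i++) {
        if (strstr(line, k_lockup_signatures[i]) != NULL)
            return 1;
    }
    return 0;
}

/* Counts how many of the given log lines match a lockup signature. */
static size_t count_vi_lockup_lines(const char *const *lines, size_t n)
{
    size_t count = 0;
    size_t i;

    for (i = 0; i < n; i++)
        count += (size_t)is_vi_lockup_line(lines[i]);
    return count;
}
```

A caller would feed it lines read from the kernel log and treat a count above some chosen threshold within a time window as "VI locked up". This only detects the condition; it does not recover from it.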
Questions for the NVIDIA Team:
To ensure our production architecture is fully robust on JetPack 5.1.3, we request your guidance on the following:
Driver Patch Validation: Can you confirm if our kernel fixes (relying on filp_close() for VFS teardown and utilizing get_task_struct for the kthreads) align with the intended architecture for JetPack 5.1.3? Are these changes safe for production?
RTCPU Soft Reset: When the IVC bus enters this -512 (ERESTARTSYS) or "IVC control submit failed" state, is there any debugfs node, sysfs trigger, or kernel API that allows Linux to initiate a soft reset of the camera coprocessor (RTCPU)? We would prefer a targeted reset mechanism rather than rebooting the entire Xavier NX.
Thank you for your time and expertise.